Service Management Engineer (m/f/d) - full-time

Munich

About us

Do you love stories? If so, please keep reading, because we certainly do. We believe the ability to tell stories is what makes us human. Joyn is your streaming app with over 50 live TV channels, exclusive previews, originals and collections. We understand Joyn as a partnership – an invitation to content providers and users alike to make entertainment more meaningful and fun. Our app aggregates global and especially local content in a relevant way for Germany, both live TV and on-demand content. All kinds of stories and more to come, every day.

We hire the best, because we need people who are as customer-focused as we are. We are looking for champions to help us further connect with our audience. It’s not a small or easy task, but it’s a fun and rewarding one. Do you think you’re up for it? Great. Then send us your application!

About the Job

We are looking for a Service Management Engineer to help establish and continuously improve standards and processes around service quality and incident handling for the next-generation streaming platform for the German market. Together with the SRE and SecOps teams and the engineering teams, you will develop a blameless culture and a willingness to detect and mitigate risks in our products and services. Our mission is to offer our clients, Joyn customers, and engineering teams a smooth experience based on service KPIs.

You will work cross-functionally with our 20+ autonomous engineering teams and will be the advocate for service quality excellence. Together with the teams, you will drive incident management processes such as postmortems, SLA checks, incident reports, and many more, and you will identify areas for improvement and possible risks.

Some of your allies: Your team (SRE, SecOps and Service Management), Dynatrace, Prometheus, Opsgenie, Grafana, Incident Management Process, High Availability Systems Best Practices, SLA, SLO and SLI Concepts, AWS CloudWatch Metrics and Logs, ELK stack, Cost Monitoring, Data Analytics Frameworks and many more.

What You Will Do

  • Design and improve our incident management lifecycle to identify, mitigate, and learn from incidents.
  • Define the service level metrics (SLA, SLO, and SLI concepts) for our services together with the engineering teams.
  • Develop and build tools to measure the performance of our systems based on the service level agreements and KPIs.
  • Create dashboards for the engineering teams and the management team that cover all the important metrics, such as customer and service KPIs and cost optimization.
  • Evaluate the risks and the quality of our services and suggest solutions in terms of reliability, scalability, and observability.
  • Help leadership and the engineering teams align on the on-call model.
  • Practice sustainable incident management in a blameless culture; coordinate the analysis, troubleshooting, and resolution of system issues.
  • Present reports and recommendations based on our observability tools, dashboards, and the results of your work to the engineering and management teams.
  • Develop documentation and training to onboard and level up the incident management knowledge and blameless culture within the organization.

How You Will Do It

  • You enjoy solving difficult technical problems in the team.
  • You design and improve the incident management lifecycle and present it to the engineering teams.
  • Together with your team, you support the engineering teams by offering guidance and solutions for production releases, observability and reliability of their services.
  • Together with the engineering teams, you will drive incident management processes such as postmortems, follow-up meetings, risk analysis, incident reports, and many more.
  • You will develop tools to measure the service level agreements, client KPIs, and service KPIs related to reliability and availability, cost monitoring, and many more.
  • We would like you to take ownership of the tools that you build and work with your colleagues to deliver a reliable, monitored, and highly available solution.
  • You learn from both success and failure, actively coach, and get coached by the team.

What We Are Looking For

  • Strong knowledge of ITIL and/or Service Management methodologies.
  • Experience in DevOps, Site Reliability Engineering, or development roles.
  • Understanding of monitoring systems such as Prometheus, Dynatrace, and the ELK stack.
  • Experience with a cloud provider (AWS preferred) or distributed systems.
  • Scripting knowledge in any modern language such as Python, Golang, Java, Node.js, or Ruby is a plus.
  • Good understanding of the software development life cycle process.
  • Client-facing skills to build good relationships with the business and other technology groups through day-to-day support and project work.
  • Solid analytical and problem-solving skills with an appreciation of technical risks.
  • University degree in computer science, information technology, media engineering, or equivalent.
  • Good written and verbal communication skills - English is our team language.