Position Details
About this role
In this role, you will design and build scalable observability systems that monitor and diagnose complex AI infrastructure, keeping it reliable and operationally sound.
Key Responsibilities
- Design scalable telemetry pipelines
- Own observability platforms
- Build instrumentation libraries
- Drive alerting and SLO infrastructure
- Partner with teams for solutions
Technical Overview
The work centers on building telemetry pipelines for metrics, logs, traces, and error analytics, and on working with distributed systems across cloud platforms such as AWS, GCP, and Azure.
Ideal Candidate
The ideal candidate is a highly experienced software engineer with more than 10 years of experience building and maintaining large-scale observability and monitoring systems. They have deep expertise in metrics, logging, tracing, and error analytics, along with a strong background in distributed systems and cloud platforms.
Deal Breakers
- Less than 10 years of relevant experience
- Lack of experience with large-scale observability systems
- No familiarity with distributed systems or cloud platforms