Position Details
About this role
The Principal Site Reliability Engineer provides technical leadership for AI infrastructure operations, focusing on reliability strategy, large-scale control-plane systems, and cross-team improvements.
Key Responsibilities
- Own reliability strategy
- Design control-plane systems
- Define SLOs and best practices
- Act as escalation point
- Mentor engineers
Technical Overview
The role centers on building production-grade automation and reliability tooling for GPU/AI infrastructure, with a strong focus on SLOs, MTTR, cost efficiency, and Kubernetes at scale.
Ideal Candidate
A highly experienced SRE leader (10+ years) with deep expertise in Linux, networking, distributed systems, and Kubernetes at scale, and a strong focus on reliability, observability, and cross-team collaboration in AI/HPC environments.
Must-Have Skills
None listed