Position Details
About this role
The Principal Site Reliability Engineer provides technical leadership across AI Infrastructure Operations: designing large-scale control-plane systems, defining SLOs, and driving cross-functional reliability improvements in a remote-first environment.
Key Responsibilities
- Own reliability strategy
- Design control-plane systems
- Define SLO frameworks
- Mentor engineers
- Drive cross-functional reliability improvements
Technical Overview
The role requires expertise in Linux, networking, distributed systems, GPU/AI workloads, Kubernetes at scale, SLURM, InfiniBand/RDMA, and observability tooling, applied to improve availability, reduce MTTR, and increase cost efficiency.
Ideal Candidate
The ideal candidate is a Principal SRE with 10+ years of experience in large-scale systems, deep expertise in Linux, networking, and distributed systems, and a track record of leading cross-team initiatives in a remote environment. Experience with AI/HPC platforms, GPUs, and Kubernetes at scale is highly valued.
Must-Have Skills
Nice-to-Have Skills
Tools & Platforms
Required Skills
Hard Skills
Soft Skills
Industry & Role
Keywords for Your Resume
Deal Breakers
- Fewer than 10 years of SRE experience
- Inadequate Kubernetes or distributed systems experience
- Inability to work in a remote-first environment