Position Details

Salary $150K – $2150K USD / year

Type Full-Time

Experience lead

Exp. Years 10+ years

Education Not Specified

Category DevOps & SRE

About this role

The Principal Site Reliability Engineer provides technical leadership for AI infrastructure operations, focusing on reliability strategy, large-scale control-plane systems, and cross-team improvements.

Key Responsibilities

Own reliability strategy
Design control-plane systems
Define SLOs and best practices
Act as escalation point
Mentor engineers

Technical Overview

The role centers on building production-grade automation and reliability tooling for GPU/AI infrastructure, with particular attention to SLOs, MTTR, cost efficiency, and Kubernetes at scale.

Ideal Candidate

A highly experienced SRE leader (10+ years) with deep expertise in Linux, networking, distributed systems, and Kubernetes at scale. Strong focus on reliability, observability, and cross-team collaboration in AI/HPC environments.

Must-Have Skills

None listed

Nice-to-Have Skills

AI or HPC platforms, GPUs, InfiniBand/RDMA, Step Functions, Serverless, Kubernetes at scale

Tools & Platforms

Kubernetes, InfiniBand, RDMA, SLURM, Docker

Required Skills

SRE, Linux, Networking, distributed systems, Kubernetes, observability, control-plane systems, automation frameworks, SLURM, InfiniBand, RDMA, GPU

Hard Skills

SRE, Site Reliability Engineer, Linux, Networking, Distributed systems, Control-plane systems, Automation frameworks, Observability systems, Kubernetes at scale, Hybrid or bare-metal cloud, Workload schedulers (SLURM)

Soft Skills

systems-thinking, leadership, mentoring, cross-functional collaboration

Industry & Role

Industry Technology

Job Function Lead reliability and automation efforts across AI/HPC infrastructure, ensuring scalable and cost-efficient operations

Role Subtype Principal Site Reliability Engineer

Tech Domains Linux, Kubernetes, InfiniBand, RDMA, SLURM, GPU, Cloud Infrastructure, Observability

Keywords for Your Resume

Principal Site Reliability Engineer, SRE, AI infrastructure, large-scale control-plane systems, automation frameworks, operational tooling, reliability standards, SLO frameworks, observability systems, Kubernetes at scale, hybrid or bare-metal cloud, workload schedulers, SLURM, InfiniBand, RDMA, GPU, Linux, distributed systems, incidents, cost efficiency, MTTR, remote-first, Site Reliability Engineer, Kubernetes, observability, automation

Principal Site Reliability Engineer - AI Infrastructure Operations