✦ Luna Orbit — DevOps & SRE

Principal Site Reliability Engineer - AI Infrastructure Operations

at nSCALE

📍 Remote, US Remote 💰 $150K – $2150K USD / year Posted April 09, 2026
Salary: $150K – $2150K USD / year
Type: Full-Time
Experience: Lead
Exp. Years: 10+ years
Education: Not Specified
Category: DevOps & SRE

nSCALE is seeking a Principal Site Reliability Engineer to provide technical leadership for AI infrastructure operations, with a focus on reliability strategy, large-scale control-plane systems, and cross-team improvements.

  • Own reliability strategy
  • Design control-plane systems
  • Define SLOs and best practices
  • Act as escalation point
  • Mentor engineers

The role centers on building production-grade automation and reliability tooling for GPU/AI infrastructure, with particular attention to SLOs, MTTR, cost efficiency, and Kubernetes at scale.

The ideal candidate is a highly experienced SRE leader (10+ years) with deep expertise in Linux, networking, distributed systems, and Kubernetes at scale, and a strong focus on reliability, observability, and cross-team collaboration in AI/HPC environments.

None listed

AI or HPC platforms, GPUs, InfiniBand/RDMA, Step Functions, Serverless, Kubernetes at scale
Kubernetes, InfiniBand, RDMA, SLURM, Docker
SRE, Linux, Networking, distributed systems, Kubernetes, observability, control-plane systems, automation frameworks, SLURM, InfiniBand, RDMA, GPU
SRE, Site Reliability Engineer, Linux, Networking, Distributed systems, Control-plane systems, Automation frameworks, Observability systems, Kubernetes at scale, Hybrid or bare-metal cloud, Workload schedulers (SLURM)
systems-thinking, leadership, mentoring, cross-functional collaboration
Industry: Technology
Job Function: Lead reliability and automation efforts across AI/HPC infrastructure, ensuring scalable and cost-efficient operations
Role Subtype: Principal Site Reliability Engineer
Tech Domains: Linux, Kubernetes, InfiniBand, RDMA, SLURM, GPU, Cloud Infrastructure, Observability
Principal Site Reliability Engineer, SRE, AI infrastructure, large-scale control-plane systems, automation frameworks, operational tooling, reliability standards, SLO frameworks, observability systems, Kubernetes at scale, hybrid or bare-metal cloud, workload schedulers, SLURM, InfiniBand, RDMA, GPU, Linux, distributed systems, incidents, cost efficiency, MTTR, remote-first, Site Reliability Engineer, Kubernetes, observability, automation