✦ Luna Orbit — DevOps & SRE

Principal Site Reliability Engineer - AI Infrastructure Operations

at nSCALE

📍 Remote, US Remote 💰 $150K – $2150K USD / year Posted April 09, 2026
Salary $150K – $2150K USD / year
Type Full-Time
Experience lead
Exp. Years 10+ years
Category DevOps & SRE

The Principal Site Reliability Engineer provides technical leadership across AI Infrastructure Operations: designing large-scale control-plane systems, defining SLOs, and driving cross-functional reliability improvements in a remote-first environment.

  • Own reliability strategy
  • Design control-plane systems
  • Define SLO frameworks
  • Mentor engineers
  • Drive cross-functional reliability improvements

The role calls for expertise in Linux, networking, distributed systems, GPU/AI workloads, Kubernetes at scale, SLURM, InfiniBand/RDMA, and observability tooling, applied to improving availability, MTTR, and cost efficiency.

The ideal candidate is a Principal SRE with 10+ years of experience in large-scale systems, deep Linux, networking, and distributed systems expertise, and a track record of leading initiatives across teams in a remote environment. Experience with AI/HPC platforms, GPUs, and Kubernetes at scale is highly valued.

  • 10+ years of experience in Site Reliability Engineering, Systems Engineering, or Software Engineering roles
  • Expert-level software engineering skills with a track record of production-grade automation
  • Linux expertise
  • Networking and distributed systems design at scale
  • Ability to lead technical initiatives across teams without direct authority
  • Strong systems-thinking mindset

  • AI or HPC platforms experience (GPUs, InfiniBand/RDMA)
  • Workflow schedulers (e.g., SLURM)
  • Kubernetes at scale
  • Hybrid or bare-metal cloud architectures
  • Observability design for high-cardinality environments

Kubernetes, SLURM, InfiniBand, RDMA, Git, Terraform
SRE, Linux, Networking, Distributed systems, Observability, AI infrastructure, GPUs, InfiniBand/RDMA, SLURM
SRE, Linux, Networking, Distributed systems, AI / HPC platforms, GPUs, InfiniBand/RDMA, Workload schedulers (SLURM), Kubernetes at scale, Hybrid or bare-metal cloud architectures, Observability systems, High cardinality / high throughput environments
Systems-thinking, Influence without authority, Mentoring, Cross-functional collaboration, Communication
Industry AI & Machine Learning
Job Function Provide technical leadership for reliability across GPU/AI infrastructure platforms
Role Subtype Principal Site Reliability Engineer
Tech Domains Linux, Kubernetes, InfiniBand, RDMA, SLURM, Cloud infrastructure, Hybrid cloud, Monitoring
principal site reliability engineer, sre, ai infrastructure, gpu, infiniband, rdma, slurm, kubernetes, bare-metal cloud, hybrid cloud, observability, mttr, automation, control-plane, sre tooling, cloud costs, networking, linux, distributed systems

Lack of 10+ years of SRE experience, inadequate Kubernetes or distributed systems experience, inability to work in a remote-first environment

