Manager of Reliability Operations

at Nexcess

📍 Remote, US | 💰 $110K – $150K USD / year | Posted April 07, 2026
Salary: $110K – $150K USD / year
Type: Full-Time
Experience: Lead
Exp. Years: 7+ years
Education: Bachelor’s degree in Computer Science, Engineering, or related field
Category: DevOps & SRE

The Manager of Reliability Operations leads incident response, observability, and platform reliability for a multi-product hosting platform, and owns on-call rotations, post-incident reviews, and reliability strategy.

  • Own reliability operations & incident command
  • Establish incident standards & escalation
  • Lead major incident responses in a 24/7 environment
  • Build/manage on-call rotations
  • Translate incident trends into reliability improvements

The role is hands-on with Linux, VMware, Ceph, and cloud platforms. It implements monitoring with Datadog, Prometheus, Grafana, and New Relic, manages incident tooling (PagerDuty, Opsgenie), and drives capacity planning and lifecycle decisions.

The ideal candidate is a senior platform engineering/DevOps leader with 7+ years in reliability operations, capable of owning incident command, improving observability, and leading cross-functional teams in a 24/7 production environment.

  • Bachelor’s degree in Computer Science, Engineering, or a related field
  • 7+ years of experience in systems operations, site reliability, or platform engineering
  • 2+ years of experience leading teams or major operational functions
  • Proven experience managing incidents in a 24/7 production environment
  • Strong background in troubleshooting, root cause analysis, and operational improvement
  • Experience with change management practices

  • Background in managed hosting, cloud infrastructure, or SaaS environments
  • Experience defining and tracking system reliability and performance targets
  • ITIL familiarity
  • VMware, Ceph, Linux, and Windows platforms
Datadog, Prometheus, Grafana, New Relic, PagerDuty, Opsgenie, VMware, Ceph
Platform engineering, Site reliability, Incident management, Change management, Observability, Datadog, Prometheus, Grafana, New Relic, PagerDuty, Opsgenie
Datadog, Prometheus, Grafana, New Relic, PagerDuty, Opsgenie, Linux, VMware, Ceph, cloud platforms, centralized logging, metrics, tracing, incident management
Strong communication, leadership, ability to perform under pressure, collaboration, problem solving

Preferred

AWS, RHCE
Industry: Cloud & Infrastructure
Job Function: Oversee reliability operations and incident management to improve platform resilience.
Role Subtype: Platform Engineer
Tech Domains: Linux, VMware, Ceph, Cloud Platforms, Datadog, Prometheus, Grafana, New Relic
Manager of Reliability Operations, Reliability Operations, Incident Command, Post-incident reviews, Observability, Datadog, Prometheus, Grafana, New Relic, PagerDuty, Opsgenie, Linux, VMware, Ceph, cloud platforms, centralized logging, metrics, tracing, Site Reliability Engineer, Platform Engineer

Bachelor’s degree required; 7+ years in systems operations or platform engineering; experience leading teams; experience in 24/7 production environments.
