Position Details

Salary $110K – $150K USD / year

Type Full-Time

Experience lead

Exp. Years 7+ years

Education Bachelor’s degree in Computer Science, Engineering, or related field

Category DevOps & SRE

About this role

The Manager of Reliability Operations leads incident response, observability, and platform reliability for a multi-product hosting platform. The role owns on-call rotations, post-incident reviews, and reliability strategy.

Key Responsibilities

Own reliability operations & incident command
Establish incident standards & escalation
Lead major incident responses in a 24/7 environment
Build/manage on-call rotations
Translate incident trends into reliability improvements

Technical Overview

The role is hands-on with Linux, VMware, Ceph, and cloud platforms. It implements monitoring using Datadog/Prometheus/Grafana/New Relic, manages incident tooling (PagerDuty/Opsgenie), and drives capacity planning and lifecycle decisions.

Ideal Candidate

The ideal candidate is a senior platform engineering/DevOps leader with 7+ years in reliability operations, capable of owning incident command, improving observability, and leading cross-functional teams in a 24/7 production environment.

Must-Have Skills

Bachelor’s degree in Computer Science, Engineering, or a related field
7+ years of experience in systems operations, site reliability, or platform engineering
2+ years of experience leading teams or major operational functions
Proven experience managing incidents in a 24/7 production environment
Strong background in troubleshooting, root cause analysis, and operational improvement
Experience with change management practices

Nice-to-Have Skills

Background in managed hosting, cloud infrastructure, or SaaS environments
Experience defining and tracking system reliability and performance targets
ITIL familiarity
VMware, Ceph, Linux, and Windows platforms

Tools & Platforms

Datadog, Prometheus, Grafana, New Relic, PagerDuty, Opsgenie, VMware, Ceph

Required Skills

Platform engineering, Site reliability, Incident management, Change management, Observability, Datadog, Prometheus, Grafana, New Relic, PagerDuty, Opsgenie

Hard Skills

Datadog, Prometheus, Grafana, New Relic, PagerDuty, Opsgenie, Linux, VMware, Ceph, cloud platforms, centralized logging, metrics, tracing, incident management

Soft Skills

Strong communication, leadership, ability to perform under pressure, collaboration, problem solving

Certifications

Preferred

AWS, RHCE

Industry & Role

Industry Cloud & Infrastructure

Job Function Oversee reliability operations and incident management to improve platform resilience.

Role Subtype Platform Engineer

Tech Domains Linux, VMware, Ceph, Cloud Platforms, Datadog, Prometheus, Grafana, New Relic

Keywords for Your Resume

Manager of Reliability Operations, Reliability Operations, Incident Command, Post-incident reviews, Observability, Datadog, Prometheus, Grafana, New Relic, PagerDuty, Opsgenie, Linux, VMware, Ceph, cloud platforms, centralized logging, metrics, tracing, Site Reliability Engineer, Platform Engineer, reliability operations, incident command, observability, datadog, prometheus, grafana, pagerduty, opsgenie, linux, vmware

Deal Breakers

Bachelor’s degree required
7+ years in systems operations or platform engineering
Experience leading teams
Experience in 24/7 production environments
