Position Details
About this role
The Manager of Reliability Operations leads incident response, observability, and platform reliability for a multi-product hosting platform, and owns on-call rotations, post-incident reviews, and reliability strategy.
Key Responsibilities
- Own reliability operations & incident command
- Establish incident standards & escalation
- Lead major incident responses in a 24/7 environment
- Build/manage on-call rotations
- Translate incident trends into reliability improvements
Technical Overview
The role is hands-on with Linux, VMware, Ceph, and cloud platforms. It implements monitoring using Datadog/Prometheus/Grafana/New Relic, manages incident tooling (PagerDuty/Opsgenie), and drives capacity planning and lifecycle decisions.
Ideal Candidate
The ideal candidate is a senior platform engineering/DevOps leader with 7+ years in reliability operations, capable of owning incident command, improving observability, and leading cross-functional teams in a 24/7 production environment.
Deal Breakers
- Bachelor’s degree required
- 7+ years in systems operations or platform engineering
- Experience leading teams
- Experience in 24/7 production environments