Position Details
About this role
This role involves building and maintaining a high-availability AI Data Center platform, focusing on telemetry ingestion, automation, and reliability engineering.
Key Responsibilities
- Monitor platform health
- Own Kubernetes deployments
- Lead incident triage
- Build runbooks and SOPs
- Manage deployment infrastructure
Technical Overview
The environment includes Kubernetes, Terraform, and Helm, with scripting in Python and Bash and a focus on observability, incident management, and platform automation.
Ideal Candidate
The ideal candidate is a senior DevOps engineer with more than 5 years of experience managing production distributed systems and deep expertise in Kubernetes, infrastructure automation, and observability tools. They should be proactive in incident management and in continuously improving platform reliability.
Must-Have Skills
Nice-to-Have Skills
Tools & Platforms
Required Skills
Hard Skills
Soft Skills
Industry & Role
Deal Breakers
- Less than 5 years of relevant experience
- Lack of Kubernetes or container experience
- No scripting or automation skills
- No experience with infrastructure-as-code tools