Position Details

Type Not Specified

Experience Senior

Exp. Years 5+ years

Education BS/MS in CS/CE or equivalent experience

Category DevOps & SRE

About this role

This role involves building and maintaining a high-availability AI Data Center platform, focusing on telemetry ingestion, automation, and reliability engineering.

Key Responsibilities

Monitor platform health
Own Kubernetes deployments
Lead incident triage
Build runbooks and SOPs
Manage deployment infrastructure

Technical Overview

The environment includes Kubernetes, Terraform, and Helm, with scripting in Python and Bash and a focus on observability, incident management, and platform automation.

Ideal Candidate

The ideal candidate is a senior DevOps engineer with over 5 years of experience managing production distributed systems, with deep expertise in Kubernetes, infrastructure automation, and observability tools. They should be proactive in incident management and continuous improvement of platform reliability.

Must-Have Skills

5+ years operating production distributed systems
Kubernetes + containers experience
SLOs/SLIs ownership
Scripting (Python/Bash)
Terraform + Helm
Incident triage
Monitoring and observability
Reliability engineering

Nice-to-Have Skills

Experience with GPU telemetry
Experience with AI Data Center platforms
Canary deployments
Post-deployment validation

Tools & Platforms

Kubernetes
Terraform
Helm
Logs/metrics dashboards

Required Skills

Kubernetes
Terraform
Helm
Python
Bash
CI/CD
Monitoring
Telemetry
Incident Response
Reliability

Hard Skills

Kubernetes (K8s)
Docker
Python
Bash
Terraform
Helm
CI/CD
Infrastructure as Code (IaC)
Monitoring
Logging
Telemetry
Incident Response
Postmortems

Soft Skills

Communication
Problem-solving
Teamwork
Automation mindset
Documentation skills

Industry & Role

Industry Technology

Job Function Operate and maintain a scalable, reliable AI Data Center platform

Keywords for Your Resume

DevOps Engineer
Kubernetes (K8s)
Terraform
Helm
CI/CD
Python
Bash
Infrastructure as Code
Monitoring
Telemetry
Incident Response
Postmortems
Reliability
SLOs
SLIs

Deal Breakers

Less than 5 years of relevant experience
Lack of Kubernetes or container experience
No scripting or automation skills
No experience with infrastructure as code tools

Senior DevOps Engineer, AIOPs