Position Details

Salary: $165K – $241K USD / year

Type: Not Specified

Experience: Mid-level

Exp. Years: 5+ years

Education: Bachelor's degree in Computer Science, Information Technology, or related field

Category: DevOps & SRE

About this role

This role involves supporting and automating high-performance AI infrastructure using modern DevOps practices. The engineer will manage NVIDIA DGX and Cisco-UCS systems, ensuring scalability, reliability, and performance.

Key Responsibilities

Automate AI platform pipelines
Support NVIDIA DGX and Cisco-UCS infrastructure
Ensure system scalability and reliability
Drive capacity planning
Implement monitoring and fault-tolerance

Technical Overview

The environment includes Linux-based HPC clusters, Kubernetes, Docker, Terraform, Ansible, Jenkins, and cloud infrastructure. The focus is on automation, system reliability, and infrastructure scaling.

Ideal Candidate

The ideal candidate is a mid-level site reliability engineer with 5+ years of experience in Linux, Kubernetes, and automation tools like Terraform and Ansible. They are proficient in scripting languages such as Python and Go and have experience supporting high-performance compute infrastructure.

Must-Have Skills

Linux, Python, Go, Kubernetes, Terraform, Ansible, Jenkins, CI/CD

Nice-to-Have Skills

Hybrid cloud, Virtualization, Cloud infrastructure, Jira (or similar)

Tools & Platforms

NVIDIA DGX, Cisco-UCS, Kubernetes, Docker, Terraform, Ansible, Jenkins, Git

Required Skills

Python, Go, C/C++, Linux, Kubernetes, Docker, Terraform, Ansible, Jenkins, Git, CI/CD, Cloud computing, Virtualization, Monitoring, Capacity planning

Hard Skills

Python, Go, C/C++, Linux, Kubernetes, Docker, Terraform, Ansible, Jenkins, Git, CI/CD, Cloud computing, Virtualization, Monitoring, Capacity planning

Soft Skills

Automation, Problem-solving, Collaboration, Performance analysis, Troubleshooting

Certifications

Preferred

Linux certifications, Cloud certifications

Industry & Role

Industry: Information Technology / Cloud & Infrastructure

Job Function: Support and automate AI infrastructure for high-performance compute systems

Role Subtype: Site Reliability Engineer

Tech Domains: Linux, Kubernetes, Docker, Terraform, Ansible, Jenkins, Cloud computing

Keywords for Your Resume

python, go, c/c++, linux, kubernetes, docker, terraform, ansible, jenkins, git, ci/cd, cloud computing, virtualization, monitoring, capacity planning, site reliability engineer, sre, hybrid cloud, automation

Deal Breakers

Lack of experience with Linux or Kubernetes
No scripting experience in Python or Go
Unfamiliarity with CI/CD pipelines
No experience with high-performance compute infrastructure

AI Infrastructure Site Reliability Engineer (remote USA)