Position Details
About this role
This role supports and automates high-performance AI infrastructure using modern DevOps practices. The engineer will manage NVIDIA DGX and Cisco UCS systems, ensuring scalability, reliability, and performance.
Key Responsibilities
- Automate AI platform pipelines
- Support NVIDIA DGX and Cisco UCS infrastructure
- Ensure system scalability and reliability
- Drive capacity planning
- Implement monitoring and fault tolerance
Technical Overview
The environment includes Linux-based HPC clusters, Kubernetes, Docker, Terraform, Ansible, Jenkins, and cloud infrastructure. The focus is on automation, system reliability, and infrastructure scaling.
Ideal Candidate
The ideal candidate is a mid-level site reliability engineer with 5+ years of experience in Linux, Kubernetes, and automation tools like Terraform and Ansible. They are proficient in scripting languages such as Python and Go and have experience supporting high-performance compute infrastructure.
Deal Breakers
- Lack of experience with Linux or Kubernetes
- No scripting experience in Python or Go
- Unfamiliarity with CI/CD pipelines
- No experience with high-performance compute infrastructure