Position Details
About this role
This role involves debugging and optimizing large-scale multi-tenant AI datacenter clusters, focusing on Kubernetes, Slurm, and GPU integration.
Key Responsibilities
- Debug multi-tenant clusters
- Prototype stack enhancements
- Collaborate on architecture reviews
- Create testbeds and automation
- Present at conferences and customer sites
Technical Overview
The technical environment includes Kubernetes internals, the Slurm workload manager, container runtimes, RDMA/InfiniBand (IB) fabric, and GPU accelerators. The work centers on debugging and system optimization.
Ideal Candidate
The ideal candidate is a senior software engineer with over 6 years of experience in Kubernetes internals, Slurm, and cloud-native stack debugging, particularly in GPU integration and large-scale systems.
Deal Breakers
- Less than 6 years of experience
- Lack of expertise in Kubernetes internals or Slurm
- No experience with GPU computing