Position Details
About this role
This role involves developing and optimizing distributed systems for AI workloads, focusing on performance, scalability, and cloud integration within NVIDIA's DGX Cloud platform.
Key Responsibilities
- Drive performance and scalability testing
- Collaborate with AI researchers and developers
- Debug and optimize Kubernetes clusters
- Develop monitoring tools
- Engage with open-source communities
Technical Overview
The technical environment includes Kubernetes and other open-source tools, cloud platforms such as GCP, AWS, Azure, and OCI, and programming in Golang and Python, with a focus on high-performance distributed systems.
Ideal Candidate
The ideal candidate is a senior systems software engineer with at least 8 years of experience in distributed systems, open-source technologies like Kubernetes, and performance optimization. They possess deep expertise in cloud platforms and programming in Golang and Python, with a strong background in scaling AI infrastructure.
Deal Breakers
- Less than 8 years of experience
- Lack of expertise in Kubernetes
- No experience with large-scale distributed systems
- Proficiency only in non-relevant programming languages
- Location outside Santa Clara, CA without remote options