Position Details

Salary $172K – $313K USD / year

Type Not specified

Experience Senior

Exp. Years Not specified

Education Not specified

Category AI & Machine Learning

About this role

This role involves designing and implementing scalable AI and ML infrastructure at NVIDIA, focusing on containerized environments, high-performance computing, and distributed systems to support AI-powered applications.

Key Responsibilities

Design and build containers for NIM runtimes
Develop tooling for build orchestration and CI/CD
Optimize container performance and scalability
Collaborate across teams for deployment
Mentor teammates

Technical Overview

The technical environment includes Kubernetes, GPU infrastructure, Python, containerization, and open-source ML stacks, aimed at building reliable, scalable AI platforms for inference and training.

Ideal Candidate

The ideal candidate is a senior AI/ML engineer with extensive experience in building scalable ML infrastructure, proficient in Kubernetes, GPU computing, and Python, with a strong understanding of distributed systems and AI platform development.

Must-Have Skills

Kubernetes, GPU infrastructure, Python, Distributed systems, ML model deployment, Model training, Inference, High performance platforms

Nice-to-Have Skills

vLLM, KubeRay, Open source ML stacks, Scalable ML systems, AI platform development

Tools & Platforms

Kubernetes, vLLM, KubeRay, Open source ML stacks, GPU infrastructure

Required Skills

Machine Learning, ML Infrastructure, Kubernetes, GPU, Python, Distributed Systems, Model Training, Model Deployment, Inference, Monitoring

Hard Skills

Machine Learning, ML Infrastructure, Kubernetes, GPU, vLLM, KubeRay, Python, Open source ML stacks, Model training, Model deployment, Inference, Monitoring, Distributed systems, High performance platforms

Soft Skills

Collaboration, Communication, Problem-solving, Architectural decision-making, Teamwork

Industry & Role

Industry SaaS

Job Function Developing scalable container and cloud infrastructure for AI/ML applications

Role Subtype AI & Machine Learning

Tech Domains Kubernetes, Python, GPU, Active Directory, Linux

Keywords for Your Resume

machine learning, ML infrastructure, Kubernetes, GPU, model training, model deployment, inference, monitoring, distributed systems, high performance platforms, Python, AI platform, Slack AI, ML lifecycle, scalable systems, large scale systems, AI-powered applications, ML pipelines, model serving, model inference, GPU infrastructure, ML deployment

Deal Breakers

Lack of experience with Kubernetes, No background in GPU infrastructure, Insufficient Python skills, No experience with distributed systems, Lack of AI/ML platform development experience

Senior/Staff Software Engineer- Machine Learning Infrastructure, Slack
