About this role
This role involves driving performance analysis, optimization, and modeling for NVIDIA's DGX Cloud AI infrastructure, focusing on large-scale parallel and distributed systems.
Key Responsibilities
- Develop benchmarks and applications
- Analyze performance bottlenecks
- Optimize workloads
- Collaborate with cross-functional teams
- Develop modeling frameworks
Technical Overview
The technical environment includes high-performance AI workloads, large-scale parallel and distributed systems, AI frameworks like PyTorch and TensorFlow, and cloud platforms such as GCP, AWS, Azure, and OCI.
Ideal Candidate
The ideal candidate is a senior AI and machine learning engineer with extensive experience in large-scale parallel and distributed systems, performance optimization, and AI workloads. They have a strong background in computer architecture, networking, and AI frameworks, with at least 10 years of relevant experience.
Must-Have Skills
- Expertise in large-scale parallel and distributed accelerator-based systems
- Optimizing performance and AI workloads
- Performance modeling and benchmarking
- Strong background in computer architecture, networking, and storage systems
- Experience with AI frameworks (PyTorch, TensorFlow, JAX, Megatron-LM, TensorRT-LLM, vLLM)
- Experience with AI/ML models and workloads, including LLMs and DNNs
- 10 years of experience in relevant areas
- Proficiency in Python and C/C++
- Experience with public cloud infrastructure (GCP, AWS, Azure, OCI)
Nice-to-Have Skills
- Experience with performance optimization
- Experience with cloud infrastructure design
- Modeling frameworks
- Total Cost of Ownership analysis
Tools & Platforms
PyTorch, TensorFlow, JAX, Megatron-LM, TensorRT-LLM, vLLM, GCP, AWS, Azure, OCI
Required Skills
Parallel and Distributed Systems, Performance analysis, Performance modeling, Benchmarking, Computer Architecture, Networking, Storage systems, Accelerators, PyTorch, TensorFlow, JAX, Megatron-LM, TensorRT-LLM, vLLM, AI/ML models, Large Language Models, Deep Neural Networks, Python, C/C++, GCP, AWS, Azure, OCI
Hard Skills
Parallel and Distributed Systems, Performance analysis, Performance modeling, Benchmarking, Computer Architecture, Networking, Storage systems, Accelerators, PyTorch, TensorFlow, JAX, Megatron-LM, TensorRT-LLM, vLLM, AI/ML models, Large Language Models, Deep Neural Networks, Python, C/C++, Public Cloud Infrastructure, Google Cloud Platform, Amazon Web Services, Azure, Oracle Cloud Infrastructure
Soft Skills
Collaboration, communication, problem-solving, analytical thinking, teamwork
Keywords for Your Resume
Parallel and Distributed Systems, Performance analysis, Performance modeling, Benchmarking, Computer Architecture, Networking, Storage systems, Accelerators, PyTorch, TensorFlow, JAX, Megatron-LM, TensorRT-LLM, vLLM, AI/ML models, Large Language Models, Deep Neural Networks, Python, C/C++, GCP, AWS, Azure, OCI, AI workloads, Cloud Infrastructure
Deal Breakers
- Fewer than 10 years of experience
- Lack of experience with large-scale parallel systems
- No proficiency in Python or C/C++
- No experience with AI frameworks or cloud infrastructure