
Senior DGX Cloud AI Infrastructure Software Engineer

at Nvidia

📍 5 Locations · Unknown · Posted March 13, 2026
Type: Not specified
Experience: Senior
Years of experience: 8+
Education: Bachelor's degree or higher in Computer Science or a related technical field
Category: AI & Machine Learning

This role involves developing and maintaining AI infrastructure software for large-scale AI workloads, focusing on efficiency, resiliency, and system reliability.

  • Develop infrastructure software for AI workloads
  • Optimize tools for efficiency
  • Design APIs for resiliency
  • Enhance AI platform reliability
  • Analyze failures from hardware to application

The technical environment includes distributed systems, observability tools such as ELK, Prometheus, and Loki, and programming in Python and C/C++.

The ideal candidate is a senior software engineer with over 8 years of experience in developing infrastructure for large-scale AI systems. They possess strong skills in distributed systems, observability tools, and software engineering best practices, with a focus on scalability and resiliency.

  • 8+ years of experience in developing software infrastructure for large-scale AI systems
  • Proficiency in Python and C/C++
  • Experience with observability platforms (ELK, Prometheus, Loki)
  • Building and scaling large-scale distributed systems
  • Experience with AI training and inferencing infrastructure
  • Experience with software testing practices, version control, CI/CD pipelines, risk management, and blameless postmortems
ELK, Prometheus, Loki, APIs
Python, C/C++, ELK, Prometheus, Loki, distributed systems, APIs, observability platforms, software infrastructure, problem-solving
Python, C/C++, ELK (Elasticsearch, Logstash, Kibana), Prometheus, Loki, distributed systems, software infrastructure, APIs, large-scale AI systems, observability platforms
Problem-solving, root cause analysis, collaboration, communication, analytical thinking
Industry: Technology
Job Function: AI infrastructure software engineering
AI infrastructure, large-scale AI systems, Python, C/C++, ELK, Prometheus, Loki, distributed systems, APIs, observability platforms, software engineering, root cause analysis, problem-solving, scalability, resiliency

  • Less than 8 years of experience in AI infrastructure
  • Lack of experience with distributed systems or observability platforms
  • No proficiency in Python or C/C++
