Senior Engineering Manager, AI Runtime

at Databricks

📍 Mountain View, California; San Francisco, California
Onsite
Posted April 02, 2026
Type Full-Time
Experience Level Lead
Years of Experience 8+ years
Education BS/MS in Computer Science, Electrical Engineering, or related technical field
Category Executive & General Management

Databricks is hiring a Senior Engineering Manager for AI Runtime (AIR) to lead the team responsible for both the product experience and the foundational GPU training infrastructure at scale.

  • Lead, mentor, and grow a high-performing engineering team
  • Define and own the AIR roadmap
  • Collaborate with product, research, platform, and customers
  • Build observability and reliability practices
  • Partner with recruiting to attract talent

Owns the AIR roadmap and GPU training infrastructure, and drives collaboration across product, research, platform, and customers to deliver scalable training capabilities.

The ideal candidate is an experienced engineering leader with 8+ years of software engineering and 3+ years in management, who has built and operated GPU-accelerated training infrastructure at scale and can drive product and research initiatives.

  • 8+ years of software engineering experience, with 3+ years in engineering management
  • Track record building and operating managed GPU training infrastructure at scale (100s/1000s of GPUs)
  • Deep familiarity with distributed training frameworks (PyTorch, DeepSpeed, Composer, Megatron-LM) and parallelism strategies (FSDP, tensor/pipeline parallelism)
  • Experience with training resilience patterns: checkpointing, elastic training, and automated failure recovery for long-running jobs
  • Understanding of GPU performance fundamentals, including NCCL, interconnect topologies, and memory optimization
  • Experience building platform products with clear SLAs where you've owned the customer experience, not just the backend
  • Strong cross-functional leadership across platform, product, and research teams
  • BS/MS in Computer Science, Electrical Engineering, or related technical field
8+ years software engineering; 3+ years in engineering management; distributed training frameworks (PyTorch, DeepSpeed, Composer, Megatron-LM); parallelism strategies (FSDP, tensor parallelism); training resilience (checkpointing, elastic training, failure recovery); GPU performance (NCCL); platform products with SLAs; BS/MS in CS or related
PyTorch, DeepSpeed, Composer, Megatron-LM, FSDP, tensor parallelism, pipeline parallelism, NCCL, GPU performance, checkpointing, elastic training, failure recovery, distributed training, Python
leadership, cross-functional collaboration, communication, stakeholder management, mentoring
Industry SaaS
Job Function Lead the AI Runtime engineering team to deliver scalable GPU training infrastructure and customer-focused capabilities
Role Subtype Engineering Manager
Tech Domains Python, SQL / PostgreSQL, Docker