About this role
Senior Engineering Manager for Databricks AI Runtime (AIR), leading the team responsible for both the product experience and the foundational GPU training infrastructure at scale.
Key Responsibilities
- Lead, mentor, and grow a high-performing engineering team
- Define and own the AIR roadmap
- Collaborate with product, research, platform, and customers
- Build observability and reliability practices
- Partner with recruiting to attract talent
Technical Overview
Owns the AIR roadmap and the GPU training infrastructure, and drives collaboration across product, research, platform, and customers to deliver scalable training capabilities.
Ideal Candidate
The ideal candidate is an experienced engineering leader with 8+ years of software engineering and 3+ years in management, who has built and operated GPU-accelerated training infrastructure at scale and can drive product and research initiatives.
Must-Have Skills
- 8+ years of software engineering experience, with 3+ years in engineering management
- Track record building and operating managed GPU training infrastructure at scale (100s/1000s GPUs)
- Deep familiarity with distributed training frameworks (PyTorch, DeepSpeed, Composer, Megatron-LM) and parallelism strategies (FSDP, tensor/pipeline parallelism)
- Experience with training resilience patterns: checkpointing, elastic training, and automated failure recovery for long-running jobs
- Understanding of GPU performance fundamentals including NCCL, interconnect topologies, and memory optimization
- Experience building platform products with clear SLAs where you've owned the customer experience, not just the backend
- Strong cross-functional leadership across platform, product, and research teams
- BS/MS in Computer Science, Electrical Engineering, or related technical field
Required Skills
8+ years software engineering; 3+ years in engineering management; distributed training frameworks (PyTorch, DeepSpeed, Composer, Megatron-LM); parallelism strategies (FSDP, tensor parallelism); training resilience (checkpointing, elastic training, failure recovery); GPU performance (NCCL); platform products with SLAs; BS/MS in CS or related
Hard Skills
PyTorch, DeepSpeed, Composer, Megatron-LM, FSDP, tensor parallelism, pipeline parallelism, NCCL, GPU performance, checkpointing, elastic training, failure recovery, distributed training, Python
Soft Skills
leadership, cross-functional collaboration, communication, stakeholder management, mentoring
Keywords for Your Resume
Senior Engineering Manager, Engineering Manager, Databricks, AI Runtime, AIR, GPU training, PyTorch, DeepSpeed, Composer, Megatron-LM, FSDP, tensor parallelism, pipeline parallelism, checkpointing, elastic training, failure recovery, NCCL, distributed training, Python, BS/MS, leading teams, observability, SLA