✦ Luna Orbit — AI & Machine Learning

Principal ML Engineer - Large Scale Training Performance Optimization

at Advanced Micro Devices

📍 San Jose, California, United States · Posted March 25, 2026
Type Full-Time
Experience Mid-level
Exp. Years Not specified
Education Master's degree or PhD in Computer Science, Artificial Intelligence, Machine Learning, or a related field
Category AI & Machine Learning

This role focuses on training large AI models efficiently across multiple GPUs, improving pipeline performance, and contributing to open source AI frameworks.

  • Train large AI models efficiently across multiple GPUs
  • Optimize distributed training pipelines for performance
  • Contribute to open-source AI frameworks
  • Collaborate across teams
  • Stay current with advances in training algorithms

The environment involves distributed training pipelines, ML frameworks like PyTorch, TensorFlow, JAX, GPU kernel optimization, and large-scale AI model training.
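
To illustrate the core idea behind the data-parallel training this environment involves, here is a minimal, framework-free Python sketch (all function names and data are illustrative, not from the posting): each worker computes gradients on its own shard of the batch, the gradients are averaged (conceptually an all-reduce), and every replica applies the same update.

```python
# Illustrative sketch of synchronous data-parallel SGD on a toy model
# y = w * x with squared-error loss. Real pipelines use a collective
# communication library (e.g. NCCL via PyTorch) instead of a Python sum.

def local_gradient(w, shard):
    """Mean gradient of (w*x - y)^2 over one worker's data shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(grads):
    """Stand-in for the collective that averages gradients across workers."""
    return sum(grads) / len(grads)

def data_parallel_step(w, shards, lr):
    """One synchronous data-parallel SGD step over equal-sized shards."""
    grads = [local_gradient(w, shard) for shard in shards]
    return w - lr * all_reduce_mean(grads)

if __name__ == "__main__":
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # y = 2x
    shards = [data[:2], data[2:]]  # two simulated workers
    w = 0.0
    for _ in range(50):
        w = data_parallel_step(w, shards, lr=0.02)
    print(round(w, 4))  # converges toward the true weight, 2.0
```

With equal-sized shards this step is numerically identical to a single-worker step on the full batch, which is the property that makes synchronous data parallelism exact rather than approximate.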

The ideal candidate is a highly skilled ML engineer with deep expertise in distributed training of large models, proficiency in frameworks such as PyTorch, TensorFlow, or JAX, and experience with GPU optimization and large-scale AI research.

Distributed training pipelines, ML frameworks (PyTorch, JAX, TensorFlow), distributed training algorithms, Python or C++ programming, experience with large models, GPU optimization
LLMs, computer vision, GPU kernel optimization, open-source contributions
PyTorch, JAX, TensorFlow, Megatron-LM, MaxText, TorchTitan
Distributed training pipelines, data parallel, tensor parallel, pipeline parallel, expert parallel, ZeRO, PyTorch, JAX, TensorFlow, large models, GPU, GPU kernel optimization, Python, C++, ML frameworks, training algorithms, large language models, computer vision
Communication, problem-solving, collaboration, analytical thinking, teamwork
Industry Technology
Job Function Developing and optimizing large-scale AI training pipelines
Role Subtype Machine Learning Engineer
Tech Domains Python, C++, ML frameworks, TensorFlow, PyTorch, JAX, GPU, Distributed systems

  • Lack of experience with distributed training pipelines
  • No knowledge of ML frameworks (PyTorch, TensorFlow, JAX)
  • No experience with large models or GPU optimization
  • Unable to work in the specified location

Apply for this Position →
