Position Details

Type: Not specified

Experience: Mid

Exp. Years: Not stated

Education: Not specified

Category: AI & Machine Learning

About this role

Optimize the performance and efficiency of large-scale AI training, reinforcement learning (RL), and inference workloads on AMD GPU platforms. Lead cross-stack performance improvements by analyzing bottlenecks across compute, memory, and communication and collaborating with hardware, compiler, and framework teams.

Key Responsibilities

Lead performance optimization of large-scale AI training/RL/inference on AMD GPU platforms
Identify and eliminate system bottlenecks across compute, memory, and communication
Drive cross-stack optimizations across kernels, compilers, runtimes, and ML frameworks
Develop and apply profiling, benchmarking, and performance modeling methodologies
Collaborate with hardware, compiler, and framework teams; contribute to open-source efforts

Technical Overview

You will drive GPU performance optimization in single-node and multi-node environments, using advanced profiling, benchmarking, and performance modeling. Focus areas include kernel efficiency, memory bandwidth, and network utilization, along with coordinated improvements across kernels, compilers, runtimes, communication libraries, and ML frameworks. The work may include contributions to the open-source ecosystem.

Ideal Candidate

The ideal candidate is an AI software engineer recognized as a technical leader, with deep expertise in GPU performance optimization and large-scale distributed systems. They can eliminate system bottlenecks across compute, memory, and communication on AMD GPU platforms, using advanced profiling, benchmarking, and performance modeling to improve the efficiency of AI training, RL, and inference workloads.

Must-Have Skills

Deep expertise in GPU performance optimization
Large-scale distributed systems
System-level bottleneck analysis
Performance optimization of large-scale AI training/RL/inference workloads on AMD GPU platforms across single-node and multi-node environments
Advanced profiling, benchmarking, and performance modeling methodologies
Collaboration with hardware, compiler, and framework teams

Nice-to-Have Skills

Contribute to and lead open-source efforts to improve ecosystem performance on AMD platforms
Deep expertise in GPU architecture and performance characteristics (compute units, memory hierarchy, interconnects such as PCIe/Infinity Fabric/RDMA)

Tools & Platforms

AMD GPU platforms, PCIe, Infinity Fabric, RDMA

Required Skills

GPU performance analysis, GPU performance optimization, distributed systems, ML workloads, GPU architecture, interconnects, memory hierarchy, communication patterns, kernels, compilers, runtimes, communication libraries, ML frameworks, profiling, benchmarking, performance modeling, system bottlenecks, kernel efficiency, memory bandwidth, network utilization, open-source efforts

Hard Skills

GPU performance analysis, distributed systems, ML workloads, GPU performance optimization, GPU architecture, interconnects, memory hierarchies, communication patterns, kernels, compilers, runtimes, communication libraries, ML frameworks, performance optimization of large-scale AI training/RL/inference workloads, profiling, benchmarking, performance modeling, system bottleneck analysis, kernel efficiency, memory bandwidth, network utilization, single-node environments, multi-node environments, open-source efforts, GPU architecture and performance characteristics, compute units, interconnects such as PCIe/Infinity Fabric/RDMA

Soft Skills

Recognized technical leadership
Ability to influence architecture
Ability to influence technical direction
Ability to translate knowledge into measurable improvements
Collaboration with hardware, compiler, and framework teams
Comfort operating across layers

Industry & Role

Industry: Manufacturing

Job Function: GPU performance engineering for large-scale generative AI workloads

Role Subtype: AI Engineer

Tech Domains: AI & Machine Learning, Kubernetes, Linux

Keywords for Your Resume

GPU Software Development Eng., AI Software Engineer, GPU performance analysis, GPU performance optimization, distributed systems, ML workloads, large-scale AI training, RL, reinforcement learning, inference workloads, single-node, multi-node, system bottlenecks, kernel efficiency, memory bandwidth, network utilization, kernels, compilers, runtimes, communication libraries, ML frameworks, profiling, benchmarking, performance modeling, GPU architecture, memory hierarchy, interconnects, PCIe, Infinity Fabric, RDMA, open-source

Deal Breakers

Must have deep expertise in GPU performance optimization and system-level bottleneck analysis
Must have experience optimizing large-scale AI training/RL/inference workloads on AMD GPU platforms
