About this role
Optimize the performance and efficiency of large-scale AI training, reinforcement learning (RL), and inference workloads on AMD GPU platforms. Lead cross-stack performance improvements by analyzing bottlenecks across compute, memory, and communication and collaborating with hardware, compiler, and framework teams.
Key Responsibilities
- Lead performance optimization of large-scale AI training/RL/inference on AMD GPU platforms
- Identify and eliminate system bottlenecks across compute, memory, and communication
- Drive cross-stack optimizations across kernels, compilers, runtimes, and ML frameworks
- Develop and apply profiling, benchmarking, and performance modeling methodologies
- Collaborate with hardware, compiler, and framework teams; contribute to open-source efforts
Technical Overview
You will drive GPU performance optimization for single-node and multi-node environments, using advanced profiling, benchmarking, and performance modeling. Focus areas include kernel efficiency, memory bandwidth, network utilization, and coordinated improvements across kernels, compilers, runtimes, communication libraries, and ML frameworks, potentially including contributions to the open-source ecosystem.
Ideal Candidate
The ideal candidate is an AI software engineer recognized as a technical leader with deep expertise in GPU performance optimization and large-scale distributed systems. They can eliminate system bottlenecks across compute, memory, and communication on AMD GPU platforms, using advanced profiling, benchmarking, and performance modeling to improve the efficiency of AI training, RL, and inference workloads.
Must-Have Skills
- Deep expertise in GPU performance optimization
- Large-scale distributed systems
- System-level bottleneck analysis
- Performance optimization of large-scale AI training/RL/inference workloads on AMD GPU platforms across single-node and multi-node environments
- Advanced profiling, benchmarking, and performance modeling methodologies
- Collaboration with hardware, compiler, and framework teams
Nice-to-Have Skills
- Contributing to and leading open-source efforts to improve ecosystem performance on AMD platforms
- Deep expertise in GPU architecture and performance characteristics (compute units, memory hierarchy, interconnects such as PCIe/Infinity Fabric/RDMA)
Tools & Platforms
AMD GPU platforms, PCIe, Infinity Fabric, RDMA
Required Skills
GPU performance analysis, GPU performance optimization, distributed systems, ML workloads, GPU architecture, interconnects, memory hierarchy, communication patterns, kernels, compilers, runtimes, communication libraries, ML frameworks, profiling, benchmarking, performance modeling, system bottlenecks, kernel efficiency, memory bandwidth, network utilization, open-source efforts
Hard Skills
GPU performance analysis, distributed systems, ML workloads, GPU performance optimization, GPU architecture, interconnects, memory hierarchies, communication patterns, kernels, compilers, runtimes, communication libraries, ML frameworks, performance optimization of large-scale AI training/RL/inference workloads, profiling, benchmarking, performance modeling, system bottleneck analysis, kernel efficiency, memory bandwidth, network utilization, single-node environments, multi-node environments, open-source efforts, GPU architecture and performance characteristics, compute units, interconnects such as PCIe/Infinity Fabric/RDMA
Soft Skills
- Recognized technical leadership
- Ability to influence architecture
- Ability to influence technical direction
- Ability to translate knowledge into measurable improvements
- Collaboration with hardware, compiler, and framework teams
- Comfort operating across layers
Keywords for Your Resume
GPU Software Development Eng., AI software Engineer, GPU performance analysis, GPU performance optimization, distributed systems, ML workloads, large-scale AI training, RL, reinforcement learning, inference workloads, single-node, multi-node, system bottlenecks, kernel efficiency, memory bandwidth, network utilization, kernels, compilers, runtimes, communication libraries, ML frameworks, profiling, benchmarking, performance modeling, GPU architecture, memory hierarchy, interconnects, PCIe, Infinity Fabric, RDMA, open-source
Deal Breakers
- Must have deep expertise in GPU performance optimization and system-level bottleneck analysis
- Must have experience optimizing large-scale AI training/RL/inference workloads on AMD GPU platforms