✦ Luna Orbit — AI & Machine Learning

GPU Software Development Eng.

at Advanced Micro Devices

📍 San Jose, California, United States
Posted April 16, 2026
Type Not Specified
Experience mid
Exp. Years Not stated
Education Not specified
Category AI & Machine Learning

Optimize the performance and efficiency of large-scale AI training, reinforcement learning (RL), and inference workloads on AMD GPU platforms. Lead cross-stack performance improvements by analyzing bottlenecks across compute, memory, and communication and collaborating with hardware, compiler, and framework teams.

  • Lead performance optimization of large-scale AI training/RL/inference on AMD GPU platforms
  • Identify and eliminate system bottlenecks across compute, memory, and communication
  • Drive cross-stack optimizations across kernels, compilers, runtimes, and ML frameworks
  • Develop and apply profiling, benchmarking, and performance modeling methodologies
  • Collaborate with hardware, compiler, and framework teams; contribute to open-source efforts

You will drive GPU performance optimization in single-node and multi-node environments, using advanced profiling, benchmarking, and performance modeling. Focus areas include kernel efficiency, memory bandwidth, network utilization, and coordinated improvements across kernels, compilers, runtimes, communication libraries, and ML frameworks, potentially including contributions to the open-source ecosystem.

The ideal candidate is an AI software engineer recognized as a technical leader, with deep expertise in GPU performance optimization and large-scale distributed systems. They can eliminate system bottlenecks across compute, memory, and communication on AMD GPU platforms, using advanced profiling, benchmarking, and performance modeling to improve the efficiency of AI training, RL, and inference workloads.

deep expertise in GPU performance optimization; large-scale distributed systems; system-level bottleneck analysis; performance optimization of large-scale AI training/RL/inference workloads on AMD GPU platforms across single-node and multi-node environments; advanced profiling, benchmarking, and performance modeling methodologies; collaboration with hardware, compiler, and framework teams
contribute to and lead open-source efforts to improve ecosystem performance on AMD platforms; deep expertise in GPU architecture and performance characteristics (compute units, memory hierarchy, interconnects such as PCIe/Infinity Fabric/RDMA)
AMD GPU platforms, PCIe, Infinity Fabric, RDMA
GPU performance analysis, GPU performance optimization, distributed systems, ML workloads, GPU architecture, interconnects, memory hierarchy, communication patterns, kernels, compilers, runtimes, communication libraries, ML frameworks, profiling, benchmarking, performance modeling, system bottlenecks, kernel efficiency, memory bandwidth, network utilization, open-source efforts
GPU performance analysis, distributed systems, ML workloads, GPU performance optimization, GPU architecture, interconnects, memory hierarchies, communication patterns, kernels, compilers, runtimes, communication libraries, ML frameworks, performance optimization of large-scale AI training/RL/inference workloads, profiling, benchmarking, performance modeling, system bottleneck analysis, kernel efficiency, memory bandwidth, network utilization, single-node environments, multi-node environments, open-source efforts, GPU architecture and performance characteristics, compute units, interconnects such as PCIe/Infinity Fabric/RDMA
recognized technical leadership; ability to influence architecture; ability to influence technical direction; ability to translate knowledge into measurable improvements; collaboration with hardware, compiler, and framework teams; comfort operating across layers
Industry Manufacturing
Job Function GPU performance engineering for large-scale generative AI workloads
Role Subtype AI Engineer
Tech Domains AI & Machine Learning, Kubernetes, Linux
GPU Software Development Eng., AI software Engineer, GPU performance analysis, GPU performance optimization, distributed systems, ML workloads, large-scale AI training, RL, reinforcement learning, inference workloads, single-node, multi-node, system bottlenecks, kernel efficiency, memory bandwidth, network utilization, kernels, compilers, runtimes, communication libraries, ML frameworks, profiling, benchmarking, performance modeling, GPU architecture, memory hierarchy, interconnects, PCIe, Infinity Fabric, RDMA, open-source

  • Must have deep expertise in GPU performance optimization and system-level bottleneck analysis
  • Must have experience optimizing large-scale AI training/RL/inference workloads on AMD GPU platforms
