AI Cluster Validation Engineer

at Advanced Micro Devices

📍 Santa Clara, California, United States (Hybrid)
Posted March 28, 2026
Type Full-Time
Experience Mid-level
Exp. Years Not specified
Education Not specified
Category AI & Machine Learning

This role involves validating AI solutions, building automation for distributed training and inference workloads, and ensuring system performance in AI clusters. The engineer will work with the latest hardware and software technologies.

  • Validate AI solutions
  • Build cluster automation
  • Reproduce and prevent defects
  • Develop testing tools
  • Collaborate on hardware/software design

The technical environment covers AI infrastructure validation, cluster automation, performance profiling, and benchmarking with tools such as ROCm, Docker, Kubernetes, SLURM, and LLVM, with a focus on large-scale AI training and inference.

The ideal candidate is an experienced AI validation engineer with strong skills in software automation, system validation, and infrastructure for AI workloads. They are proficient in scripting, containerization, and performance profiling, with a focus on large-scale distributed AI systems.

Python, Linux Shell scripting, validation of AI solutions, building cluster automation, performance profiling, benchmark testing, experience with AI frameworks
Docker, Kubernetes, SLURM, LLVM, complex computer systems, HPC deployments, network design in RDMA clusters, training of LLMs, inference frameworks
ROCm, Docker, Kubernetes, SLURM, LLVM, IBPerf, NCCL, RoCEv2, vLLM, SGLang
Python, Linux Shell scripting, Docker, Kubernetes, SLURM, LLVM, GPU validation, AI infrastructure, distributed training, ML frameworks
Python, Linux Shell scripting, Docker, Kubernetes, SLURM, LLVM, GPU validation, AI infrastructure, distributed training, ML frameworks, performance profiling, benchmark testing, NCCL, RoCEv2, IBPerf, training of LLMs, MoE models, image generation, recommendation models, inference workloads, vLLM, SGLang
communication, system design, validation, automation, leadership
Industry Semiconductors & Hardware
Job Function AI infrastructure validation and automation for distributed training
Role Subtype AI Cluster Validation Engineer
Tech Domains Linux, Docker, Kubernetes, Active Directory, Microsoft 365
python, linux shell scripting, docker, kubernetes, slurm, llvm, gpu validation, ai infrastructure, distributed training, ml frameworks, performance profiling, benchmark testing, nccl, rocev2, ibperf, training of llms, inference workloads, cluster automation, linux scripting

  • Lack of experience with AI infrastructure validation
  • No scripting or automation skills
  • No experience with distributed training or HPC systems
