HPC Systems Engineer

at KLA

📍 Ann Arbor, MI 💰 $105K – $180K USD / year Posted April 14, 2026
Salary $105K – $180K USD / year
Experience Mid-level
Category DevOps & SRE

Support and evolve the high-performance Linux cluster that powers KLA R&D workloads including physics modeling, simulation, algorithm development, and machine learning. The role focuses on reliability, performance, and scalability of a shared, mission-critical HPC environment.

  • Support and evolve a high-performance Linux cluster
  • Drive reliability, performance, and scalability of a mission-critical HPC environment
  • Partner with infrastructure, DevOps, and application teams
  • Enable physics modeling, simulation, and algorithm development workloads
  • Ensure the platform remains fast, resilient, and ready for demanding computational challenges

This position is centered on HPC infrastructure operations for a shared high-performance Linux cluster used for simulation and machine learning workloads. It requires skills in maintaining reliability and performance at scale while partnering with DevOps and application teams.

The ideal candidate is an HPC-focused systems engineer experienced with maintaining and evolving high-performance Linux clusters for compute-heavy workloads. They bring strength in reliability, performance, and scalability for mission-critical shared HPC environments and collaborate effectively with infrastructure, DevOps, and application teams.

HPC, high-performance Linux cluster, Linux cluster reliability, compute infrastructure, physics modeling, simulation, algorithm development, machine learning workloads, reliability, performance, scalability, HPC environment scalability, performance tuning, reliability engineering
Partnership with infrastructure teams, partnership with DevOps teams, cross-functional collaboration with application teams, communication, problem-solving
Industry Manufacturing
Job Function Maintain and scale HPC compute infrastructure for R&D modeling, simulation, and machine learning workloads
Role Subtype Site Reliability Engineer
Tech Domains Linux

  • Must have experience supporting and evolving a high-performance Linux cluster for HPC workloads
  • Must demonstrate the ability to drive reliability, performance, and scalability in an HPC environment
