Position Details

Salary $105K – $180K USD / year

Type Not Specified

Experience Mid

Exp. Years Not specified

Education Not specified

Category DevOps & SRE

About this role

Support and evolve the high-performance Linux cluster that powers KLA R&D workloads including physics modeling, simulation, algorithm development, and machine learning. The role focuses on reliability, performance, and scalability of a shared, mission-critical HPC environment.

Key Responsibilities

Support and evolve a high-performance Linux cluster
Drive reliability, performance, and scalability of a mission-critical HPC environment
Partner with infrastructure, DevOps, and application teams
Enable physics modeling, simulation, and algorithm development workloads
Ensure the platform remains fast, resilient, and ready for demanding computational challenges

Technical Overview

This position is centered on HPC infrastructure operations for a shared high-performance Linux cluster used for simulation and machine learning workloads. It requires skills in maintaining reliability and performance at scale while partnering with DevOps and application teams.

Ideal Candidate

The ideal candidate is an HPC-focused systems engineer experienced with maintaining and evolving high-performance Linux clusters for compute-heavy workloads. They bring strength in reliability, performance, and scalability for mission-critical shared HPC environments and collaborate effectively with infrastructure, DevOps, and application teams.

Must-Have Skills

Support and evolve a high-performance Linux cluster
Drive reliability, performance, and scalability of a shared, mission-critical HPC environment

Required Skills

HPC, high-performance Linux cluster, compute infrastructure, physics modeling, simulation, algorithm development, machine learning workloads, reliability, performance, scalability

Hard Skills

High-performance Linux cluster, Linux cluster reliability, HPC, physics modeling, simulation, algorithm development, machine learning workloads, compute infrastructure, HPC environment scalability, performance tuning, reliability engineering

Soft Skills

Partnership with infrastructure teams, partnership with DevOps teams, cross-functional collaboration with application teams, communication, problem-solving

Industry & Role

Industry Manufacturing

Job Function Maintain and scale HPC compute infrastructure for R&D modeling, simulation, and machine learning workloads

Role Subtype Site Reliability Engineer

Tech Domains Linux

Keywords for Your Resume

HPC Systems Engineer, HPC, high-performance Linux cluster, Linux cluster, compute infrastructure, physics modeling, simulation, algorithm development, machine learning workloads, reliability, performance, scalability, shared mission-critical HPC environment, infrastructure, DevOps, application teams, Linux

Deal Breakers

Must have experience supporting and evolving a high-performance Linux cluster for HPC workloads
Must demonstrate ability to drive reliability, performance, and scalability in an HPC environment
