Position Details

Salary $215K – $245K USD / year

Type Not Specified

Experience Executive

Exp. Years 10+ years

Education Bachelor's degree in a technical field or equivalent hands-on experience architecting large scale HPC or AI systems

Category AI & Machine Learning

About this role

This role is for a Principal Architect who designs and oversees HPC and AI platforms in the NVIDIA ecosystem. You will be responsible for end-to-end architecture across compute, networking, storage, orchestration, scheduling, and documentation.

Key Responsibilities

Architect NVIDIA-based HPC and AI data center platforms (HGX/DGX)
Design high-performance networking and storage integrations for AI/HPC
Use BCM, Slurm, Run:AI, and Kubernetes to orchestrate workloads
Optimize performance, utilization, and cost efficiency for HPC/AI platforms
Create reusable architectural documentation and operational runbooks

Technical Overview

The technical scope includes deep architectural knowledge of NVIDIA HGX and DGX platforms, Spectrum-X networking, and scale-out storage integration (VAST Data, NetApp, WEKA, DDN, Lustre) for AI and HPC workloads. You will use and administer NVIDIA Base Command Manager (BCM), Slurm, Run:AI, Kubernetes, and Linux to deliver performant, reproducible, and optimized AI factory and HPC platforms.

Ideal Candidate

The ideal candidate is a principal-level architect with 10+ years of experience designing and optimizing HPC and AI data center platforms, specifically within the NVIDIA ecosystem (HGX, DGX, Spectrum-X). They have hands-on experience with NVIDIA Base Command Manager (BCM), Slurm, Run:AI, and Kubernetes administration, and with integrating scale-out storage systems (e.g., VAST Data, NetApp, WEKA, DDN, Lustre) into GPU-accelerated environments.

Must-Have Skills

None listed

Nice-to-Have Skills

Liquid cooling, power/cooling design, and data center integration
Multi-site, air-gapped, or regulated environments
Experience optimizing existing HPC or AI platforms for performance, utilization, and cost efficiency

Tools & Platforms

NVIDIA Base Command Manager (BCM), Slurm, Run:AI, Kubernetes, VAST Data, NetApp, WEKA, DDN, Lustre, HGX, DGX, Spectrum-X, Linux

Required Skills

NVIDIA data center platforms (HGX and DGX), GPU-accelerated compute architecture, Spectrum-X, large-scale AI factory and HPC platform design, high-performance parallel or scale-out storage systems, storage performance characteristics (bandwidth, IOPS, latency, metadata scaling), VAST Data, NetApp, WEKA, DDN, Lustre, NVIDIA Base Command Manager (BCM), Slurm, Run:AI, Kubernetes administration, Linux systems administration, containerized AI workflows, multi-tenant AI workload optimization, performance/utilization/cost efficiency optimization, multi-site air-gapped environments, liquid cooling, power/cooling design, architectural documentation, design blueprints, configuration guides, deployment validation reports, operational runbooks, One Voice standards

Hard Skills

NVIDIA data center platforms, HGX, DGX, GPU-accelerated compute architecture, AI workloads, HPC workloads, high-performance networking architectures, Spectrum-X, large-scale AI factory and HPC platform design, high-performance parallel or scale-out storage systems, storage performance characteristics (bandwidth, IOPS, latency, metadata scaling), VAST Data, NetApp, WEKA, DDN, Lustre, GPU orchestration, multi-tenant AI workload optimization, NVIDIA Base Command Manager (BCM), Slurm, Run:AI, Kubernetes administration, Linux systems administration, containerized AI workflows, HPC platform optimization (performance, utilization, cost efficiency), multi-site/air-gapped/regulated environments, liquid cooling, power/cooling design, data center integration, architectural documentation, design blueprints, configuration guides, deployment validation reports, operational runbooks, reusable templates, reference architectures, standardized design patterns, One Voice standards

Soft Skills

Senior individual contributor role, technical authority, mentoring engineers and architects, design reviews, architectural guidance, technical leadership, operating autonomously, documentation discipline, documentation quality (clarity, completeness, technical accuracy), culture of documentation

Industry & Role

Industry Consulting

Job Function Lead technical architecture for NVIDIA ecosystem HPC and AI platforms

Role Subtype Cloud Architect

Tech Domains Amazon Web Services, Google Cloud Platform, Azure, Kubernetes, Linux, VMware, AI & Machine Learning, Cloud & Infrastructure

Keywords for Your Resume

Principal Architect, Principal Architect – HPC & AI (NVIDIA Ecosystem), NVIDIA, NVIDIA data center platforms, HGX, DGX, GPU-accelerated compute architecture, AI workloads, HPC workloads, Spectrum-X, high-performance networking architectures, large-scale AI factory, HPC platform design, high-performance parallel or scale-out storage systems, bandwidth, IOPS, latency, metadata scaling, VAST Data, NetApp, WEKA, DDN, Lustre, NVIDIA Base Command Manager (BCM), Slurm, Run:AI, Kubernetes, Linux, liquid cooling

Deal Breakers

10+ years of HPC and data center experience
Expert-level, deep architectural knowledge of NVIDIA data center platforms (HGX and DGX)
