About this role
Serve as a senior Solutions Architect supporting NVIDIA customers as they build AI/ML and HPC software solutions at scale. You will lead customer technical engagements, create PoCs, and help troubleshoot performance for AI workloads including large scale LLM training and inference.
Key Responsibilities
- Develop and demonstrate solutions using NVIDIA software and hardware technologies
- Serve as main technical point of contact for AI infrastructure performance and debugging
- Lead regular technical customer meetings and develop PoCs
- Analyze customer performance issues in AI workloads and systems, and develop solutions for them
- Collaborate on integrating NVIDIA technology into hyperscaler cloud services
Technical Overview
Focus on systems and performance engineering for AI accelerators and networking, including building performance benchmarks and analyzing system bottlenecks. Work with customer and cloud environments across AWS (Amazon Web Services), GCP (Google Cloud Platform), Azure, and OCI (Oracle Cloud Infrastructure), using DevOps/MLOps tooling like Docker and Kubernetes.
Ideal Candidate
The ideal candidate is a senior Solutions Architect with 8+ years of engineering experience focused on performance, systems, and solutions for AI/ML and HPC workloads. They have hands-on experience creating performance benchmarks for data center systems, understand systems architecture involving AI accelerators and networking, and can lead customer-facing technical programs and PoCs at scale.
Must-Have Skills
- BS/MS/PhD in Electrical/Computer Engineering, Computer Science, Physics, or other Engineering fields, or equivalent experience
- 8+ years of engineering (performance/system/solution) experience
- Hands-on experience building performance benchmarks for data center systems, including large scale AI training and inference
- Understanding of systems architecture, including AI accelerators and networking, as it relates to the performance of an overall application
- Effective engineering program management, with the capability of balancing multiple tasks
- Ability to communicate ideas clearly through documents, presentations, and in external customer-facing environments
Nice-to-Have Skills
- Hands-on experience with Deep Learning frameworks (PyTorch, JAX, etc.)
- Hands-on experience with compilers (Triton, XLA, etc.)
- Hands-on experience with NVIDIA libraries (TRTLLM, TensorRT, Nemo, NCCL, RAPIDS, etc.)
- Familiarity with deep learning architectures and the latest LLM developments
- Background with NVIDIA hardware and software, performance tuning, and error diagnostics
- Hands-on experience with GPU systems in general, including but not limited to performance testing, performance tuning, and benchmarking
- Experience deploying solutions in cloud environments, including AWS, GCP, Azure, or OCI, and knowledge of DevOps/MLOps technologies such as Docker/containers, Kubernetes, and data center deployments
Tools & Platforms
PyTorch, JAX, Triton, XLA, TRTLLM, TensorRT, Nemo, NCCL, RAPIDS, Docker, Kubernetes, AWS (Amazon Web Services), GCP (Google Cloud Platform), Azure, OCI (Oracle Cloud Infrastructure)
Required Skills
AI/ML software solutions at scale, HPC software solutions at scale, performance aspects, large scale LLM training and inference, Proof of Concepts (PoCs), performance benchmarks, data center systems, systems architecture, AI accelerators and networking, engineering program management, communication through documents and presentations, Deep Learning frameworks (PyTorch, JAX), compilers (Triton, XLA), NVIDIA libraries (TRTLLM, TensorRT, Nemo, NCCL, RAPIDS), GPU systems performance testing, performance tuning, benchmarking, cloud environments (AWS, GCP, Azure, OCI), DevOps/MLOps technologies (Docker/containers, Kubernetes), command line proficiency
Hard Skills
AI/ML software solutions at scale, HPC software solutions at scale, technical support, performance aspects, large scale LLM training, LLM training and inference, Proof of Concepts (PoCs), system performance analysis, AI accelerators, systems architecture, engineering program management, performance benchmarks, data center systems, GPU systems performance testing, performance tuning, benchmarking, deep learning frameworks, PyTorch, JAX, compilers, Triton, XLA, NVIDIA libraries, TRTLLM, TensorRT, Nemo, NCCL, RAPIDS, command line proficiency, cloud environments, AWS (Amazon Web Services), GCP (Google Cloud Platform), Azure, OCI (Oracle Cloud Infrastructure), DevOps/MLOps technologies, Docker, containers, Kubernetes, data center deployments, communication through documents, presentations, customer-facing environments, external customer-facing environments, effective engineering program management
Soft Skills
communicate ideas clearly through documents, communicate through presentations, customer-facing communication, partnering with Sales Account Managers and Developer Relations Managers, main technical point of contact, regular technical customer meetings, balancing multiple tasks, collaborating with customers, debugging sessions, program management
Keywords for Your Resume
Senior Solutions Architect, Solutions Architect, AI/ML, HPC, Proof of Concepts (PoCs), performance benchmarks, data center systems, large scale LLM training, inference, systems architecture, AI accelerators, networking, engineering program management, customer-facing environments, documents, presentations, technical customer meetings, debugging sessions, PoCs, Deep Learning frameworks, PyTorch, JAX, compilers, Triton, XLA, TRTLLM, TensorRT, Nemo, NCCL, RAPIDS, GPU systems, performance tuning, benchmarking, AWS, Amazon Web Services, GCP, Google Cloud Platform, Azure, OCI, Oracle Cloud Infrastructure, DevOps/MLOps, Docker, containers, Kubernetes, data center deployments, command line proficiency, AWS (Amazon Web Services)
Deal Breakers
- Must have 8+ years of engineering (performance/system/solution) experience
- Must have hands-on experience building performance benchmarks for data center systems
- Must have an understanding of systems architecture, including AI accelerators and networking