About this role
In this Principal Product Manager role for Platform and ML Infra, you will define product strategy for how developers interact with AWS Trainium through AWS Neuron. You will focus on developer experience, orchestration, resiliency, and observability for high-performance ML training and inference.
Key Responsibilities
- Drive product strategy for developer interaction with Trainium through container ecosystems, resource management platforms, and AWS services
- Lead Neuron integration strategy with orchestration tools (SLURM, Kubernetes), including EKS and SageMaker
- Define resiliency and observability tooling strategy (diagnostics, performance monitoring, health monitoring, automated recovery, telemetry)
- Develop deep knowledge of Trainium Architecture and Neuron Runtime System (including Neuron Runtime Library, Neuron Kernel Driver, Collective Communication Stack)
- Partner with engineering, PMs, marketing, business development, and solution architects to make informed technical product decisions
Technical Overview
Own strategy across container ecosystems and resource management platforms, including Neuron integration with SLURM and Kubernetes schedulers. Drive capabilities for monitoring, performance/health telemetry, automated recovery, and runtime interactions with ML frameworks to support distributed, high-performance execution on AWS services such as EKS and SageMaker.
Ideal Candidate
The ideal candidate is a technical product manager with experience driving developer-facing runtime and infrastructure products in fast-moving environments. They bring strong knowledge of resource management and orchestration (SLURM, Kubernetes) and can lead product strategy for ML monitoring, observability, and resiliency for distributed high-performance computing workloads on AWS. They understand AWS Trainium/AWS Neuron concepts and can partner with engineering teams to make informed technical decisions.
Must-Have Skills
- Technical product management for developer-facing runtime and infrastructure products
- Developer tools (SDKs, libraries, APIs) with focus on developer experience
- Resource management and orchestration systems (SLURM, Kubernetes)
- ML monitoring, observability, and resilience
- Distributed systems and high-performance computing (HPC) environments
- AWS cloud services and infrastructure
Nice-to-Have Skills
container ecosystems, Linux distribution support, Neuron integration with orchestration tools, Neuron Deep Learning Containers, EKS, SageMaker, Neuron Kernel Driver, Collective Communication Stack
Tools & Platforms
AWS Trainium, AWS Neuron, AWS Inferentia, Neuron Deep Learning Containers, AMIs, SLURM, Kubernetes, EKS, SageMaker, Linux, PyTorch, JAX, Neuron Runtime Library, Neuron Kernel Driver, Collective Communication Stack, Neuron Runtime System
Required Skills
technical product management, developer experience, SDKs, libraries, APIs, resource management, SLURM, Kubernetes, ML monitoring, observability, resilience, distributed systems, high-performance computing (HPC), AWS cloud services, AWS Trainium, AWS Neuron, Neuron Runtime System, Neuron Deep Learning Containers, AMIs, EKS, SageMaker, Linux distribution support, PyTorch, JAX
Hard Skills
technical product management, developer-facing runtime and infrastructure products, developer tools, SDKs, libraries, APIs, developer experience, resource management and orchestration systems, SLURM, Kubernetes schedulers, ML monitoring, observability, resiliency, distributed systems, high-performance computing (HPC) environments, AWS cloud services and infrastructure, product strategy, container ecosystems, resource management platforms, orchestration, orchestration resiliency, observability and telemetry, performance monitoring, health monitoring, automated recovery, Trainium Architecture, Neuron Runtime System, Neuron integration, Neuron Deep Learning Containers, Neuron Deep Learning Container, AMIs, Linux distribution support, Neuron Runtime Library, Neuron Kernel Driver, Collective Communication Stack, operating AI training and inference workloads, EKS, SageMaker, AWS Trainium, AWS Neuron, AWS Inferentia, PyTorch, JAX, deep learning, generative AI
Soft Skills
- Balance competing customer priorities
- Drive alignment across engineering and business stakeholders
- Written and verbal communication abilities
- Partnership with engineering and PMs
- Technical decision making
- Stakeholder management
- Ability to work in a fast-moving early-stage product environment
Keywords for Your Resume
Principal Product Manager, Technical Product Manager, Platform and ML Infra, developer-facing runtime, infrastructure products, developer experience, SDKs, libraries, APIs, resource management, orchestration systems, SLURM, Kubernetes, Kubernetes schedulers, ML monitoring, observability, resiliency, distributed systems, high-performance computing, HPC environments, AWS Trainium, AWS Neuron, Neuron Deep Learning Containers, AMIs, EKS, SageMaker, Linux distribution support, Neuron Runtime System
Deal Breakers
- Experience with technical product management for developer-facing runtime and infrastructure products
- Experience with developer tools (SDKs, libraries, APIs) focused on developer experience
- Experience with resource management/orchestration systems (SLURM, Kubernetes)