Principal Product Manager - Platform and ML Infra (AI/ML), Annapurna Labs

at Amazon.com

📍 US, CA, Cupertino
Posted April 14, 2026
Type Full-Time
Experience executive
Exp. Years Not specified
Education Not specified
Category Product Management

In this Principal Product Manager role for Platform and ML Infra, you will define product strategy for how developers interact with AWS Trainium through AWS Neuron. You will focus on developer experience, orchestration, resiliency, and observability for high-performance ML training and inference.

  • Drive product strategy for developer interaction with Trainium through container ecosystems, resource management platforms, and AWS services
  • Lead Neuron integration strategy with orchestration tools (SLURM, Kubernetes), including EKS and SageMaker
  • Define resiliency and observability tooling strategy (diagnostics, performance monitoring, health monitoring, automated recovery, telemetry)
  • Develop deep knowledge of Trainium Architecture and Neuron Runtime System (including Neuron Runtime Library, Neuron Kernel Driver, Collective Communication Stack)
  • Partner with engineering, PMs, marketing, business development, and solution architects to make informed technical product decisions

Own strategy across container ecosystems and resource management platforms, including Neuron integration with SLURM and Kubernetes schedulers. Drive capabilities for monitoring, performance/health telemetry, automated recovery, and runtime interactions with ML frameworks to support distributed, high-performance execution on AWS services such as EKS and SageMaker.

The ideal candidate is a technical product manager with experience driving developer-facing runtime and infrastructure products in fast-moving environments. They bring strong knowledge of resource management and orchestration (SLURM, Kubernetes) and can lead product strategy for ML monitoring, observability, and resiliency for distributed high-performance computing workloads on AWS. They understand AWS Trainium/AWS Neuron concepts and can partner with engineering teams to make informed technical decisions.

  • Technical product management for developer-facing runtime and infrastructure products
  • Developer tools (SDKs, libraries, APIs) with focus on developer experience
  • Resource management and orchestration systems (SLURM, Kubernetes)
  • ML monitoring, observability, and resilience
  • Distributed systems and high-performance computing (HPC) environments
  • AWS cloud services and infrastructure
container ecosystems, Linux distribution support, Neuron integration with orchestration tools, Neuron Deep Learning Containers, EKS, SageMaker, Neuron Kernel Driver, Collective Communication Stack
AWS Trainium, AWS Neuron, AWS Inferentia, Neuron Deep Learning Containers, AMIs, SLURM, Kubernetes, EKS, SageMaker, Linux, PyTorch, JAX, Neuron Runtime Library, Neuron Kernel Driver, Collective Communication Stack, Neuron Runtime System
technical product management, developer experience, SDKs, libraries, APIs, resource management, SLURM, Kubernetes, ML monitoring, observability, resilience, distributed systems, high-performance computing (HPC), AWS cloud services, AWS Trainium, AWS Neuron, Neuron Runtime System, Neuron Deep Learning Containers, AMIs, EKS, SageMaker, Linux distribution support, PyTorch, JAX
technical product management, developer-facing runtime and infrastructure products, developer tools, SDKs, libraries, APIs, developer experience, resource management and orchestration systems, SLURM, Kubernetes schedulers, ML monitoring, observability, resiliency, distributed systems, high-performance computing (HPC) environments, AWS cloud services and infrastructure, product strategy, container ecosystems, resource management platforms, orchestration, orchestration resiliency, observability and telemetry, performance monitoring, health monitoring, automated recovery, Trainium Architecture, Neuron Runtime System, Neuron integration, Neuron Deep Learning Containers, Neuron Deep Learning Container, AMIs, Linux distribution support, Neuron Runtime Library, Neuron Kernel Driver, Collective Communication Stack, operating AI training and inference workloads, EKS, SageMaker, AWS Trainium, AWS Neuron, AWS Inferentia, PyTorch, JAX, deep learning, generative AI
balance competing customer priorities, drive alignment across engineering and business stakeholders, written and verbal communication abilities, partnership with engineering and PMs, technical decision making, stakeholder management, ability to work in a fast-moving early-stage product environment
Industry SaaS
Job Function Lead technical product strategy for AWS Neuron/Trainium developer runtime and ML infrastructure.
Role Subtype Technical Product Manager
Tech Domains Amazon Web Services, Kubernetes, Linux, Python, AI & Machine Learning
Principal Product Manager, Technical Product Manager, Platform and ML Infra, developer-facing runtime, infrastructure products, developer experience, SDKs, libraries, APIs, resource management, orchestration systems, SLURM, Kubernetes, Kubernetes schedulers, ML monitoring, observability, resiliency, distributed systems, high-performance computing, HPC environments, AWS Trainium, AWS Neuron, Neuron Deep Learning Containers, AMIs, EKS, SageMaker, Linux distribution support, Neuron Runtime System

  • Experience with technical product management for developer-facing runtime and infrastructure products
  • Experience with developer tools (SDKs, libraries, APIs) focused on developer experience
  • Experience with resource management/orchestration systems (SLURM, Kubernetes)
