
Staff / Senior Software Engineer, AI Reliability

at Anthropic

📍 San Francisco, CA | New York City, NY | Seattle, WA
Hybrid
Posted March 07, 2026
Type: Not specified
Experience: Senior
Exp. Years: 3+ years
Education: Not specified
Category: AI & Machine Learning

This role focuses on improving the reliability and robustness of Anthropic's AI systems, particularly large language models, through monitoring, incident management, and infrastructure design.

  • Develop SLAs for AI systems
  • Design monitoring and observability
  • Lead incident response
  • Support model serving infrastructure
  • Collaborate across teams

The position involves working with distributed systems, cloud infrastructure, observability tools, and incident response processes to ensure high reliability of AI services.

The ideal candidate is a senior reliability engineer or SRE with 3+ years of experience in distributed systems, large language models, and cloud infrastructure, capable of leading incident response and designing high-availability systems.

Distributed systems, Reliability engineering, Monitoring and observability, Incident response, Large language models
Experience with GPUs or TPUs, ML hardware accelerators, Large-scale model serving, Production engineering
Cloud providers, Monitoring tools, Observability platforms
Distributed systems, Reliability engineering, Service Level Objectives, Monitoring and observability, Incident response, Large language models, Cloud providers, SRE
Distributed systems, Reliability engineering, Service Level Objectives, Monitoring and observability, High-availability infrastructure, Incident response, Large language models, Cloud providers, Infrastructure design, SRE
Curiosity, Bravery, Holistic thinking, Relationship building, Communication, Collaboration
Industry: Technology / SaaS
Job Function: Enhance reliability and robustness of AI serving infrastructure
Distributed systems, Reliability engineering, Service Level Objectives, Monitoring, Observability, Incident response, Large language models, ML hardware accelerators, GPU, TPU, Cloud providers, SRE, Reliability, High-availability infrastructure, Incident management, Monitoring and observability

Lack of experience with distributed systems or reliability engineering; no familiarity with large language models; absence of cloud infrastructure experience

