Position Details

Type Not Specified

Experience Senior

Exp. Years 3+ years

Education Not specified

Category AI & Machine Learning

About this role

This role focuses on improving the reliability and robustness of Anthropic's AI systems, particularly large language models, through monitoring, incident management, and infrastructure design.

Key Responsibilities

Develop SLAs for AI systems
Design monitoring and observability
Lead incident response
Support model serving infrastructure
Collaborate across teams

Technical Overview

The position involves working with distributed systems, cloud infrastructure, observability tools, and incident response processes to ensure high reliability of AI services.

Ideal Candidate

The ideal candidate is a senior reliability engineer or SRE with 3+ years of experience in distributed systems, large language models, and cloud infrastructure, capable of leading incident response and designing high-availability systems.

Must-Have Skills

Distributed systems
Reliability engineering
Monitoring and observability
Incident response
Large language models

Nice-to-Have Skills

Experience with GPUs or TPUs
ML hardware accelerators
Large-scale model serving
Production engineering

Tools & Platforms

Cloud providers
Monitoring tools
Observability platforms

Required Skills

Distributed systems
Reliability engineering
Service Level Objectives
Monitoring and observability
Incident response
Large language models
Cloud providers
SRE

Hard Skills

Distributed systems
Reliability engineering
Service Level Objectives
Monitoring and observability
High-availability infrastructure
Incident response
Large language models
Cloud providers
Infrastructure design
SRE

Soft Skills

Curiosity
Bravery
Holistic thinking
Relationship building
Communication
Collaboration

Industry & Role

Industry Technology / SaaS

Job Function Enhance reliability and robustness of AI serving infrastructure

Keywords for Your Resume

Distributed systems
Reliability engineering
Service Level Objectives
Monitoring
Observability
Incident response
Large language models
ML hardware accelerators
GPU
TPU
Cloud providers
SRE
Reliability
High-availability infrastructure
Incident management
Monitoring and observability

Deal Breakers

Lack of experience with distributed systems or reliability engineering
No familiarity with large language models
Absence of cloud infrastructure experience

Staff / Senior Software Engineer, AI Reliability