Position Details
About this role
This role focuses on improving the reliability and robustness of Anthropic's AI systems, particularly large language models, through monitoring, incident management, and infrastructure design.
Key Responsibilities
- Develop SLAs for AI systems
- Design monitoring and observability
- Lead incident response
- Support model serving infrastructure
- Collaborate across teams
Technical Overview
The position involves working with distributed systems, cloud infrastructure, observability tools, and incident response processes to ensure high reliability of AI services.
Ideal Candidate
The ideal candidate is a senior reliability engineer or SRE with 3+ years of experience in distributed systems, large language models, and cloud infrastructure, capable of leading incident response and designing high-availability systems.
Deal Breakers
- Lack of experience with distributed systems or reliability engineering
- No familiarity with large language models
- Absence of cloud infrastructure experience