About this role
Anthropic is seeking a Staff-level Infrastructure Engineer to design and implement high-performance data processing infrastructure for large language model training. The role focuses on scalable, fault-tolerant systems, data quality, reproducibility, and observability, collaborating with research teams.
Key Responsibilities
- Design and implement high-performance data processing infrastructure for LLM training
- Develop and maintain core processing primitives (tokenization, deduplication, chunking)
- Build robust data quality assurance and validation at scale
- Implement comprehensive monitoring systems for data processing infrastructure
- Create and optimize distributed computing systems for processing web-scale datasets
Technical Overview
Stack includes Python, Rust, Apache Spark, IaC (Terraform), and cloud platforms to build distributed data processing pipelines, with emphasis on monitoring, traceability, and scalable data prep for LLMs.
Ideal Candidate
The ideal candidate is a senior infrastructure engineer with 5+ years of distributed systems experience, strong Python and Rust skills, and proven ability to design scalable data processing infrastructure for large language model training.
Must-Have Skills
5+ years of experience in distributed systemsPythonRustApache SparkDistributed systemsHigh-throughputfault-tolerant system designInfrastructure as CodeTerraformCloud computing platformsTokenization algorithmsMonitoring and observabilityReproducibility and traceability in data preparation
Nice-to-Have Skills
Significant experience building and maintaining large-scale distributed systemsReliability and performance focusAbility to solve complex technical challenges at scaleOwnership and independenceInterest in ML infrastructureExperience with ML data pipelines
Tools & Platforms
Apache SparkTerraformInfrastructure as CodeCloud computing platforms
Required Skills
PythonRustApache Sparkdistributed systemshigh-throughputfault-tolerant system designdata processing infrastructuretokenizationtokenization algorithmsmonitoringobservabilityInfrastructure as CodeTerraformcloud computing platformsdata quality assurancereproducibilitytraceabilitylarge language model training
Hard Skills
PythonRustApache SparkDistributed systemsInfrastructure as CodeTerraformCloud computing platformsData processing infrastructureTokenizationTokenization algorithmsMonitoringObservabilityHigh-throughput system designFault-tolerant system designData quality assuranceReproducibilityTraceabilityLarge language model trainingPerformance optimization
Soft Skills
Strong problem-solving skillsAttention to detailExcellent communicationCollaborativeTeam playerAbility to work with ambiguityDelivery-focusedProactive
Keywords for Your Resume
Infrastructure Engineerdistributed systemsPythonRustApache SparkInfrastructure as CodeTerraformcloud computing platformstokenization algorithmsMonitoringObservabilitydata quality assurancereproducibilitytraceabilitylarge language model traininghigh-throughputfault-tolerantperformance optimizationdata processing infrastructureML infrastructuretokenization
Deal Breakers
Less than 5 years of experience, No Python or Rust experience, No distributed systems experience, Unwilling to work on-site in San Francisco
Get matched to jobs like this
Luna finds roles that fit your skills and career goals — no endless scrolling required.
Create a Free Profile