Position Details
About this role
This role involves leading infrastructure projects for large-scale AI systems, focusing on reliability, compute uptime, and collaboration with cloud providers to solve complex infrastructure challenges.
Key Responsibilities
- Lead infrastructure projects
- Build and maintain AI clusters
- Partner with cloud providers
- Solve compute and reliability challenges
- Improve operational practices
Technical Overview
The technical environment spans distributed systems, Kubernetes, cloud platforms (AWS, GCP), systems languages (Python, Rust, Go, Java), and eBPF-based observability tooling, all aimed at building reliable AI infrastructure.
Ideal Candidate
The ideal candidate is a senior systems engineer with more than 6 years of experience in distributed systems, reliability engineering, and cloud platforms such as AWS and GCP. They have expertise in at least one systems language, such as Python, Rust, or Go, and are familiar with ML infrastructure.
Deal Breakers
- Less than 6 years of experience
- Lack of experience with distributed systems or cloud platforms
- No knowledge of systems languages (Python, Rust, Go, Java)