Position Details
About this role
This role owns the compute uptime and resilience of large-scale AI clusters, with a focus on automation, reliability, and system optimization.
Key Responsibilities
- Own infrastructure strategy
- Build scalable AI clusters
- Define architecture
- Collaborate with cloud providers
- Establish operational practices
Technical Overview
The technical environment includes distributed systems, cloud infrastructure, Linux kernel tuning, eBPF, and automation tools to ensure high availability and performance of AI compute clusters.
Ideal Candidate
The ideal candidate is a senior systems engineer with over 10 years of experience in distributed systems, reliability, and cloud infrastructure. They possess deep expertise in Linux kernel tuning, eBPF, and systems automation, with a focus on building resilient AI infrastructure.
Deal Breakers
- Less than 10 years of experience
- Lack of expertise in distributed systems or reliability
- No experience with cloud platforms or Linux kernel tuning