Position Details

Type: Not specified

Experience: Senior

Exp. Years: 10+ years

Education: Not specified

Category: System Administration

About this role

This role involves owning the compute uptime and resilience of large-scale AI clusters, focusing on automation, reliability, and system optimization.

Key Responsibilities

Own infrastructure strategy
Build scalable AI clusters
Define architecture
Collaborate with cloud providers
Establish operational practices

Technical Overview

The technical environment includes distributed systems, cloud infrastructure, Linux kernel tuning, eBPF, and automation tools to ensure high availability and performance of AI compute clusters.

Ideal Candidate

The ideal candidate is a senior systems engineer with over 10 years of experience in distributed systems, reliability, and cloud infrastructure. They possess deep expertise in Linux kernel tuning, eBPF, and systems automation, with a focus on building resilient AI infrastructure.

Must-Have Skills

10+ years of software engineering experience
Deep expertise in distributed systems and reliability
Experience with cloud platforms (AWS, GCP)
Strong systems programming skills (Python, Rust, Go, Java)
Experience with Linux kernel tuning and eBPF

Nice-to-Have Skills

Security and privacy best practices
Machine learning infrastructure experience
Networking infrastructure knowledge

Tools & Platforms

Kubernetes
AWS
GCP
Linux
eBPF
Linux kernel

Required Skills

Distributed systems
Reliability engineering
Cloud platforms
Kubernetes
AWS
GCP
Linux
Infrastructure automation
eBPF
Kernel tuning

Hard Skills

Distributed systems
Reliability engineering
Cloud platforms
Kubernetes
AWS
GCP
Linux
Infrastructure automation
eBPF
Kernel tuning

Soft Skills

Leadership
Strategic thinking
Communication
Problem-solving
Team mentoring

Industry & Role

Industry: AI & Machine Learning

Job Function: Manage and optimize large-scale AI infrastructure for reliability and performance

Keywords for Your Resume

Distributed systems
Reliability engineering
Cloud platforms
Kubernetes
AWS
GCP
Linux
Infrastructure automation
eBPF
Kernel tuning
Security
Privacy
Machine learning infrastructure

Deal Breakers

Less than 10 years of experience
Lack of expertise in distributed systems or reliability
No experience with cloud platforms or Linux kernel tuning

Staff+ Software Engineer, Systems