Position Details
About this role
This role involves building and maintaining large-scale compute and storage infrastructure to support Cursor’s AI and coding models, working closely with ML researchers and engineers to optimize training systems and hardware utilization.
Key Responsibilities
- Improve training throughput
- Build GPU infrastructure
- Collaborate with ML teams
- Automate GPU cluster management
- Enhance system reliability
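The throughput and utilization work above is typically measured with a metric such as model FLOPs utilization (MFU): achieved training FLOPs per second divided by the hardware's theoretical peak. A minimal sketch, assuming transformer-style training (the ~6 × parameters FLOPs-per-token estimate) and illustrative model and hardware numbers that are not from this posting:

```python
def mfu(params: float, tokens_per_sec: float, n_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs utilization: achieved training FLOPs/s over hardware peak.

    Uses the common ~6 * params FLOPs-per-token approximation for
    transformer training (forward + backward pass combined).
    """
    achieved = 6 * params * tokens_per_sec
    peak = n_gpus * peak_flops_per_gpu
    return achieved / peak

# Illustrative (assumed) numbers: a 7B-parameter model on 64 GPUs with
# ~989 TFLOP/s peak BF16 each, training at 250k tokens/s aggregate.
print(f"MFU: {mfu(7e9, 250_000, 64, 989e12):.1%}")
```

Tracking a number like this before and after an infrastructure change is one concrete way "improve training throughput" gets evaluated.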
Technical Overview
The role focuses on developing high-performance infrastructure, including GPU clusters, distributed storage, and networking, using tools such as Kubernetes and Slurm and infrastructure-as-code practices across Linux environments.
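Automating GPU cluster management often starts with small health-check tooling around `nvidia-smi`. A hedged sketch of the idea, parsing the CSV output of `nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits` (the threshold and the sample data below are illustrative, not from the posting):

```python
import csv
import io

def idle_gpus(nvidia_smi_csv: str, util_threshold: int = 5) -> list[int]:
    """Return indices of GPUs at or below the utilization threshold.

    Expects the output of:
      nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits
    i.e. one "index, utilization" pair per line, no units.
    """
    idle = []
    for row in csv.reader(io.StringIO(nvidia_smi_csv)):
        index, util = int(row[0]), int(row[1])
        if util <= util_threshold:
            idle.append(index)
    return idle

# Hardcoded sample from a hypothetical 4-GPU node, for illustration.
sample = "0, 98\n1, 0\n2, 87\n3, 3\n"
print(idle_gpus(sample))  # GPUs 1 and 3 look idle
```

In practice a check like this would feed a scheduler or alerting system rather than print to stdout, but the shape of the automation is the same.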
Ideal Candidate
The ideal candidate is a systems engineer with experience building large-scale compute and storage infrastructure, proficient in Python, TypeScript, Rust, and Go. They have hands-on experience with distributed storage, networking, and GPU infrastructure, and can operate in Linux and cloud environments, preferably with Kubernetes and Slurm expertise.
Deal Breakers
- Lack of experience with large-scale systems
- No experience with NVIDIA GPUs or infrastructure-as-code
- Unfamiliarity with Linux or Kubernetes