Position Details
About this role
This role involves debugging GPU and server hardware, developing automation software, and analyzing hardware trends within a large-scale ML hardware fleet. The engineer will work on system remediation and operational excellence for ML servers.
Key Responsibilities
- Debug hardware issues
- Develop automation software
- Analyze hardware trends
- Manage system remediation
- Optimize ML server fleet
Technical Overview
The technical environment includes scripting in Python and Bash, hardware debugging of GPU and server systems, data infrastructure development, and automation for fleet operations.
Ideal Candidate
The ideal candidate is a mid-level AI or machine learning engineer with at least 2 years of professional software development experience, scripting experience in Python or Bash, and experience debugging GPU and server hardware. They are skilled in developing automation tools and analyzing hardware performance trends.
Deal Breakers
- No experience with GPU or server hardware debugging
- Less than 2 years of professional software development experience
- Lack of scripting experience in Python or Bash