Position Details
About this role
This role involves developing and maintaining AI infrastructure software for large-scale AI workloads, focusing on efficiency, resiliency, and system reliability.
Key Responsibilities
- Develop infrastructure software for AI workloads
- Optimize tools for efficiency
- Design APIs for resiliency
- Enhance AI platform reliability
- Analyze failures from hardware to application
Technical Overview
The technical environment includes distributed systems, observability tools such as ELK, Prometheus, and Loki, and programming in Python and C/C++ for AI infrastructure.
Ideal Candidate
The ideal candidate is a senior software engineer with over 8 years of experience in developing infrastructure for large-scale AI systems. They possess strong skills in distributed systems, observability tools, and software engineering best practices, with a focus on scalability and resiliency.
Deal Breakers
- Less than 8 years of experience in AI infrastructure
- Lack of experience with distributed systems or observability platforms
- No proficiency in Python or C/C++