Position Details
About this role
This role focuses on enhancing AI infrastructure efficiency by developing telemetry, cost attribution, and optimization frameworks across large-scale distributed systems.
Key Responsibilities
- Build telemetry systems
- Design cost attribution frameworks
- Identify performance bottlenecks
- Optimize cluster configurations
- Collaborate with cloud providers
Technical Overview
The technical scope covers distributed systems, cloud platforms, telemetry, and performance optimization, with work done in Python, Rust, Go, and Java. The role bridges hardware-level concerns and high-level research needs.
Ideal Candidate
The ideal candidate is a senior software engineer with over 6 years of experience in distributed systems, skilled in Python, Rust, Go, or Java. They have a strong background in infrastructure optimization, telemetry, and cost management for large-scale AI systems.
Must-Have Skills
Nice-to-Have Skills
Tools & Platforms
Required Skills
Hard Skills
Soft Skills
Industry & Role
Keywords for Your Resume
Deal Breakers
- Less than 6 years of experience
- Lack of expertise in distributed systems
- No experience with cloud infrastructure
- Unfamiliar with telemetry or cost frameworks