Position Details
About this role
This role involves leading the development and maintenance of scalable, reliable AI platform services using cloud-native tools, with a focus on automation, observability, and incident management.
Key Responsibilities
- Build CI/CD pipelines
- Deploy models with Kubernetes
- Implement observability
- Ensure platform reliability
- Automate incident response
Technical Overview
The technical environment includes Kubernetes, SageMaker, Ray Serve, Terraform, Vault, and the AWS and GCP cloud platforms, with an emphasis on MLOps and infrastructure automation.
Ideal Candidate
The ideal candidate is a senior DevOps or SRE professional with extensive experience building scalable, reliable cloud-native systems, particularly with Kubernetes and SageMaker, infrastructure-as-code tools such as Terraform, and secrets management with Vault.
Must-Have Skills
Nice-to-Have Skills
Tools & Platforms
Required Skills
Hard Skills
Soft Skills
Certifications
Required
Industry & Role
Keywords for Your Resume
Deal Breakers
- Lack of experience with Kubernetes or SageMaker
- No background in cloud infrastructure
- Inability to work in a hybrid environment
- No experience with IaC tools like Terraform