Position Details

Salary $175K – $334K USD / year

Type Not Specified

Experience Executive

Exp. Years Not specified

Education Not specified

Category DevOps & SRE

About this role

This role is an executive leadership position for AI Ops Engineering, focused on continuous operation, monitoring, and optimization of CVS Health’s enterprise AI environment. The leader will build the operating model and standards for a greenfield SRE organization spanning reliability, infrastructure, networking, observability, security, and 24/7 operations.

Key Responsibilities

Build and lead a multi-disciplinary AI Platform SRE organization
Establish and enforce SLO/SLI, error budgets, and availability baselines
Implement observability and continuous monitoring/adjustment of changes
Lead infrastructure, network, and security operational standards
Drive high availability, reliability, and scalability through automation and self-healing

Technical Overview

The scope includes end-to-end operational ownership for an AI platform: SLO/SLI and error budget management, availability baselines, observability strategy and alerting pipelines, and infrastructure-as-code practices. It also covers security posture (access controls, audit logging, vulnerability management) and compliance with HIPAA and NIST AI RMF.

Ideal Candidate

The ideal candidate is an executive SRE/AI operations leader who has built and run reliable enterprise AI platforms with strong observability. They bring deep experience managing SLO/SLI and error budgets, leading 24/7 operations, and enforcing security and compliance controls for HIPAA and NIST AI RMF.

Must-Have Skills

Executive engineering leadership for AI platform SRE
Ensure continuous operation, monitoring, and optimization of an enterprise AI environment
SLO/SLI and error budget management
Observability and incident/availability improvement
Operate across infrastructure, network, observability, security, and 24/7 operations

Nice-to-Have Skills

Greenfield operating model and team build-out
Advanced automation and self-healing capabilities
NIST AI RMF experience

Tools & Platforms

Infrastructure-as-Code

Required Skills

Site Reliability Engineering (SRE), SLO/SLI, error budget management, availability baseline enforcement, cluster administration, GPU quota governance, Infrastructure-as-Code, observability, alerting pipelines, SLI/SLO dashboards, security posture, access controls, audit logging, vulnerability management, HIPAA, NIST AI RMF, security segmentation, high-performance GPU networking, fabric management, 24/7 operations, automation, self-healing

Hard Skills

Site Reliability Engineering (SRE) leadership, continuous operation, monitoring and optimization, operational baselines, infrastructure stack operations, change monitoring, observability, availability, reliability, scalability, SLO/SLI management, error budget management, availability baseline enforcement, cluster administration, GPU quota governance, Infrastructure-as-Code, infrastructure lifecycle management, compute lifecycle management, storage lifecycle management, hardware lifecycle management, compliance controls, data isolation, high-performance GPU networking, fabric management, security segmentation, network baseline enforcement, end-to-end monitoring strategy, alerting pipelines, SLI/SLO dashboards, feedback loops for operational improvement, security posture management, access controls, audit logging, vulnerability management, regulatory compliance (HIPAA, NIST AI RMF), 24/7 operations, automation, self-healing capabilities, operating model definition, team culture shaping, engineering standards definition, automation and self-healing

Soft Skills

Engineering leadership, building from the ground up, defining the operating model, shaping team culture, establishing engineering standards, cross-functional collaboration (infrastructure/AI operations), organizational leadership

Industry & Role

Industry Healthcare IT

Job Function Executive leadership for reliable, observable, and continuously improving enterprise AI platform operations.

Role Subtype Site Reliability Engineer

Tech Domains DevOps & SRE, Cybersecurity, Google Cloud Platform, Amazon Web Services, Linux

Keywords for Your Resume

Executive Director, AI Ops Engineering, AI Platform SRE, Site Reliability Engineering (SRE), SLO/SLI, error budget, availability baseline enforcement, cluster administration, GPU quota governance, Infrastructure-as-Code, observability, alerting pipelines, SLI/SLO dashboards, security posture, access controls, audit logging, vulnerability management, HIPAA, NIST AI RMF, security segmentation, network baseline enforcement, high-performance GPU networking, fabric management, 24/7 operations, automation, self-healing

Deal Breakers

Demonstrated leadership in Site Reliability Engineering (SRE) for an AI/enterprise platform with SLO/SLI and error budget ownership
Experience spanning observability, security controls, and 24/7 operations

Executive Director, AI Ops Engineering
