About this role
This role is an executive leadership position for AI Ops Engineering, focused on continuous operation, monitoring, and optimization of CVS Health’s enterprise AI environment. The leader will build the operating model and standards for a greenfield SRE organization spanning reliability, infrastructure, networking, observability, security, and 24/7 operations.
Key Responsibilities
- Build and lead a multi-disciplinary AI Platform SRE organization
- Establish and enforce SLO/SLI, error budgets, and availability baselines
- Implement observability, with continuous monitoring and adjustment of changes
- Lead infrastructure, network, and security operational standards
- Drive high availability, reliability, and scalability through automation and self-healing
Technical Overview
The scope includes end-to-end operational ownership for an AI platform: SLO/SLI and error budget management, availability baselines, observability strategy and alerting pipelines, and infrastructure-as-code practices. It also covers security posture (access controls, audit logging, vulnerability management) and compliance with HIPAA and NIST AI RMF.
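To make the error budget concept concrete, here is a minimal sketch of the arithmetic behind an availability SLO. The function names, the 30-day window, and the 99.9% target are illustrative assumptions, not figures from this posting.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    slo_target is a fraction, e.g. 0.999 for "three nines" (hypothetical value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # After 10.8 minutes of downtime, about 75% of the budget remains.
    print(round(budget_remaining(0.999, 10.8), 2))
```

In practice, a team might slow or pause risky changes as the remaining fraction approaches zero. The exact policy is an organizational choice.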
Ideal Candidate
The ideal candidate is an executive SRE/AI operations leader who has built and run reliable enterprise AI platforms with strong observability. They bring deep experience managing SLO/SLI and error budgets, leading 24/7 operations, and enforcing security and compliance controls for HIPAA and NIST AI RMF.
Must-Have Skills
- Executive engineering leadership for AI platform SRE
- Ensure continuous operation, monitoring, and optimization of an enterprise AI environment
- SLO/SLI and error budget management
- Observability and incident/availability improvement
- Operate across infrastructure, network, observability, security, and 24/7 operations
Nice-to-Have Skills
- Greenfield operating model and team build-out
- Advanced automation and self-healing capabilities
- NIST AI RMF experience
Tools & Platforms
Infrastructure-as-Code
Required Skills
Site Reliability Engineering (SRE), SLO/SLI, error budget management, availability baseline enforcement, cluster administration, GPU quota governance, Infrastructure-as-Code, observability, alerting pipelines, SLI/SLO dashboards, security posture, access controls, audit logging, vulnerability management, HIPAA, NIST AI RMF, security segmentation, high-performance GPU networking, fabric management, 24/7 operations, automation, self-healing
Hard Skills
Site Reliability Engineering (SRE) leadership, continuous operation, monitoring and optimization, operational baselines, infrastructure stack operations, change monitoring, observability, availability, reliability, scalability, SLO/SLI management, error budget management, availability baseline enforcement, cluster administration, GPU quota governance, Infrastructure-as-Code, infrastructure lifecycle management, compute lifecycle management, storage lifecycle management, hardware lifecycle management, compliance controls, data isolation, high-performance GPU networking, fabric management, security segmentation, network baseline enforcement, end-to-end monitoring strategy, alerting pipelines, SLI/SLO dashboards, feedback loops for operational improvement, security posture management, access controls, audit logging, vulnerability management, regulatory compliance (HIPAA, NIST AI RMF), 24/7 operations, automation, self-healing capabilities, operating model definition, team culture shaping, engineering standards definition
Soft Skills
Engineering leadership, building from the ground up, defining the operating model, shaping team culture, establishing engineering standards, cross-functional collaboration (infrastructure/AI operations), organizational leadership
Keywords for Your Resume
Executive Director, AI Ops Engineering, AI Platform SRE, Site Reliability Engineering (SRE), SLO/SLI, error budget, availability baseline enforcement, cluster administration, GPU quota governance, Infrastructure-as-Code, observability, alerting pipelines, SLI/SLO dashboards, security posture, access controls, audit logging, vulnerability management, HIPAA, NIST AI RMF, security segmentation, network baseline enforcement, high-performance GPU networking, fabric management, 24/7 operations, automation, self-healing
Deal Breakers
- Demonstrated leadership in Site Reliability Engineering (SRE) for an AI/enterprise platform with SLO/SLI and error budget ownership
- Experience spanning observability, security controls, and 24/7 operations