Executive Director, AI Ops Engineering

at CVS Health

Posted April 14, 2026
Salary $175K – $334K USD / year
Type Not specified
Experience Executive
Exp. Years Not specified
Education Not specified
Category DevOps & SRE

This role is an executive leadership position for AI Ops Engineering, focused on continuous operation, monitoring, and optimization of CVS Health’s enterprise AI environment. The leader will build the operating model and standards for a greenfield SRE organization spanning reliability, infrastructure, networking, observability, security, and 24/7 operations.

  • Build and lead a multi-disciplinary AI Platform SRE organization
  • Establish and enforce SLO/SLI, error budgets, and availability baselines
  • Implement observability and continuously monitor and adjust to changes
  • Set operational standards for infrastructure, networking, and security
  • Drive high availability, reliability, and scalability through automation and self-healing

The scope includes end-to-end operational ownership for an AI platform: SLO/SLI and error budget management, availability baselines, observability strategy and alerting pipelines, and infrastructure-as-code practices. It also covers security posture (access controls, audit logging, vulnerability management) and compliance with HIPAA and NIST AI RMF.

The ideal candidate is an executive SRE/AI operations leader who has built and run reliable enterprise AI platforms with strong observability. They bring deep experience managing SLO/SLI and error budgets, leading 24/7 operations, and enforcing security and compliance controls for HIPAA and NIST AI RMF.

Executive engineering leadership for AI platform SRE; Ensure continuous operation, monitoring, and optimization of an enterprise AI environment; SLO/SLI and error budget management; Observability and incident/availability improvement; Operate across infrastructure, network, observability, security, and 24/7 operations
Greenfield operating model and team build-out; advanced automation and self-healing capabilities; NIST AI RMF experience
Infrastructure-as-Code
Site Reliability Engineering (SRE), SLO/SLI, error budget management, availability baseline enforcement, cluster administration, GPU quota governance, Infrastructure-as-Code, observability, alerting pipelines, SLI/SLO dashboards, security posture, access controls, audit logging, vulnerability management, HIPAA, NIST AI RMF, security segmentation, high-performance GPU networking, fabric management, 24/7 operations, automation, self-healing
Site Reliability Engineering (SRE) leadership, continuous operation, monitoring and optimization, operational baselines, infrastructure stack operations, change monitoring, observability, availability, reliability, scalability, SLO/SLI management, error budget management, availability baseline enforcement, cluster administration, GPU quota governance, Infrastructure-as-Code, infrastructure lifecycle management, compute lifecycle management, storage lifecycle management, hardware lifecycle management, compliance controls, data isolation, high-performance GPU networking, fabric management, security segmentation, network baseline enforcement, end-to-end monitoring strategy, alerting pipelines, SLI/SLO dashboards, feedback loops for operational improvement, security posture management, access controls, audit logging, vulnerability management, regulatory compliance (HIPAA, NIST AI RMF), 24/7 operations, automation, self-healing capabilities, operating model definition, team culture shaping, engineering standards definition, automation and self-healing
engineering leadership, building from the ground up, define operating model, shape team culture, establish engineering standards, cross-functional collaboration (infrastructure/AI operations), organizational leadership
Industry Healthcare IT
Job Function Executive leadership for reliable, observable, and continuously improving enterprise AI platform operations.
Role Subtype Site Reliability Engineer
Tech Domains DevOps & SRE, Cybersecurity, Google Cloud Platform, Amazon Web Services, Linux
Executive Director, AI Ops Engineering, AI Platform SRE, Site Reliability Engineering (SRE), SLO/SLI, error budget, availability baseline enforcement, cluster administration, GPU quota governance, Infrastructure-as-Code, observability, alerting pipelines, SLI/SLO dashboards, security posture, access controls, audit logging, vulnerability management, HIPAA, NIST AI RMF, security segmentation, network baseline enforcement, high-performance GPU networking, fabric management, 24/7 operations, automation, self-healing

Demonstrated leadership in Site Reliability Engineering (SRE) for an AI/enterprise platform, with ownership of SLO/SLI and error budgets. Experience spanning observability, security controls, and 24/7 operations.
