About this role
The Sr. Manager, Site Reliability Engineering will lead and scale reliability engineering for a complex multi-cloud platform, owning availability, performance, monitoring, incident response, and continuous improvement. The role is both hands-on and managerial, with a strong emphasis on automation and observability.
Key Responsibilities
- Lead and mentor an SRE team with platform ownership
- Own availability and performance through monitoring, incident response, and root cause analysis
- Drive automation to reduce manual toil across deployment and scaling
- Define observability strategies using OpenTelemetry, CloudWatch, Amazon Timestream, and Splunk
- Set SLOs, SLAs, and error budgets, and perform capacity planning and performance tuning
Technical Overview
Own reliability for services running on Kubernetes across AWS EKS and GCP GKE. Build observability using OpenTelemetry, CloudWatch, Amazon Timestream, and Splunk; implement operational automation and AI-augmented on-call workflows; establish SLOs, SLAs, and error budgets; and strengthen CI/CD with supply chain security (container signing, SBOM, OPA policy validation). The role also calls for Kafka expertise.
Ideal Candidate
The ideal candidate is a senior SRE/DevOps leader with 5+ years of infrastructure or SRE experience and at least 4 years in a leadership role managing multi-team/platform engineering. They have deep hands-on expertise in AWS and GCP, Kubernetes, observability with OpenTelemetry/CloudWatch/Splunk, and they drive reliability using SLOs, incident response, root cause analysis, and automation.
Must-Have Skills
- 5+ years in SRE, DevOps, or infrastructure engineering
- 4+ years in a leadership role
- Strong expertise in cloud platforms (AWS and GCP), container orchestration (Kubernetes, EKS), CI/CD pipelines including supply chain security (container signing, SBOM, OPA policy validation)
- Hands-on experience with OpenTelemetry
- Hands-on experience with CloudWatch
- Hands-on experience with Amazon Timestream
- Hands-on experience with Splunk
- SLOs, SLAs, and error budgets
Nice-to-Have Skills
Champion AI-Augmented Operations: adoption of AI tooling across SRE workflows, including automated incident triage, anomaly detection, and AI-assisted on-call response
Tools & Platforms
OpenTelemetry, CloudWatch, Amazon Timestream, Splunk, Kubernetes, AWS EKS (Amazon Elastic Kubernetes Service), GCP GKE (Google Kubernetes Engine), Kafka, CI/CD pipelines, container signing, SBOM (Software Bill of Materials), OPA (Open Policy Agent)
Required Skills
Site Reliability Engineering (SRE), DevOps, availability, performance, proactive monitoring, incident response, root cause analysis, automation, OpenTelemetry, CloudWatch, Amazon Timestream, Splunk, Kubernetes, AWS EKS, GCP GKE, HPA, VPA, KEDA, SLOs, SLAs, error budgets, capacity planning, performance tuning, CI/CD pipelines, supply chain security, container signing, SBOM, OPA policy validation, Kafka, Python, Go, Java
Hard Skills
Site Reliability Engineering, SRE team leadership, availability, performance of critical services, proactive monitoring, incident response, root cause analysis, automation, observability strategies, OpenTelemetry, CloudWatch, Amazon Timestream, Splunk, multi-cloud Kubernetes platform, Kubernetes, AWS EKS (Amazon Elastic Kubernetes Service), GCP GKE (Google Kubernetes Engine), HPA (Horizontal Pod Autoscaler), VPA (Vertical Pod Autoscaler), KEDA (Kubernetes Event-Driven Autoscaling), SLOs (Service Level Objectives), SLAs (Service Level Agreements), error budgets, capacity planning, performance tuning, CI/CD pipelines, supply chain security, container signing, SBOM (Software Bill of Materials), OPA (Open Policy Agent) policy validation, Kafka, Python, Go, Java, AI-Augmented Operations, automated incident triage, anomaly detection, AI-assisted on-call response, observability stack
Soft Skills
- Lead & inspire
- Build and mentor a high-performing SRE team
- Culture of growth, initiative, and continuous improvement
- Clear, concise, and collaborative communication
- Translate technical complexity for executive audiences
- Collaborate and align across engineering, product, and operations
- Report progress regularly to executive leadership
Keywords for Your Resume
Sr. Manager, Site Reliability Engineering (SRE), DevOps, availability, performance, proactive monitoring, incident response, root cause analysis, automation, OpenTelemetry, CloudWatch, Amazon Timestream, Splunk, Kubernetes, AWS EKS (Amazon Elastic Kubernetes Service), GCP GKE (Google Kubernetes Engine), HPA (Horizontal Pod Autoscaler), VPA (Vertical Pod Autoscaler), KEDA (Kubernetes Event-Driven Autoscaling), SLOs (Service Level Objectives), SLAs (Service Level Agreements), error budgets, capacity planning, performance tuning, CI/CD pipelines, supply chain security, container signing, SBOM (Software Bill of Materials), OPA (Open Policy Agent) policy validation, Kafka, Python, Go, Java
Deal Breakers
5+ years in SRE, DevOps, or infrastructure engineering; 4+ years in a leadership role