Position Details

Salary: $191K – $297K USD / year

Type: Not specified

Experience: Senior

Exp. Years: 5+ years

Education: Not specified

Category: DevOps & SRE

About this role

The Sr. Manager, Site Reliability Engineering will lead and scale reliability engineering for a complex multi-cloud platform, owning availability, performance, monitoring, incident response, and continuous improvement. The role is both hands-on and managerial, with a strong emphasis on automation and observability.

Key Responsibilities

Lead and mentor an SRE team with platform ownership
Own availability and performance through monitoring, incident response, and root cause analysis
Drive automation to reduce manual toil across deployment and scaling
Define observability strategies using OpenTelemetry, CloudWatch, Amazon Timestream, and Splunk
Set SLOs, SLAs, and error budgets, and perform capacity planning and performance tuning

Technical Overview

Own reliability for services running on Kubernetes across AWS EKS and GCP GKE. Build observability using OpenTelemetry, CloudWatch, Amazon Timestream, and Splunk; implement operational automation and AI-augmented on-call workflows; establish SLOs, SLAs, and error budgets; and strengthen CI/CD with supply chain security (container signing, SBOM, OPA policy validation). The role also calls for Kafka expertise.

Ideal Candidate

The ideal candidate is a senior SRE/DevOps leader with 5+ years of infrastructure or SRE experience and at least 4 years in a leadership role managing multi-team/platform engineering. They have deep hands-on expertise in AWS, GCP, Kubernetes, and observability with OpenTelemetry, CloudWatch, and Splunk, and they drive reliability through SLOs, incident response, root cause analysis, and automation.

Must-Have Skills

5+ years in SRE, DevOps, or infrastructure engineering
4+ years in a leadership role
Strong expertise in cloud platforms (AWS and GCP), container orchestration (Kubernetes, EKS), and CI/CD pipelines including supply chain security (container signing, SBOM, OPA policy validation)
Hands-on experience with OpenTelemetry
Hands-on experience with CloudWatch
Hands-on experience with Amazon Timestream
Hands-on experience with Splunk
SLOs, SLAs, and error budgets

Nice-to-Have Skills

Champion AI-Augmented Operations: adoption of AI tooling across SRE workflows, including automated incident triage, anomaly detection, and AI-assisted on-call response

Tools & Platforms

OpenTelemetry, CloudWatch, Amazon Timestream, Splunk, Kubernetes, AWS EKS (Amazon Elastic Kubernetes Service), GCP GKE (Google Kubernetes Engine), Kafka, CI/CD pipelines, container signing, SBOM (Software Bill of Materials), OPA (Open Policy Agent)

Required Skills

Site Reliability Engineering (SRE), DevOps, availability, performance, proactive monitoring, incident response, root cause analysis, automation, OpenTelemetry, CloudWatch, Amazon Timestream, Splunk, Kubernetes, AWS EKS, GCP GKE, HPA, VPA, KEDA, SLOs, SLAs, error budgets, capacity planning, performance tuning, CI/CD pipelines, supply chain security, container signing, SBOM, OPA policy validation, Kafka, Python, Go, Java

Hard Skills

Site Reliability Engineering, SRE team leadership, availability, performance of critical services, proactive monitoring, incident response, root cause analysis, automation, observability strategies, OpenTelemetry, CloudWatch, Amazon Timestream, Splunk, multi-cloud Kubernetes platform, Kubernetes, AWS EKS (Amazon Elastic Kubernetes Service), GCP GKE (Google Kubernetes Engine), HPA (Horizontal Pod Autoscaler), VPA (Vertical Pod Autoscaler), KEDA (Kubernetes Event-Driven Autoscaling), SLOs (Service Level Objectives), SLAs (Service Level Agreements), error budgets, capacity planning, performance tuning, CI/CD pipelines, supply chain security, container signing, SBOM (Software Bill of Materials), OPA (Open Policy Agent) policy validation, Kafka, Python, Go, Java, AI-Augmented Operations, automated incident triage, anomaly detection, AI-assisted on-call response, observability stack

Soft Skills

Lead & inspire
Build and mentor a high-performing SRE team
Culture of growth
Initiative
Continuous improvement
Clear, concise, and collaborative communication
Translate technical complexity for executive audiences
Collaborate and align across engineering, product, and operations
Report progress regularly to executive leadership

Industry & Role

Industry: Retail

Job Function: Provide strategic and hands-on leadership for SRE reliability at scale across multi-cloud Kubernetes infrastructure.

Role Subtype: Site Reliability Engineer

Tech Domains: Amazon Web Services, Google Cloud Platform, Kubernetes, DevOps & SRE, Python, Java, Kafka

Keywords for Your Resume

Sr. Manager, Site Reliability Engineering (SRE), DevOps, availability, performance, proactive monitoring, incident response, root cause analysis, automation, OpenTelemetry, CloudWatch, Amazon Timestream, Splunk, Kubernetes, AWS EKS (Amazon Elastic Kubernetes Service), GCP GKE (Google Kubernetes Engine), HPA (Horizontal Pod Autoscaler), VPA (Vertical Pod Autoscaler), KEDA (Kubernetes Event-Driven Autoscaling), SLOs (Service Level Objectives), SLAs (Service Level Agreements), error budgets, capacity planning, performance tuning, CI/CD pipelines, supply chain security, container signing, SBOM (Software Bill of Materials), OPA (Open Policy Agent) policy validation, Kafka, Python, Go, Java

Deal Breakers

5+ years in SRE, DevOps, or infrastructure engineering
4+ years in a leadership role

Sr. Manager, Site Reliability Engineering (Hybrid - Seattle, WA)
