✦ Luna Orbit — DevOps & SRE

Sr. Manager, Site Reliability Engineering (Hybrid - Seattle, WA)

at Nordstrom

📍 Seattle, WA Hybrid 💰 $191K – $297K USD / year Posted April 17, 2026
Salary $191K – $297K USD / year
Type Not Specified
Experience senior
Exp. Years 5+ years
Education Not specified
Category DevOps & SRE

The Sr. Manager, Site Reliability Engineering will lead and scale reliability engineering for a complex multi-cloud platform, owning availability, performance, monitoring, incident response, and continuous improvement. The role is both hands-on and managerial, with a strong emphasis on automation and observability.

  • Lead and mentor an SRE team with platform ownership
  • Own availability and performance through monitoring, incident response, and root cause analysis
  • Drive automation to reduce manual toil across deployment and scaling
  • Define observability strategies using OpenTelemetry, CloudWatch, Amazon Timestream, and Splunk
  • Set SLOs, SLAs, and error budgets and perform capacity planning and performance tuning

Own reliability for services running on Kubernetes across AWS EKS and GCP GKE. Build observability using OpenTelemetry, CloudWatch, Amazon Timestream, and Splunk; implement operational automation and AI-augmented on-call workflows; establish SLOs, SLAs, and error budgets; and strengthen CI/CD pipelines with supply chain security (container signing, SBOM, OPA policy validation). The role also calls for Kafka expertise.

The ideal candidate is a senior SRE/DevOps leader with 5+ years of infrastructure or SRE experience and at least 4 years in a leadership role managing multi-team/platform engineering. They have deep hands-on expertise in AWS and GCP, Kubernetes, observability with OpenTelemetry/CloudWatch/Splunk, and they drive reliability using SLOs, incident response, root cause analysis, and automation.

  • 5+ years in SRE, DevOps, or infrastructure engineering
  • 4+ years in a leadership role
  • Strong expertise in cloud platforms (AWS and GCP), container orchestration (Kubernetes, EKS), and CI/CD pipelines including supply chain security (container signing, SBOM, OPA policy validation)
  • Hands-on experience with OpenTelemetry, CloudWatch, Amazon Timestream, and Splunk
  • SLOs, SLAs, and error budgets
  • Champion AI-Augmented Operations: adoption of AI tooling across SRE workflows, including automated incident triage, anomaly detection, and AI-assisted on-call response
Site Reliability Engineering (SRE), DevOps, SRE team leadership, availability and performance of critical services, proactive monitoring, incident response, root cause analysis, automation, observability strategies, OpenTelemetry, CloudWatch, Amazon Timestream, Splunk, multi-cloud Kubernetes platform, Kubernetes, AWS EKS (Amazon Elastic Kubernetes Service), GCP GKE (Google Kubernetes Engine), HPA (Horizontal Pod Autoscaler), VPA (Vertical Pod Autoscaler), KEDA (Kubernetes Event-Driven Autoscaling), SLOs (Service Level Objectives), SLAs (Service Level Agreements), error budgets, capacity planning, performance tuning, CI/CD pipelines, supply chain security, container signing, SBOM (Software Bill of Materials), OPA (Open Policy Agent) policy validation, Kafka, Python, Go, Java, AI-Augmented Operations, automated incident triage, anomaly detection, AI-assisted on-call response
  • Lead and inspire: build and mentor a high-performing SRE team; culture of growth, initiative, and continuous improvement
  • Clear, concise, and collaborative communication; translate technical complexity for executive audiences
  • Collaborate and align across engineering, product, and operations
  • Report progress regularly to executive leadership
Industry Retail
Job Function Provide strategic and hands-on leadership for SRE reliability at scale across multi-cloud Kubernetes infrastructure.
Role Subtype Site Reliability Engineer
Tech Domains Amazon Web Services, Google Cloud Platform, Kubernetes, DevOps & SRE, Python, Java, Kafka

5+ years in SRE, DevOps, or infrastructure engineering, 4+ years in a leadership role
