Position Details

Salary $70 – $80 USD / year

Type Contract

Experience Lead

Exp. Years 8+ years

Education Not specified

Category DevOps & SRE

About this role

Lead a remote SRE team responsible for reliability of customer-facing platforms. Own SLOs/SLIs and error budgets, drive incident/change best practices, improve observability, and champion automation to reduce toil and increase resilience.

Key Responsibilities

Lead and grow the team (hire, coach, blameless culture)
Own reliability strategy (SLOs/SLIs, error budgets, guardrails)
Operate platform reliability (availability, latency, capacity, change management)
Run on-call and escalation (SEV1/2) with blameless postmortems
Drive observability and automation (reduce alert noise; IaC, CI/CD, chaos/load testing)

Technical Overview

Operate and improve availability, latency, capacity planning, and change management across AWS/Azure/GCP and Kubernetes. Standardize observability using logs/metrics/traces and reduce alert noise with tools such as Datadog, Dynatrace, Prometheus, Grafana, and New Relic, while implementing Infrastructure as Code and CI/CD.

Ideal Candidate

The ideal candidate is an SRE/DevOps leader with 8+ years of platform or reliability engineering experience, including 2–4 years leading SRE or DevOps/Platform teams. They have hands-on expertise operating large-scale services on AWS/Azure/GCP with Kubernetes, strong Linux and distributed systems fundamentals, and a track record of owning on-call programs, SLOs/SLIs, error budgets, and observability improvements.

Must-Have Skills

8+ years in software/platform/reliability engineering, 2–4 years leading SRE/DevOps/Platform teams, operating large-scale services on AWS/Azure/GCP with Kubernetes and containers, Linux, networking, distributed systems, IaC, Terraform, CloudFormation, Bicep, CI/CD, GitHub Actions, CircleCI, Azure DevOps, one scripting language (Python/Go/Bash), observability (metrics, logs, traces), alerting best practices, on-call programs, blameless postmortems

Tools & Platforms

AWS, Amazon Web Services, Azure, Google Cloud Platform, GCP, Kubernetes, Terraform, CloudFormation, Bicep, GitHub Actions, CircleCI, Azure DevOps, Datadog, Dynatrace, Prometheus, Grafana, New Relic, APM, monitoring tools

Required Skills

SLOs, SLIs, error budgets, incident management, SEV1/2, blameless postmortems, observability, metrics, logs, traces, alerting best practices, Infrastructure as Code, IaC, Terraform, CloudFormation, Bicep, CI/CD, GitHub Actions, CircleCI, Azure DevOps, scripting (Python/Go/Bash), on-call programs, Kubernetes, Linux, networking, distributed systems, automation, chaos/game days, load testing, toil reduction, least-privilege, secrets management, audit readiness

Hard Skills

SLOs, SLIs, error budgets, incident management, SEV1/2, blameless postmortems, availability, latency, capacity planning, change management, observability, metrics, logs, traces, alerting, infra-as-code, Infrastructure as Code, Terraform, CloudFormation, Bicep, CI/CD, GitHub Actions, CircleCI, Azure DevOps, Python, Go, Bash, automation, resilience, chaos/game days, load testing, toil reduction, least-privilege, secrets management, audit readiness, Linux, networking, distributed systems, Kubernetes, containers, AWS, Amazon Web Services, Azure, Google Cloud Platform, GCP

Soft Skills

leadership; hire, coach, and develop SREs; set goals; blameless, data-driven culture; stakeholder management; communication; presenting trade-offs and data to executives; comfortable presenting trade-offs; coaching engineers

Industry & Role

Industry SaaS

Job Function Provide SRE leadership to ensure uptime, performance, and operational excellence for cloud and Kubernetes-based customer platforms

Role Subtype Site Reliability Engineer

Tech Domains Amazon Web Services, Azure, Google Cloud Platform, Kubernetes, Linux, DevOps & SRE, Cybersecurity, Python

Keywords for Your Resume

Site Reliability Manager, SRE Manager, Sr SRE Manager, Site Reliability Engineering, DevOps, Platform teams, SLOs/SLIs, error budgets, incident management, SEV1/2, blameless postmortems, on-call, availability, latency, capacity planning, change management, observability, metrics, logs, traces, alert noise, Datadog, Dynatrace, Prometheus, Grafana, New Relic, Infrastructure as Code, IaC, Terraform, CloudFormation, Bicep, CI/CD, GitHub Actions, CircleCI, Azure DevOps, Kubernetes, Linux, distributed systems, automation, load testing, chaos/game days

Deal Breakers

8+ years in software/platform/reliability engineering, 2–4 years leading SRE/DevOps/Platform teams, proven experience operating large-scale services on AWS/Azure/GCP with Kubernetes and containers, hands-on IaC (Terraform/CloudFormation/Bicep) and CI/CD (GitHub Actions/CircleCI/Azure DevOps), must be located where the role can be performed remotely within the US


Site Reliability Manager - Remote
