Site Reliability Manager - Remote

at Kairos Technologies

Location Remote, US
Posted April 10, 2026
Salary $70 – $80 USD / year
Type Contract
Experience lead
Exp. Years 8+ years
Education Not specified
Category DevOps & SRE

Lead a remote SRE team responsible for reliability of customer-facing platforms. Own SLOs/SLIs and error budgets, drive incident/change best practices, improve observability, and champion automation to reduce toil and increase resilience.

  • Lead and grow the team (hire, coach, blameless culture)
  • Own reliability strategy (SLOs/SLIs, error budgets, guardrails)
  • Operate platform reliability (availability, latency, capacity, change management)
  • Run on-call and escalation (SEV1/2) with blameless postmortems
  • Drive observability and automation (reduce alert noise; IaC, CI/CD, chaos/load testing)

Operate and improve availability, latency, capacity planning, and change management across AWS/Azure/GCP and Kubernetes. Standardize observability using logs/metrics/traces and reduce alert noise with tools such as Datadog, Dynatrace, Prometheus, Grafana, and New Relic, while implementing Infrastructure as Code and CI/CD.

The ideal candidate is an SRE/DevOps leader with 8+ years of platform or reliability engineering experience, including 2–4 years leading SRE or DevOps/Platform teams. They have hands-on expertise operating large-scale services on AWS/Azure/GCP with Kubernetes, strong Linux and distributed systems fundamentals, and a track record of owning on-call programs, SLOs/SLIs, error budgets, and observability improvements.

8+ years in software/platform/reliability engineering, 2–4 years leading SRE/DevOps/Platform teams, operating large-scale services on AWS/Azure/GCP with Kubernetes and containers, Linux, networking, distributed systems, IaC, Terraform, CloudFormation, Bicep, CI/CD, GitHub Actions, CircleCI, Azure DevOps, one scripting language (Python/Go/Bash), observability (metrics, logs, traces), alerting best practices, on-call programs, blameless postmortems
AWS, Amazon Web Services, Azure, Google Cloud Platform, GCP, Kubernetes, Terraform, CloudFormation, Bicep, GitHub Actions, CircleCI, Azure DevOps, Datadog, Dynatrace, Prometheus, Grafana, New Relic, APM, monitoring tools
SLOs, SLIs, error budgets, incident management, SEV1/2, blameless postmortems, observability, metrics, logs, traces, alerting best practices, Infrastructure as Code, IaC, Terraform, CloudFormation, Bicep, CI/CD, GitHub Actions, CircleCI, Azure DevOps, scripting (Python/Go/Bash), on-call programs, Kubernetes, Linux, networking, distributed systems, automation, chaos/game days, load testing, toil reduction, least-privilege, secrets management, audit readiness
SLOs, SLIs, error budgets, incident management, SEV1/2, blameless postmortems, availability, latency, capacity planning, change management, observability, metrics, logs, traces, alerting, infra-as-code, Infrastructure as Code, Terraform, CloudFormation, Bicep, CI/CD, GitHub Actions, CircleCI, Azure DevOps, Python, Go, Bash, automation, resilience, chaos/game days, load testing, toil reduction, least-privilege, secrets management, audit readiness, Linux, networking, distributed systems, Kubernetes, containers, AWS, Amazon Web Services, Azure, Google Cloud Platform, GCP
leadership, hire, coach, and develop SREs, set goals, blameless, data-driven culture, stakeholder management, communication, presenting trade-offs and data to executives, comfortable presenting trade-offs, coaching engineers
Industry SaaS
Job Function Provide SRE leadership to ensure uptime, performance, and operational excellence for cloud and Kubernetes-based customer platforms
Role Subtype Site Reliability Engineer
Tech Domains Amazon Web Services, Azure, Google Cloud Platform, Kubernetes, Linux, DevOps & SRE, Cybersecurity, Python
Site Reliability Manager, SRE Manager, Sr SRE Manager, Site Reliability Engineering, DevOps, Platform teams, SLOs/SLIs, error budgets, incident management, SEV1/2, blameless postmortems, on-call, availability, latency, capacity planning, change management, observability, metrics, logs, traces, alert noise, Datadog, Dynatrace, Prometheus, Grafana, New Relic, Infrastructure as Code, IaC, Terraform, CloudFormation, Bicep, CI/CD, GitHub Actions, CircleCI, Azure DevOps, Kubernetes, Linux, distributed systems, automation, load testing, chaos/game days

  • 8+ years in software/platform/reliability engineering
  • 2–4 years leading SRE/DevOps/Platform teams
  • Proven experience operating large-scale services on AWS/Azure/GCP with Kubernetes and containers
  • Hands-on IaC (Terraform/CloudFormation/Bicep) and CI/CD (GitHub Actions/CircleCI/Azure DevOps)
  • Must be based where the role can be performed remotely within the US
