Position Details

Salary $160K – $240K USD / year

Type Full-Time

Experience Senior

Exp. Years 10+ years’ experience

Education Not specified

Category DevOps & SRE

About this role

NinjaOne is seeking a Senior Site Reliability Engineer to help scale the reliability and availability of its platform. The role emphasizes incident diagnosis, Root Cause Analysis (RCA), observability, automation, and AWS-driven infrastructure improvements.

Key Responsibilities

Diagnose and resolve complex application and infrastructure issues
Participate in 24x7 on-call rotation, SCRUM, and deployment planning
Perform Root Cause Analysis (RCA) and provide recommendations
Improve availability and reduce customer impact using observability tools
Develop software, scripts, or tooling to improve efficiency and reduce delivery time of applications and infrastructure

Technical Overview

Own reliability and operational improvements across AWS-based services using observability platforms like New Relic, Splunk, and DataDog. Build automation and Infrastructure-as-Code (IaC) with CloudFormation (plus Terraform, Helm, Ansible) and work with containers, Fargate, Kubernetes, and distributed microservice architectures.

Ideal Candidate

The ideal candidate is a senior Site Reliability Engineer or DevOps engineer with 10+ years of experience, strong Linux administration skills, and deep operational expertise in Amazon Web Services (AWS). They have proven observability skills with New Relic, Splunk, and DataDog, and can lead Root Cause Analysis (RCA), improve availability, and automate production reliability through Infrastructure-as-Code (IaC), primarily CloudFormation, plus Terraform, Helm, and Ansible.

Must-Have Skills

10+ years’ experience in DevOps and/or Site Reliability Engineering roles
3+ years’ experience with an object-oriented language (preferably Java, .NET, or C++)
Intermediate+ level Linux administration, scripting, and troubleshooting
Demonstrable knowledge of observability tools (New Relic, Splunk, DataDog)
Comprehensive experience with Amazon Web Services (AWS) and its core capabilities (VPC, EC2, ECS, Route53, Fargate, ALB/NLB, distributions, etc.)
Experience with cloud automation and Infrastructure-as-Code (IaC) toolsets, primarily CloudFormation but also including Terraform, Helm, and Ansible
Hands-on experience with CI/CD and Software Development Life Cycle (SDLC) processes
Effective communication skills, both verbal and written
Participate in a 24x7 on-call rotation

Nice-to-Have Skills

Cloud Development Kit (CDK), Fargate, Kubernetes, containers, microservice architectures

Tools & Platforms

New Relic, Splunk, DataDog, Amazon Web Services (AWS), VPC, EC2, ECS, Route53, Fargate, ALB/NLB, CloudFormation, Terraform, Helm, Ansible, Cloud Development Kit (CDK), Kubernetes, Linux, CI/CD, SCRUM, 24x7 on-call rotation

Required Skills

DevOps, Site Reliability Engineering, Linux administration, scripting, troubleshooting, Observability, New Relic, Splunk, DataDog, AWS, VPC, EC2, ECS, Route53, Fargate, ALB/NLB, Infrastructure-as-Code (IaC), CloudFormation, Terraform, Helm, Ansible, Cloud Development Kit (CDK), containers, Kubernetes, microservice architectures, CI/CD, Software Development Life Cycle (SDLC), Root Cause Analysis (RCA), technical documentation, SOPs

Hard Skills

Root Cause Analysis (RCA), Observability, Observability tools, New Relic, Splunk, DataDog, Amazon Web Services (AWS), VPC, EC2, ECS, Route53, Fargate, ALB/NLB, distributions, Linux administration, scripting, troubleshooting, Infrastructure-as-Code (IaC), CloudFormation, Terraform, Helm, Ansible, Cloud Development Kit (CDK), containers, Kubernetes, microservice architectures, CI/CD, Software Development Life Cycle (SDLC), technical documentation, SOPs, deployment planning, 24x7 on-call rotation, SCRUM, application security-minded architecture, security

Soft Skills

passion for automation, passion for observability, effective communication skills (both verbal and written), problem-solving, cross-team influence, documentation, security-minded thinking, participation in SCRUM

Industry & Role

Industry SaaS

Job Function Ensure and improve production reliability and scalability for a cloud SaaS platform through AWS operations, observability, automation, and SRE practices.

Role Subtype Site Reliability Engineer

Tech Domains Amazon Web Services, Linux, Kubernetes, DevOps & SRE

Keywords for Your Resume

Senior Site Reliability Engineer, Site Reliability Engineering, DevOps, Linux administration, Root Cause Analysis (RCA), Observability, New Relic, Splunk, DataDog, Amazon Web Services (AWS), VPC, EC2, ECS, Route53, Fargate, ALB/NLB, Infrastructure-as-Code (IaC), CloudFormation, Terraform, Helm, Ansible, Kubernetes, CI/CD, Software Development Life Cycle (SDLC), Cloud Development Kit (CDK)

Deal Breakers

10+ years’ experience in DevOps and/or Site Reliability Engineering roles
3+ years’ experience with an object-oriented language (preferably Java, .NET, or C++)
Intermediate+ level Linux administration, scripting, and troubleshooting
Comprehensive Amazon Web Services (AWS) experience (VPC, EC2, ECS, Route53, Fargate, ALB/NLB)
Must be able to participate in a 24x7 on-call rotation and be located in one of the listed eligible U.S. states
