Position Details

Type: Not specified

Experience: Senior

Exp. Years: Not specified

Education: Not specified

Category: DevOps & SRE

About this role

The Senior Site Reliability Engineer at Kensho is responsible for ensuring the reliability, scalability, and security of internal and customer-facing services. The role includes operating production systems, designing resilient infrastructure, automating operations, and maintaining robust monitoring and incident readiness in a 24/7 on-call environment.

Key Responsibilities

Own and operate production services for availability, performance, and reliability
Design, build, and manage AWS infrastructure including EKS-based clusters
Provision infrastructure using Terraform (Infrastructure as Code)
Deploy, scale, and troubleshoot applications running on Kubernetes
Monitor system health with metrics, logs, and alerts; tune dashboards and runbooks

Technical Overview

Build and manage AWS infrastructure, including EKS-based clusters, using Terraform (Infrastructure as Code). Operate and troubleshoot Kubernetes workloads, implement Python-based automation to reduce toil, and run comprehensive monitoring built on metrics, logs, alerts, dashboards, and runbooks. The role also covers certificate lifecycle management.

Ideal Candidate

The ideal candidate is a senior, hands-on Site Reliability Engineer with strong infrastructure and software engineering skills and a Python-first approach. They have owned production reliability for customer-facing services, built and operated AWS infrastructure on EKS, implemented Infrastructure as Code with Terraform, and run Kubernetes operations and monitoring with metrics, logs, alerts, dashboards, and runbooks.

Must-Have Skills

Senior Site Reliability Engineer (SRE)
Python first
hands-on technologist
ensure the reliability, scalability, and security of both business-critical internal systems and external, customer-facing services
operate in a 24/7 on-call environment

Tools & Platforms

AWS, Amazon Web Services, EKS, Elastic Kubernetes Service (EKS), Terraform, Kubernetes, Python, Zscaler, metrics, logs, alerts, dashboards, runbooks

Required Skills

Site Reliability Engineering (SRE), production services, availability, performance, reliability, scalability, security, AWS infrastructure, EKS-based clusters, Terraform (Infrastructure as Code), Kubernetes, cluster creation, upgrades, lifecycle management, Python, automation frameworks, metrics, logs, alerts, dashboards, runbooks, troubleshooting, networking, certificates, deployments, application behavior, certificate lifecycle management, InfoSec, Vulnerability Management, Zscaler, 24/7 on call

Hard Skills

Site Reliability Engineering (SRE), production services, availability, performance, reliability, infrastructure reliability, scalability, security, AWS infrastructure, Amazon Web Services, EKS-based clusters, Elastic Kubernetes Service (EKS), Terraform (Infrastructure as Code), Infrastructure as Code, Kubernetes, cluster creation, upgrades, lifecycle management, Python, automation frameworks, metrics, logs, alerts, dashboards, runbooks, troubleshooting, networking, certificates, deployments, application behavior, certificate lifecycle management, expiration management, security posture, InfoSec, Vulnerability Management, Zscaler, on call environment, 24/7 on call

Soft Skills

hands-on technologist
strong troubleshooting skills
deep ownership of production systems
collaboration with Infrastructure, Application, and Security teams
collaboration with InfoSec, Vulnerability Management, and Network Security teams
collaboration with L1/L2 teams
communication

Industry & Role

Industry: SaaS / Artificial Intelligence

Job Function: Ensure production reliability through SRE practices, AWS/EKS operations, Terraform IaC, and Kubernetes automation

Role Subtype: Site Reliability Engineer

Tech Domains: Amazon Web Services, Kubernetes, Terraform, Python, DevOps

Keywords for Your Resume

Senior Site Reliability Engineer, Site Reliability Engineer (SRE), SRE, AWS, Amazon Web Services, EKS, Elastic Kubernetes Service (EKS), Terraform, Infrastructure as Code, Kubernetes, cluster creation, upgrades, lifecycle management, Python, automation, automation frameworks, monitoring, metrics, logs, alerts, dashboards, runbooks, troubleshoot, certificate lifecycle, 24/7 on call, Zscaler, InfoSec, Vulnerability Management

Deal Breakers

Must have Python-first experience
Must be able to operate in a 24/7 on-call environment
Must have hands-on AWS and Kubernetes/EKS experience
Must have Infrastructure as Code experience with Terraform

Senior Site Reliability Engineer - Infrastructure
