About this role
This role involves maintaining and improving the reliability of large-scale cloud platforms using SRE principles, focusing on automation, monitoring, and incident management.
Key Responsibilities
- Maintaining high system availability and reliability
- Designing monitoring and alerting solutions
- Leading incident response and root cause analysis
- Automating operational tasks
- Building self-healing solutions
Technical Overview
The technical environment includes cloud platforms like Azure, Kubernetes clusters (AKS/EKS/GKE), containerization with Docker, infrastructure automation with Terraform, and monitoring tools such as Dynatrace, Datadog, and Prometheus.
Ideal Candidate
The ideal candidate is a mid-level SRE with 5+ years of experience in incident management, cloud platforms, and container orchestration. They possess strong skills in monitoring, automation, and high availability architectures, with proficiency in English and Spanish/Portuguese.
Must-Have Skills
- Experience as a Site Reliability Engineer / Incident Management Engineer for 5+ years
- Strong experience in Incident Escalation
- Experience with Azure cloud platforms
- Experience with Kubernetes administration (AKS / EKS / GKE)
- Experience with containerization technologies (Docker)
- Experience with Infrastructure as Code for 3+ years (Terraform preferred)
- Understanding high availability architectures, auto-scaling, and disaster recovery strategies
- Experience with monitoring and APM tools (Dynatrace, Datadog, Prometheus, Azure Monitor)
- Experience with log aggregation systems (ELK, Loki, Splunk)
- Experience with distributed tracing solutions (OpenTelemetry preferred)
- Experience with alert configuration, tuning, and reduction of alert fatigue
- Experience defining and tracking SLIs and SLOs
- Level of English – from Intermediate+ and above
- Level of Spanish/Portuguese – from Upper-Intermediate and above
Nice-to-Have Skills
- Experience in FinTech, Healthcare, Retail, Telecom
- Experience with auto-healing solutions
- Experience with capacity planning and disaster recovery planning
Tools & Platforms
Azure, Kubernetes, Terraform, Dynatrace, Datadog, Prometheus, ELK, Loki, Splunk, OpenTelemetry
Required Skills
Kubernetes, AKS, EKS, GKE, Docker, Terraform, Azure, Azure Monitor, Dynatrace, Datadog, Prometheus, ELK, Loki, Splunk, OpenTelemetry, Incident Management, SLIs, SLOs, Auto-remediation, High availability architectures, Auto-scaling, Disaster recovery
Hard Skills
Kubernetes, AKS, EKS, GKE, Docker, Terraform, Azure, Azure Monitor, Dynatrace, Datadog, Prometheus, ELK, Loki, Splunk, OpenTelemetry, Incident Management, SLIs, SLOs, Auto-remediation, High availability architectures, Auto-scaling, Disaster recovery
Soft Skills
Communication, Leadership, Teamwork, Problem-solving, Collaboration
Keywords for Your Resume
Site Reliability Engineer, Incident Management, Kubernetes, AKS, EKS, GKE, Docker, Terraform, Azure, Azure Monitor, Dynatrace, Datadog, Prometheus, ELK, Loki, Splunk, OpenTelemetry, SLIs, SLOs, Auto-remediation, High availability architectures, Auto-scaling, Disaster recovery, Incident Escalation, Monitoring, Logging, Alerting
Deal Breakers
- Less than 5 years of relevant experience
- Lack of experience with Kubernetes or cloud platforms
- No proficiency in English or Spanish/Portuguese