Position Details

Type: Not specified

Experience: Mid-level

Years of Experience: 5+

Education: Not specified

Category: DevOps & SRE

About this role

This role involves maintaining and improving the reliability of large-scale cloud platforms using SRE principles, focusing on automation, monitoring, and incident management.

Key Responsibilities

Maintaining high system availability and reliability
Designing monitoring and alerting solutions
Leading incident response and root cause analysis
Automating operational tasks
Building self-healing solutions

Technical Overview

The technical environment includes cloud platforms like Azure, Kubernetes clusters (AKS/EKS/GKE), containerization with Docker, infrastructure automation with Terraform, and monitoring tools such as Dynatrace, Datadog, and Prometheus.

Ideal Candidate

The ideal candidate is a mid-level SRE with 5+ years of experience in incident management, cloud platforms, and container orchestration. They possess strong skills in monitoring, automation, and high availability architectures, with proficiency in English and Spanish/Portuguese.

Must-Have Skills

Experience as a Site Reliability Engineer / Incident Management Engineer for 5+ years
Strong experience in Incident Escalation
Experience with Azure cloud platforms
Experience with Kubernetes administration (AKS / EKS / GKE)
Experience with containerization technologies (Docker)
Experience with Infrastructure as Code for 3+ years (Terraform preferred)
Understanding high availability architectures, auto-scaling, and disaster recovery strategies
Experience with monitoring and APM tools (Dynatrace, Datadog, Prometheus, Azure Monitor)
Experience with log aggregation systems (ELK, Loki, Splunk)
Experience with distributed tracing solutions (OpenTelemetry preferred)
Experience with alert configuration, tuning, and reduction of alert fatigue
Experience defining and tracking SLIs and SLOs
Level of English: Intermediate+ and above
Level of Spanish/Portuguese: Upper-Intermediate and above

Nice-to-Have Skills

Experience in FinTech, Healthcare, Retail, Telecom
Experience with auto-healing solutions
Experience with capacity planning and disaster recovery planning

Tools & Platforms

Azure, Kubernetes, Terraform, Dynatrace, Datadog, Prometheus, ELK, Loki, Splunk, OpenTelemetry

Required Skills

Kubernetes, AKS, EKS, GKE, Docker, Terraform, Azure, Azure Monitor, Dynatrace, Datadog, Prometheus, ELK, Loki, Splunk, OpenTelemetry, Incident Management, SLIs, SLOs, Auto-remediation, High availability architectures, Auto-scaling, Disaster recovery

Soft Skills

Communication, Leadership, Teamwork, Problem-solving, Collaboration

Industry & Role

Industry: IT & Cloud Services

Job Function: Ensure cloud platform reliability and automation using SRE best practices

Keywords for Your Resume

Site Reliability Engineer, Incident Management, Kubernetes, AKS, EKS, GKE, Docker, Terraform, Azure, Azure Monitor, Dynatrace, Datadog, Prometheus, ELK, Loki, Splunk, OpenTelemetry, SLIs, SLOs, Auto-remediation, High availability architectures, Auto-scaling, Disaster recovery, Incident Escalation, Monitoring, Logging, Alerting

Deal Breakers

Less than 5 years of relevant experience
Lack of experience with Kubernetes or cloud platforms
No proficiency in English or Spanish/Portuguese

SRE Engineer with Spanish/Portuguese