About this role

This role involves leading site reliability engineering efforts, defining SLIs and SLOs, automating workflows, and ensuring the high availability of critical systems in a cloud environment.

Key Responsibilities

Define SLIs and SLOs
Automate workflows and pipelines
Manage incident response and postmortems
Conduct capacity planning and performance tuning
Mentor junior team members

Technical Overview

The technical environment includes SRE practices, performance monitoring tools like Dynatrace and Splunk, container orchestration with Docker and Kubernetes, and cloud infrastructure, with a focus on automation, incident management, and capacity planning.

Ideal Candidate

The ideal candidate is a highly experienced SRE professional with over 15 years in software engineering and architecture, specializing in reliability, automation, and cloud environments. They possess deep expertise in performance monitoring, incident response, and capacity planning, with leadership skills to mentor teams.

Must-Have Skills

15+ years of experience in SRE
Software engineering and architecture
Strong knowledge in Performance Monitoring Tools
Experience with middleware and databases
Experience with SDLC and Agile methodologies
Capacity Planning and Demand Forecasting

Nice-to-Have Skills

Experience with NoSQL databases
Experience with security practices
Experience with automation tools
Experience with cloud platforms

Tools & Platforms

Dynatrace
Splunk
Docker
Kubernetes
Relational databases
NoSQL databases

Required Skills

Site Reliability Engineering (SRE)
SLIs
SLOs
Error Budgets
Scripting
CI/CD
Docker
Kubernetes
Capacity Planning
Performance Monitoring
Postmortems
Incident Response
Automation
Cloud Environments
Performance Tuning
Resource Optimization
Security Best Practices

Soft Skills

Leadership
Conflict Resolution
Mentoring
Collaboration
Decision Making
Problem Solving

Industry & Role

Industry: Technology / Logistics

Job Function: Lead site reliability engineering and automation for high-availability systems

Deal Breakers

Less than 15 years of experience in SRE or software engineering
Lack of experience with cloud and container orchestration
No experience with performance monitoring tools
Poor understanding of incident management
