Position Details

Type Temp-to-Hire

Experience mid

Exp. Years 5+ years

Education Undergraduate degree or equivalent experience/certification

Category DevOps & SRE

About this role

This temporary FLEX role focuses on ensuring enterprise IT service availability and peak performance through proactive Site Reliability Engineering and incident command leadership. The position emphasizes automation, cloud technologies, and continuous process improvement to minimize disruptions and strengthen the technology landscape.

Key Responsibilities

Serve as Incident Commander during major incidents, leading response efforts to restore services and minimize impact on business and consumer operations
Design and implement automation tools to reduce manual intervention and improve system performance
Perform proactive service reliability engineering and continuous process improvement
Manage and document incident, problem, change, and release management activities
Use cloud, infrastructure as code, and containerization technologies to enhance availability and reliability

Technical Overview

You will lead major incident response as Incident Commander, improve reliability with monitoring/performance/capacity tooling, and build automation using Python/Shell plus Ansible and Jenkins. The role works across cloud platforms (AWS, Azure, GCP) using infrastructure as code and containerization technologies, with strong IT Operations and incident/problem/change/release management practices.

Ideal Candidate

The ideal candidate has 5+ years of IT experience, including 3 years in IT Operations and 2+ years of incident, problem, change, and release management that involved leading calls and documenting outcomes. They are hands-on with Python and Shell scripting, automation using Ansible and Jenkins, and reliability work across cloud platforms (AWS, Azure, GCP) with infrastructure as code and containerization. They can serve as Incident Commander in a 24x7x365 environment and bring calm, decisive leadership during major incidents.

Must-Have Skills

5+ years of experience in an information technology environment.
3 years of experience in information technology focused on IT Operations, including troubleshooting complex network, server, storage, and/or application issues.
2 years minimum operations experience involving incident, problem, change, and release management that included leading calls and documenting outcomes.
Ability to cover shifts in a 24x7x365 environment and on-call responsibilities.
Proficiency in scripting languages (Python, Shell) and familiarity with automation tools (such as Ansible, Jenkins).
Experience with cloud platforms (AWS, Azure, GCP), infrastructure as code, and containerization technologies.
Experience in incident command or incident management in a technology environment.
Undergraduate degree or equivalent experience/certification.

Nice-to-Have Skills

ITIL Foundations v3+ Certification.
Demonstrated experience with ITSM suites, e.g., ServiceNow.
Demonstrated experience with various monitoring, performance, or capacity tools.
Experience with continuous integration/continuous deployment (CI/CD) pipelines and DevOps practices.
Familiarity with Site Reliability Engineering principles and concepts.
Strong leadership qualities, including decisiveness and the ability to motivate teams, along with the ability to manage stressful situations calmly and effectively.
Ability to create constructive relationships, influence, and communicate with varying levels of associates and management.
Ability to solve complex, cross-functional issues.
Strong knowledge of Server, Storage, Network, Middleware, Application, and Cloud technologies.
A high degree of curiosity and a drive to seek more efficient ways of delivering service.

Tools & Platforms

Ansible, Jenkins, ServiceNow, AWS, Amazon Web Services, Azure, GCP, Google Cloud Platform

Required Skills

Python, Shell, Ansible, Jenkins, AWS, Azure, GCP, infrastructure as code, containerization technologies, incident command, incident management, IT Operations, ITSM suites, ServiceNow, ITIL Foundations v3+

Hard Skills

Python, Shell, Ansible, Jenkins, AWS, Amazon Web Services, Azure, GCP, Google Cloud Platform, infrastructure as code, containerization technologies, incident command, incident management, IT Operations, troubleshooting complex network, troubleshooting complex server, troubleshooting complex storage, troubleshooting complex application issues, incident, problem, change, release management, automation tools, monitoring, performance, capacity tools, continuous integration/continuous deployment (CI/CD) pipelines, DevOps practices, Site Reliability Engineering principles, Server, Storage, Network, Middleware, Application, Cloud technologies

Soft Skills

leadership skills, proactive problem-solver, strong problem-solving, organizational skills, analytical skills, decisiveness, ability to motivate teams, ability to manage stressful situations calmly and effectively, ability to create constructive relationships, ability to influence, ability to communicate with varying levels of associates and management, ability to solve complex cross-functional issues, high degree of curiosity, ability to seek more efficient ways of delivering service

Certifications

Preferred

ITIL Foundations v3+

Industry & Role

Industry Government/Public Sector

Job Function Ensure enterprise IT service reliability and availability through SRE practices and incident command leadership

Role Subtype Site Reliability Engineer

Tech Domains Python, DevOps & SRE, Amazon Web Services, Azure, Google Cloud Platform, Kubernetes, Docker, ITSM / ServiceNow, Networking / TCP-IP, Linux

Keywords for Your Resume

FLEX Service Availability Analyst, Service Availability Manager, SRE Service Availability Manager, Site Reliability Engineering, Site Reliability Engineering principles, DevOps, DevOps practices, incident command, incident management, incident, problem, change, release management, 24x7x365, on-call, Python, Shell, Ansible, Jenkins, AWS, Amazon Web Services, Azure, GCP, Google Cloud Platform, infrastructure as code, containerization technologies, ITIL Foundations v3+, ServiceNow, ITSM suites

Deal Breakers

Must have 3 years of IT Operations troubleshooting complex network, server, storage, and/or application issues
Must have 2 years minimum operations experience with incident, problem, change, and release management including leading calls and documenting outcomes
Must be able to cover shifts and on-call responsibilities in a 24x7x365 environment
Must have proficiency in scripting languages (Python, Shell)
