About this role
This temporary FLEX role focuses on ensuring enterprise IT service availability and peak performance through proactive Site Reliability Engineering and incident command leadership. The position emphasizes automation, cloud technologies, and continuous process improvement to minimize disruptions and strengthen the technology landscape.
Key Responsibilities
- Serve as Incident Commander during major incidents, leading response efforts to restore services and minimize impact on business and consumer operations
- Design and implement automation tools to reduce manual intervention and improve system performance
- Perform proactive service reliability engineering and continuous process improvement
- Manage and document incident, problem, change, and release management activities
- Use cloud, infrastructure as code, and containerization technologies to enhance availability and reliability
Technical Overview
You will lead major incident response as Incident Commander, improve reliability with monitoring/performance/capacity tooling, and build automation using Python/Shell plus Ansible and Jenkins. The role works across cloud platforms (AWS, Azure, GCP) using infrastructure as code and containerization technologies, with strong IT Operations and incident/problem/change/release management practices.
Ideal Candidate
The ideal candidate is an IT operations professional with 5+ years of experience, including 2+ years of incident, problem, change, and release management that involved leading calls and documenting outcomes. They are hands-on with Python and Shell scripting, automation using Ansible and Jenkins, and reliability work across cloud platforms (AWS, Azure, GCP) with infrastructure as code and containerization. They can serve as Incident Commander in a 24x7x365 environment and bring calm, decisive leadership during major incidents.
Must-Have Skills
- 5+ years of experience in an information technology environment
- 3 years of experience in information technology focused on IT Operations, including troubleshooting complex network, server, storage, and/or application issues
- 2 years minimum of operations experience involving incident, problem, change, and release management that included leading calls and documenting outcomes
- Ability to cover shifts and on-call responsibilities in a 24x7x365 environment
- Proficiency in scripting languages (Python, Shell) and familiarity with automation tools (such as Ansible, Jenkins)
- Experience with cloud platforms (AWS, Azure, GCP), infrastructure as code, and containerization technologies
- Experience in incident command or incident management in a technology environment
- Undergraduate degree or equivalent experience/certification
Nice-to-Have Skills
- ITIL Foundations v3+ Certification
- Demonstrated experience with ITSM suites, e.g., ServiceNow
- Demonstrated experience with various monitoring, performance, or capacity tools
- Experience with continuous integration/continuous deployment (CI/CD) pipelines and DevOps practices
- Familiarity with Site Reliability Engineering principles and concepts
- Strong leadership qualities, including decisiveness and the ability to motivate teams, along with the ability to manage stressful situations calmly and effectively
- Ability to create constructive relationships, influence, and communicate with varying levels of associates and management
- Ability to solve complex, cross-functional issues
- Strong knowledge of Server, Storage, Network, Middleware, Application, and Cloud technologies
- A high degree of curiosity and a drive to seek more efficient ways of delivering service
Tools & Platforms
Ansible, Jenkins, ServiceNow, AWS, Amazon Web Services, Azure, GCP, Google Cloud Platform
Required Skills
Python, Shell, Ansible, Jenkins, AWS, Azure, GCP, infrastructure as code, containerization technologies, incident command, incident management, IT Operations, ITSM suites, ServiceNow, ITIL Foundations v3+
Hard Skills
Python, Shell, Ansible, Jenkins, AWS, Amazon Web Services, Azure, GCP, Google Cloud Platform, infrastructure as code, containerization technologies, incident command, incident management, IT Operations, troubleshooting complex network, troubleshooting complex server, troubleshooting complex storage, troubleshooting complex application issues, incident, problem, change, release management, automation tools, monitoring, performance, capacity tools, continuous integration/continuous deployment (CI/CD) pipelines, DevOps practices, Site Reliability Engineering principles, Server, Storage, Network, Middleware, Application, Cloud technologies
Soft Skills
leadership skills, proactive problem-solver, strong problem-solving, organizational skills, analytical skills, decisiveness, ability to motivate teams, ability to manage stressful situations calmly and effectively, ability to create constructive relationships, ability to influence, ability to communicate with varying levels of associates and management, ability to solve complex cross-functional issues, high degree of curiosity, ability to seek more efficient ways of delivering service
Certifications
Preferred: ITIL Foundations v3+
Keywords for Your Resume
FLEX Service Availability Analyst, Service Availability Manager, SRE Service Availability Manager, Site Reliability Engineering, Site Reliability Engineering principles, DevOps, DevOps practices, incident command, incident management, incident, problem, change, release management, 24x7x365, on-call, Python, Shell, Ansible, Jenkins, AWS, Amazon Web Services, Azure, GCP, Google Cloud Platform, infrastructure as code, containerization technologies, ITIL Foundations v3+, ServiceNow, ITSM suites
Deal Breakers
- Must have 3 years of IT Operations experience troubleshooting complex network, server, storage, and/or application issues
- Must have 2 years minimum of operations experience with incident, problem, change, and release management, including leading calls and documenting outcomes
- Must be able to cover shifts and on-call responsibilities in a 24x7x365 environment
- Must have proficiency in scripting languages (Python, Shell)