Position Details

Type: Full-Time

Experience: Senior

Exp. Years: 5+ years

Education: Bachelor's degree in computer science, computer engineering, or related field

Category: DevOps & SRE

About this role

This role is for an Observability DevOps Site Reliability Engineer (SRE) who will develop and support observability capabilities across Cisco IT Datacenter and Cloud environments, leverage AI/ML to improve reliability, and own monitoring automation and toolchains.

Key Responsibilities

Own reliability and scalability of observability platforms
Implement AI/LLM-based monitoring use-cases
Lead SRE technologies and toolchain maintenance
Collaborate with distributed teams
Drive automation in monitoring and incident response

Technical Overview

DevOps/SRE role built around a strong observability stack: Splunk, Prometheus/Thanos, and Grafana; containerization with Docker, Kubernetes, and OpenShift; cloud experience (AWS/GCP/Azure); coding and scripting in Python/Go; CI/CD with GitHub/Jenkins; and on-prem and cloud integration.

Ideal Candidate

The ideal candidate is a senior DevOps/SRE engineer with 5+ years of experience, strong observability and AI/ML capabilities, and hands-on experience with containerization (Docker/Kubernetes/OpenShift). They are comfortable with multi-cloud and on-prem monitoring tools (Splunk, Prometheus, Grafana, Elastic) and able to lead across geographically distributed teams.

Must-Have Skills

Bachelor's degree in computer science or related field
5+ years of relevant experience
Experience with Docker and Linux-based infrastructures

Nice-to-Have Skills

Splunk Cloud / Splunk Observability Cloud
Elastic / Prometheus / Thanos & Grafana
ThousandEyes / Zabbix / AppDynamics
JavaScript (Node.js or React)
AI/ML & LLM-based Observability

Tools & Platforms

GitHub, Jenkins, Splunk Cloud, Splunk Observability Cloud, Elastic, Prometheus, Thanos, Grafana, ThousandEyes, Zabbix, AppDynamics, Docker, Kubernetes, OpenShift, VMware, OpenStack, Ansible

Required Skills

Bachelor's degree in CS or related field; 5+ years of relevant experience; Docker and Linux-based infra; GitHub; Jenkins; CI/CD; Kubernetes/OpenShift; Python/Go; Prometheus/Thanos; Grafana; Splunk/Elastic; JavaScript (Node.js/React); AI/ML observability

Hard Skills

Observability technologies, AI/ML, GitHub, Jenkins, Python, Shell, Go, Docker, Kubernetes, OpenShift, VMware, OpenStack, Prometheus, Thanos, Grafana, Splunk, Elastic, Zabbix, AppDynamics, ThousandEyes, Node.js, React, AI/LLM-based Agentic Observability

Soft Skills

Leadership, Cross-team collaboration, Self-motivated, Relationship building, Learning agility

Certifications

Preferred

AWS Solutions Architect - Associate
AWS Solutions Architect - Professional
AWS Certified Security Specialty
AWS Developer - Associate
AWS DevOps Engineer Professional
ISC2 Certified Cloud Security Professional (CCSP)
CRISC
CISSP
CISM
CISA

Industry & Role

Industry: Networking & Telecom

Job Function: Develop and maintain observability capabilities and reliability solutions for workloads across Cisco IT Datacenter and Cloud environments.

Role Subtype: Site Reliability Engineer

Tech Domains: Linux, Docker, Kubernetes, OpenShift, Prometheus, Grafana, Splunk, Elastic, Python, Go

Keywords for Your Resume

observability, site reliability engineer, sre, devops, ai, ml, genai, kubernetes, docker, openshift, prometheus, grafana, splunk, elastic, zabbix, appdynamics, github, jenkins, git, node.js, react, cloud, aws, gcp, azure

Deal Breakers

Fewer than 5 years of relevant experience; no Docker or Linux infrastructure experience; no hands-on experience with containers and monitoring tools
