Position Details
About this role
This role is for an Observability DevOps Site Reliability Engineer (SRE) who will develop and support observability capabilities across Cisco IT Datacenter and Cloud environments, leverage AI/ML to improve reliability, and own monitoring automation and toolchains.
Key Responsibilities
- Own reliability and scalability of observability platforms
- Implement AI/LLM-based monitoring use-cases
- Lead SRE technologies and toolchain maintenance
- Collaborate with distributed teams
- Drive automation in monitoring and incident response
Technical Overview
The role calls for a DevOps/SRE background with a strong observability stack (Splunk, Prometheus/Thanos, Grafana); containerization with Docker/Kubernetes/OpenShift; cloud experience (AWS/GCP/Azure); coding and scripting in Python/Go; CI/CD with GitHub/Jenkins; and on-prem and cloud integration.
Ideal Candidate
The ideal candidate is a senior DevOps/SRE engineer with 5+ years of experience, strong observability and AI/ML capabilities, and hands-on experience with containerization (Docker/Kubernetes/OpenShift). They are comfortable with multi-cloud and on-prem monitoring tools (Splunk, Prometheus, Grafana, Elastic) and able to lead across geographically distributed teams.
Deal Breakers
- Lack of 5+ years of relevant experience
- No Docker or Linux infrastructure experience
- Lack of hands-on container and monitoring tool experience