About this role
The System Development Manager for AWS Resilience leads incident response engineering and resilience tooling to keep AWS services healthy and reliable. You'll drive observability improvements, automation, and cross-team collaboration.
Key Responsibilities
- Incident Response Leadership
- Detection & Observability Improvement
- Cross-Site, Cross-Team Coordination
- Post-Incident Analysis
- Performance Management & Team Development
Technical Overview
The technical scope includes operational excellence, incident response tooling, detection and observability improvements, and coordination across global teams. The role emphasizes distributed systems, networking fundamentals, and post-incident analysis.
Ideal Candidate
The ideal candidate is a senior systems engineer with 5+ years in systems development and infrastructure operations, strong incident response leadership, and deep understanding of distributed systems and networking.
Must-Have Skills
- 1+ years of engineering team management experience
- 5+ years of experience in systems engineering, systems development, or infrastructure operations
- Strong understanding of distributed systems, networking fundamentals, and infrastructure failure modes
- Excellent communication skills
Nice-to-Have Skills
- Experience hiring, developing, and promoting engineering talent
- Experience using data to drive root cause elimination and process improvement
- Experience managing communication with geographically distributed teams
- Experience with operational best practices: monitoring, alerting, and post-incident analysis
Required Skills
Systems engineering, systems development, infrastructure operations, distributed systems, networking fundamentals, incident response, observability, monitoring, post-incident analysis, root-cause analysis, operational excellence, team development, leadership
Hard Skills
Systems engineering, Systems development, Infrastructure operations, Distributed systems, Networking fundamentals, Incident response, Observability, Monitoring, Post-incident analysis, Root-cause analysis, Operational excellence, Team development, Leadership
Soft Skills
Communication, Leadership, Problem-solving, Mentoring, Collaboration
Keywords for Your Resume
System Development Manager, AWS Resilience, AWS Incident Response, AIR, Seattle, hybrid, incident response leadership, detection & observability improvement, cross-site coordination, post-incident analysis, performance management, team development, distributed systems, networking fundamentals, monitoring, alerting, root-cause analysis, systems engineering, systems development, infrastructure operations, incident response, observability, leadership
Deal Breakers
- Less than 5 years of relevant experience
- No incident response or distributed systems experience
- Lack of leadership or team development experience