Responsibilities
• Evaluate and ensure the availability of components owned by the team, and identify how to bring all services within SLO (99.XX).
• Monitor systems for implemented automation and set SLIs/SLOs together with the respective stakeholders.
• Implement the observability platform.
• Review all ownership data and ensure it is current and complete.
• Review volume and accuracy of bugs assigned to the team and identify opportunities to improve automated triage.
• Identify CFBT (Customer Flow Based Testing) eligible flows, develop CFBT tests and train the team on how to write and maintain them.
• Lead postmortems for any incident of P1 severity or higher during the rotation. Train the team on the distributed problem management process.
• Provide operations and design consultation to drive high reliability.
• Provide emergency incident response, followed by action-oriented postmortems, RCAs, and incident debriefs.
• Drive continuous improvement through toil reduction and automation.
• Analyze application performance and availability.