How do you implement on-call rotation and incident response in an SRE team?
A mature on-call process has these elements:
- Schedules: Use a scheduling tool such as PagerDuty or Opsgenie to rotate on-call assignments and define escalation policies.
- Runbooks: Every alert links to a runbook with investigation steps and common resolutions.
- Severity levels: Define a scale from P1 (major outage, wake anyone up) down to P4 (low impact, handled during business hours only).
- Incident channels: Open a dedicated Slack channel per incident, and assign an Incident Commander and a Communications Lead.
- Postmortems: Hold a blameless postmortem for every P1/P2 incident. Focus on system improvements rather than on blaming individuals.
- On-call health: Track toil and page volume. If engineers are paged more than 2-3 times per shift, alert quality needs improvement.
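
A few of these elements can be sketched in code. The snippet below is a minimal illustration, not a replacement for PagerDuty or Opsgenie: it models the P1-P4 scale, the "postmortem for every P1/P2" rule, a simple round-robin rotation, and a check for shifts that exceed the paging threshold. All function names are hypothetical, week-long shifts are an assumption, and the default limit of 3 pages is taken from the upper end of the "2-3 times per shift" guideline.

```python
from collections import Counter
from datetime import date, timedelta
from enum import IntEnum


class Severity(IntEnum):
    P1 = 1  # major outage: wake anyone up
    P2 = 2
    P3 = 3
    P4 = 4  # low impact: business hours only


def needs_postmortem(severity: Severity) -> bool:
    """Blameless postmortem is required for every P1/P2 incident."""
    return severity <= Severity.P2


def weekly_rotation(engineers: list[str], start: date, weeks: int) -> list[tuple[date, str]]:
    """Round-robin schedule: one engineer per week-long shift (assumed shift length)."""
    if not engineers:
        raise ValueError("at least one engineer is required")
    return [
        (start + timedelta(weeks=i), engineers[i % len(engineers)])
        for i in range(weeks)
    ]


def noisy_shifts(page_log: list[str], max_pages: int = 3) -> list[str]:
    """Return shift IDs that received more than max_pages pages.

    page_log holds one shift ID per page received, e.g. ["2024-W01", "2024-W01"].
    """
    counts = Counter(page_log)
    return sorted(shift for shift, n in counts.items() if n > max_pages)
```

For example, `noisy_shifts(["w1"] * 4 + ["w2"] * 2)` flags only `"w1"`, which would be a signal to review the alerts that fired during that shift. In practice the page log would come from the paging tool's reporting or export features rather than a hand-built list.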