How do you avoid alert fatigue in a large-scale microservices environment?
Alert fatigue happens when teams receive too many alerts, many of which are noise. Engineers start ignoring them, including the genuinely critical ones.
Strategies to combat it:
- Symptom-based alerting: Alert on user-facing symptoms (error rate, latency), not causes (high CPU). High CPU does not always mean users are impacted.
- Actionable alerts only: Every alert must have a clear runbook. If there’s no action to take, it shouldn’t be an alert.
- SLO-based alerting: Alert when you’re burning through your error budget too fast, that is, faster than the rate that would exhaust the budget before the SLO window ends.
- Regular alert audits: Review and delete alerts that consistently fire without requiring action.
- Severity tiers: P1 wakes someone up. P3 creates a ticket. Many alerts should be P3.
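
As a rough illustration of the error-budget and severity-tier points above, here is a minimal Python sketch. The names `Window`, `burn_rate`, and `classify` are hypothetical, and the 99.9% target and the 14.4 / 1.0 thresholds are illustrative assumptions rather than values from any particular tool; tune them to your own SLO.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Window:
    """Request counts observed over one lookback window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def burn_rate(window: Window, slo_target: float) -> float:
    """Return how many times faster than 'exactly on budget' errors are spent.

    A value of 1.0 means the error budget would run out exactly at the end
    of the SLO period; 10.0 means it would run out ten times sooner.
    """
    if window.total_requests == 0:
        return 0.0  # no traffic, so no budget is being consumed
    error_rate = window.failed_requests / window.total_requests
    budget = 1.0 - slo_target  # allowed fraction of failed requests
    return error_rate / budget


def classify(
    long_window: Window,
    short_window: Window,
    slo_target: float = 0.999,      # assumed SLO, for illustration only
    page_threshold: float = 14.4,   # assumed fast-burn threshold
    ticket_threshold: float = 1.0,  # assumed slow-burn threshold
) -> Optional[str]:
    """Map burn rates to a severity tier, or None if no alert should fire."""
    long_burn = burn_rate(long_window, slo_target)
    short_burn = burn_rate(short_window, slo_target)
    # Both windows must exceed the threshold, so a spike that has already
    # recovered (short window back to normal) does not page anyone.
    if long_burn >= page_threshold and short_burn >= page_threshold:
        return "P1"  # wake someone up
    if long_burn >= ticket_threshold and short_burn >= ticket_threshold:
        return "P3"  # create a ticket
    return None


if __name__ == "__main__":
    # 2% errors against a 0.1% budget is roughly a 20x burn in both windows.
    print(classify(Window(10_000, 200), Window(1_000, 20)))
```

Checking a long and a short window together is one common way to keep pages both fast and quiet: the long window shows the problem is significant, and the short window shows it is still happening.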