How do you structure a Grafana dashboard for a production service?
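A well-structured production dashboard follows one of two methodologies: RED for request-driven services, or USE (Utilization, Saturation, Errors) for resources such as hosts and disks.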
RED (for services):
- Rate: Requests per second
- Errors: Error rate (%)
- Duration: Latency (p50, p90, p99)
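To make the three RED signals concrete, here is a minimal sketch that derives them from raw request records. The `Request` type, the "status >= 500 counts as an error" rule, and the nearest-rank percentile method are assumptions chosen for illustration; a real dashboard would normally compute these in the metrics backend rather than in application code.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int         # HTTP status code (hypothetical field)
    duration_ms: float  # time taken to serve the request


def percentile(sorted_values, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * n), 1-based."""
    rank = max(1, (p * len(sorted_values) + 99) // 100)  # integer ceiling
    return sorted_values[rank - 1]


def red_summary(requests, window_seconds):
    """Compute Rate, Errors and Duration for one time window."""
    durations = sorted(r.duration_ms for r in requests)
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_rps": len(requests) / window_seconds,
        "error_pct": 100 * errors / len(requests),
        "p50_ms": percentile(durations, 50),
        "p90_ms": percentile(durations, 90),
        "p99_ms": percentile(durations, 99),
    }


# Made-up sample: 100 requests in a 10 s window, latencies 1..100 ms,
# two of which returned HTTP 500.
sample = [Request(500 if i in (10, 20) else 200, float(i)) for i in range(1, 101)]
summary = red_summary(sample, window_seconds=10)
print(summary)
```

Showing several percentiles rather than a mean matters because a small number of slow requests can leave the mean almost unchanged while p99 moves a lot.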
Top-level layout: Start with an SLO summary panel at the top so the on-call engineer can see immediately whether the SLO is being violated. Follow it with drill-down panels: a per-endpoint breakdown, links to error logs, and infrastructure metrics (CPU, memory). Use dashboard variables to select the environment and service.
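The layout described above could be sketched as a dashboard definition generated from Python. This is an illustration under several assumptions, not a definitive template: it assumes a Prometheus data source, and the metric names (`http_requests_total`, `http_request_duration_seconds_bucket`, `process_cpu_seconds_total`, `process_resident_memory_bytes`) and labels (`env`, `service`, `status`, `endpoint`) are placeholders that must match whatever your service actually exports. Only a subset of Grafana's dashboard JSON fields is shown, and the error log links are left out.

```python
import json

GRID_WIDTH = 24  # Grafana lays panels out on a 24-column grid


def selector(extra=""):
    """Build a PromQL label selector scoped by the dashboard variables."""
    return '{env="$env", service="$service"' + extra + "}"


def panel(title, expr, x, y, w, h=8, panel_type="timeseries"):
    """Return one panel with a position on the grid and a single query."""
    return {
        "type": panel_type,
        "title": title,
        "gridPos": {"x": x, "y": y, "w": w, "h": h},
        "targets": [{"refId": "A", "expr": expr}],
    }


# Hypothetical metric and label names; adjust to your instrumentation.
requests_rate = "sum(rate(http_requests_total" + selector() + "[5m]))"
errors_rate = "sum(rate(http_requests_total" + selector(', status=~"5.."') + "[5m]))"
p99 = ("histogram_quantile(0.99, sum by (le) (rate("
       "http_request_duration_seconds_bucket" + selector() + "[5m])))")

dashboard = {
    "title": "Service overview (sketch)",
    # Variables for environment and service selection
    "templating": {"list": [
        {"name": "env", "type": "query",
         "query": "label_values(http_requests_total, env)"},
        {"name": "service", "type": "query",
         "query": 'label_values(http_requests_total{env="$env"}, service)'},
    ]},
    "panels": [
        # Row 1: SLO summary, full width, the first thing on-call sees
        panel("Availability vs SLO", "1 - " + errors_rate + " / " + requests_rate,
              0, 0, GRID_WIDTH, h=4, panel_type="stat"),
        # Row 2: RED signals side by side
        panel("Rate (req/s)", requests_rate, 0, 4, 8),
        panel("Errors (%)", "100 * " + errors_rate + " / " + requests_rate, 8, 4, 8),
        panel("Duration p99 (s)", p99, 16, 4, 8),
        # Row 3: drill-down panels
        panel("Rate by endpoint",
              "sum by (endpoint) (rate(http_requests_total" + selector() + "[5m]))",
              0, 12, 12),
        panel("CPU", "rate(process_cpu_seconds_total" + selector() + "[5m])", 12, 12, 6),
        panel("Memory", "process_resident_memory_bytes" + selector(), 18, 12, 6),
    ],
}

dashboard_json = json.dumps(dashboard, indent=2)
print(dashboard_json)
```

Generating the JSON from code keeps the label selector in one place, so every panel respects the same `$env` and `$service` variables.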