What is the difference between metrics, logs, and traces in observability?
Observability is the ability to understand the internal state of a system by examining its external outputs. The three pillars of observability are metrics, logs, and traces.
Metrics
Metrics are numerical measurements collected over time. They represent the current state or behavior of a system in an aggregated form. Examples: CPU usage percentage, request count per second, error rate, memory usage, p99 latency.
Metrics are best for dashboards and alerting on system health, detecting anomalies and trends over time, and capacity planning. Tools: Prometheus, Datadog, CloudWatch, Grafana (visualization).
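To make the idea of aggregation concrete, here is a minimal sketch of how raw request samples could be reduced to three of the example metrics above. It uses only the Python standard library. The function names, the record fields (latency_ms, status) and the rule that a 5xx status counts as an error are assumptions made for illustration, not the API of any of the tools listed.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    # Nearest-rank method: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(requests, window_seconds):
    """Aggregate raw request records from one time window into metrics.

    Each record is a dict with the hypothetical keys "latency_ms" and "status".
    """
    if not requests:
        raise ValueError("summarize() needs at least one request")
    latencies = [r["latency_ms"] for r in requests]
    # Assumption: any 5xx status counts as an error.
    errors = sum(1 for r in requests if r["status"] >= 500)
    return {
        "requests_per_second": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "p99_latency_ms": percentile(latencies, 99),
    }
```

Note what the aggregation discards: the result says how many requests failed, but not which ones or why. That detail lives in logs and traces.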
Logs
Logs are timestamped records of discrete events that occurred in a system. They provide detailed context about what happened and when. Examples: Application error messages, HTTP access logs, audit trails, debug output.
Logs are best for debugging specific errors or incidents, keeping audit trails for compliance, and understanding event sequences. Tools: ELK Stack (Elasticsearch, Logstash, Kibana), Loki, Splunk, CloudWatch Logs.
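As an illustration of a timestamped, discrete event record, the following sketch emits one JSON object per log line using Python's standard logging module. The JsonFormatter class, the build_logger helper and the optional trace_id and status fields are hypothetical names chosen for this example; real log pipelines define their own schemas.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        # "trace_id" and "status" are illustrative field names.
        for key in ("trace_id", "status"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


def build_logger(name, stream):
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger
```

A call such as build_logger("checkout", sys.stdout).error("payment failed: %s", "timeout", extra={"trace_id": "abc123"}) would write one JSON line containing the timestamp, level, message and trace ID. Structured fields like these are what make logs searchable in the tools listed above.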
Traces
Traces follow a single request as it flows through distributed services, capturing the path and timing of each operation. A trace consists of spans: individual units of work, each with a start time and a duration. A shared trace ID links all spans of a single request across services.
Traces are best for identifying bottlenecks in distributed systems, understanding service dependencies, and debugging latency issues across microservices. Tools: Jaeger, Zipkin, AWS X-Ray, OpenTelemetry (instrumentation and collection).
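The span structure described above can be sketched as a small data model. This is a simplified illustration, not the data model of Jaeger, Zipkin or OpenTelemetry: the Span fields, the parent_id link and the self_time_ms and bottleneck helpers are assumptions for the example, and the self-time calculation assumes child spans run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    trace_id: str             # shared by every span of one request
    span_id: str              # unique within the trace
    parent_id: Optional[str]  # None for the root span
    service: str
    operation: str
    start_ms: float
    duration_ms: float


def self_time_ms(span: Span, spans: List[Span]) -> float:
    """Time spent in the span itself, excluding its direct children.

    Assumes children execute sequentially, so their durations can be summed.
    """
    children = sum(s.duration_ms for s in spans if s.parent_id == span.span_id)
    return span.duration_ms - children


def bottleneck(spans: List[Span]) -> Span:
    """Return the span with the largest self time in a single trace."""
    return max(spans, key=lambda s: self_time_ms(s, spans))


# One hypothetical request: gateway -> orders -> database.
trace = [
    Span("t1", "a", None, "gateway", "GET /orders", 0, 200),
    Span("t1", "b", "a", "orders", "list_orders", 10, 170),
    Span("t1", "c", "b", "database", "SELECT orders", 20, 130),
]
slowest = bottleneck(trace)
```

In this made-up trace the gateway span is the longest overall (200 ms), but most of that time is spent waiting on its children. Subtracting child durations points at the database query, which is the kind of answer a metric or a single log line cannot give.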
Using All Three Together
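When an alert fires on a metric (e.g., high error rate), you look at logs to find the specific error messages, then use traces to see which service call failed and where the latency spike originated. If log entries carry the trace ID of the request that produced them, you can jump directly from an error message to the matching trace. OpenTelemetry is an open, vendor-neutral standard for generating and collecting all three signal types across different languages and platforms.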
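The workflow above can be sketched with in-memory data. Everything here is hypothetical: the investigate function, the 5% threshold in the usage note, and the dict fields (level, message, trace_id, service, error) stand in for whatever your metrics, logging and tracing backends actually return.

```python
def investigate(error_rate, threshold, logs, spans_by_trace):
    """Walk from a metric alert to error logs to the failing spans.

    logs: list of dicts with "level", "message" and "trace_id" keys.
    spans_by_trace: dict mapping a trace ID to a list of span dicts
    with "service" and "error" keys.
    """
    # Step 1: the metric decides whether there is anything to investigate.
    if error_rate <= threshold:
        return []

    findings = []
    for entry in logs:
        # Step 2: logs supply the specific error messages.
        if entry["level"] != "ERROR":
            continue
        # Step 3: the trace ID on the log entry leads to the spans,
        # which show which service call failed.
        spans = spans_by_trace.get(entry["trace_id"], [])
        failed = [s["service"] for s in spans if s["error"]]
        findings.append({
            "message": entry["message"],
            "trace_id": entry["trace_id"],
            "failed_services": failed,
        })
    return findings
```

For example, calling investigate(0.12, 0.05, logs, spans_by_trace) with one ERROR entry whose trace contains a failed "payments" span would return that message together with "payments" as the failing service. The lookup only works because the log entry and the spans share a trace ID.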