Observability Interview Questions

Master Observability with these real-world interview questions and answers.

Beginner Questions

Core concepts, syntax, and foundational command-line knowledge.

Easy Associate Level Observability

What is the difference between monitoring and observability?

Monitoring is about tracking known failure modes. You define metrics and alerts for things you know can go wrong. It answers: “Is this thing I’m watching broken?”

Observability is about understanding system behavior from its outputs. It allows you to answer questions you didn’t think to ask beforehand — debugging novel failures you’ve never seen before.

Monitoring tells you something is wrong. Observability tells you why. You need both, but as systems grow more complex, observability becomes more critical for understanding emergent failures.

Easy Associate Level Observability

What are the three pillars of observability?

The three pillars of observability are:

Metrics: Numerical measurements aggregated over time (CPU usage, request rate, error rate). Good for dashboards and alerting on trends.
Logs: Timestamped records of discrete events. Good for debugging specific incidents and understanding what happened.
Traces: Records of a request’s journey through a distributed system. Essential for finding bottlenecks and understanding service dependencies in microservices.

Together they answer: Is something wrong? (metrics), What is wrong? (logs), Where and why is it wrong? (traces).

Intermediate Questions

Infrastructure management, deployment strategies, and delivery flows.

Medium Senior Level Observability System Design

What is the difference between metrics, logs, and traces in observability?

Observability is the ability to understand the internal state of a system by examining its external outputs. The three pillars of observability are metrics, logs, and traces.

Metrics

Metrics are numerical measurements collected over time. They represent the current state or behavior of a system in an aggregated form. Examples: CPU usage percentage, request count per second, error rate, memory usage, p99 latency.

Metrics are best for: Dashboards and alerting on system health. Detecting anomalies and trends over time. Capacity planning. Tools: Prometheus, Datadog, CloudWatch, Grafana (visualization).

Logs

Logs are timestamped records of discrete events that occurred in a system. They provide detailed context about what happened and when. Examples: Application error messages, HTTP access logs, audit trails, debug output.

Logs are best for: Debugging specific errors or incidents. Audit trails for compliance. Understanding event sequences. Tools: ELK Stack (Elasticsearch, Logstash, Kibana), Loki, Splunk, CloudWatch Logs.

Traces

Traces follow a single request as it flows through distributed services, capturing the path and timing of each operation. A trace consists of spans – individual units of work with start time and duration. Trace IDs link all spans of a single request across services.

Traces are best for: Identifying bottlenecks in distributed systems. Understanding service dependencies. Debugging latency issues across microservices. Tools: Jaeger, Zipkin, AWS X-Ray, OpenTelemetry.

Using All Three Together

When an alert fires on a metric (e.g., high error rate), you look at logs to find the specific error messages, then use traces to see which service call failed and where the latency spike originated. OpenTelemetry is the open standard for collecting all three signal types across different languages and platforms.

Medium Senior Level Observability

How do you write effective Prometheus alerting rules?

Effective Prometheus alerting rules share a common shape. A representative rule:

groups:
- name: api-alerts
  rules:
  - alert: HighErrorRate
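    expr: |
      # Aggregate both sides so the label sets match; dividing the raw
      # series would pair each 5xx series with itself and always yield 1.
      sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
      / sum by (service) (rate(http_requests_total[5m])) > 0.05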
    for: 5m  # Must be true for 5 minutes before firing
    labels:
      severity: critical
    annotations:
      summary: "High error rate on {{ $labels.service }}"
      description: "Error rate is {{ $value | humanizePercentage }}"
      runbook: "https://wiki.internal/runbooks/high-error-rate"

Key practices: Use the for clause so that momentary spikes do not fire the alert. Always include a runbook link. Write human-readable annotations using $labels and $value.
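To make the effect of the for clause concrete, here is a minimal Python sketch of the rule's logic. It is a simplification, not how Prometheus is implemented: the names Sample, error_ratio, and alert_states are hypothetical, and it assumes one rule evaluation per minute so that for: 5m corresponds to five elapsed evaluation intervals.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    total: float   # cumulative count of all requests at scrape time
    errors: float  # cumulative count of 5xx responses at scrape time


def error_ratio(older: Sample, newer: Sample) -> float:
    """Share of requests that failed between two scrapes of the counters."""
    delta_total = newer.total - older.total
    if delta_total <= 0:
        return 0.0
    return (newer.errors - older.errors) / delta_total


def alert_states(ratios, threshold=0.05, for_evaluations=5):
    """Return 'inactive', 'pending', or 'firing' for each evaluation.

    The alert is pending while the condition holds and fires only once it
    has held for `for_evaluations` intervals after the first true result.
    """
    states = []
    consecutive = 0
    for ratio in ratios:
        consecutive = consecutive + 1 if ratio > threshold else 0
        if consecutive == 0:
            states.append("inactive")
        elif consecutive - 1 < for_evaluations:
            states.append("pending")
        else:
            states.append("firing")
    return states
```

A single spike resets to inactive as soon as the ratio drops, while a sustained breach moves from pending to firing on the sixth consecutive evaluation.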

Medium Senior Level Observability

What is Prometheus and how does its pull-based model differ from push-based monitoring?

Prometheus is an open-source metrics monitoring system with a time-series database.

Pull-based (Prometheus): Prometheus actively scrapes metrics from targets at regular intervals. Targets expose a /metrics HTTP endpoint. Benefits: Prometheus controls the scraping schedule, a failed scrape makes it easy to detect that a target is down, and targets do not need the address of, or credentials for, a central collector.

Push-based (StatsD, Graphite): Applications push metrics to a central collector. Better for short-lived jobs (like batch scripts) that may end before Prometheus scrapes them. Use Prometheus Pushgateway for these use cases.
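A minimal sketch of the pull side, using only the Python standard library: the service renders its counters in the Prometheus text exposition format and serves them at /metrics for a scraper to fetch. The metric name demo_requests_total, the REQUESTS_TOTAL dictionary, and the port are assumptions for illustration; a real service would normally use an official client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters, keyed by HTTP method.
REQUESTS_TOTAL = {"GET": 0, "POST": 0}


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP demo_requests_total Total requests handled.",
        "# TYPE demo_requests_total counter",
    ]
    for method, value in sorted(counters.items()):
        lines.append(f'demo_requests_total{{method="{method}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; the scraper pulls it on its own schedule.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS_TOTAL).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrape requests on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() and pointing a scrape job at the host and port is all the target has to do; it never needs to know where the monitoring server lives.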

Medium Senior Level Observability

What is an SLO, SLA, and SLI, and how do they relate to each other?

SLI (Service Level Indicator): An actual measurement of service behavior. Example: the percentage of successful HTTP requests.

SLO (Service Level Objective): The target for your SLI. Example: 99.9% of requests should succeed over a rolling 30-day window.

SLA (Service Level Agreement): A contractual commitment to a service level, usually set looser than the internal SLO, with defined consequences for missing it. Example: If availability drops below 99.9%, AWS credits customers.

In practice: define SLIs → set SLO targets → the SLA is what you promise externally. Your internal error budget is 100% - SLO.
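The relationship can be sketched in a few lines of Python. The function names and the request counts in the usage note are hypothetical; the arithmetic follows the definitions above (error budget = 100% - SLO).

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that succeeded in the measurement window."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = 1.0 - slo   # allowed failure ratio, e.g. 0.001 for a 99.9% SLO
    spent = 1.0 - sli    # observed failure ratio
    return 1.0 - spent / budget


def slo_met(sli: float, slo: float) -> bool:
    """True while the measured SLI is at or above the target."""
    return sli >= slo
```

For example, 999,500 successes out of 1,000,000 requests gives an SLI of 99.95%; against a 99.9% SLO that meets the objective with roughly half of the error budget left.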

Medium Senior Level Observability

What is the difference between a Prometheus Gauge, Counter, and Histogram metric type?

Counter: A cumulative value that only increases (or resets to zero on restart). Use for: total requests, total errors, bytes sent. Never use for values that can go down.

Gauge: A value that can go up or down. Use for: current memory usage, active connections, queue depth, temperature.

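Histogram: Samples observations and counts them in configurable buckets. Use for: request latency, response sizes. Lets you estimate percentiles (p50, p95, p99) from the bucket counts with histogram_quantile(), which is critical for latency SLOs. The estimate is only as precise as the bucket boundaries.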
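The three types can be illustrated with a small stdlib-only Python sketch. These classes are hypothetical teaching models, not the prometheus_client API; the quantile method uses linear interpolation inside a bucket, which is similar in spirit to how percentiles are estimated from bucketed data.

```python
import bisect


class Counter:
    """Cumulative value that may only increase."""
    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Value that can go up or down."""
    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations in buckets defined by upper bounds."""
    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is +Inf
        self.total = 0

    def observe(self, value):
        # Index of the first bound >= value, i.e. "less than or equal" buckets.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += 1

    def quantile(self, q):
        """Estimate a quantile by interpolating within the matching bucket."""
        if self.total == 0:
            return float("nan")
        rank = q * self.total
        cumulative, lower = 0, 0.0
        for bound, count in zip(self.bounds, self.counts):
            if count and cumulative + count >= rank:
                return lower + (bound - lower) * (rank - cumulative) / count
            cumulative += count
            lower = bound
        return self.bounds[-1]  # rank falls in +Inf: report highest finite bound
```

Because the histogram stores only bucket counts, the quantile is an estimate; choosing bucket bounds near the latency target keeps that estimate useful.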

Real Production Scenarios

Real-world architecture, system migration, and design challenges.

Medium Senior Level Observability

What is log aggregation and how do you implement it with the ELK stack?

Log aggregation centralizes logs from all services into one searchable system. The ELK Stack:

Elasticsearch: Distributed search and analytics engine that indexes and stores logs.
Logstash: Data processing pipeline that ingests, transforms, and forwards logs.
Kibana: Web UI for searching, visualizing, and creating dashboards from Elasticsearch data.

Modern alternatives: The EFK Stack replaces Logstash with Fluentd or Fluent Bit. Fluent Bit (lightweight, lower memory than Logstash) is commonly run as a DaemonSet in Kubernetes to collect container logs and forward them to Elasticsearch. Loki (from Grafana Labs) is another option for a simpler, cost-effective log aggregation layer.
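Whichever collector is used, aggregation works best when services emit structured logs. The following is a minimal sketch with Python's standard logging module that writes one JSON object per line, a format log shippers can typically parse without custom patterns. The field names (service, trace_id) and the helper build_logger are assumptions for illustration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as a single-line JSON object."""
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Optional context passed through logging's `extra` argument.
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(name, stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

In a container, passing sys.stdout as the stream lets a node-level collector pick the lines up; including a trace ID in each event makes it possible to jump from a log line to the matching trace.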

Hard Lead / Architect Level Observability

What is distributed tracing and how do you implement it with OpenTelemetry?

In a microservices architecture, a single user request can touch dozens of services. Distributed tracing follows that request across all services, recording timing and metadata at each step.

OpenTelemetry (OTel) is the vendor-neutral standard for instrumentation. Implementation:

Add the OTel SDK to each service.
With auto-instrumentation or the SDK's propagators enabled, services pass a W3C traceparent header in HTTP calls, linking all spans into one trace.
A collector (OTel Collector) receives spans and routes them to your backend (Jaeger, Zipkin, Tempo, Datadog).
You can now visualize the full request path, identify slow spans, and pinpoint errors.
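The propagation step above can be sketched without any SDK. This stdlib-only Python example generates and forwards a traceparent header in the W3C Trace Context format (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags). It illustrates the mechanism only and is not the OpenTelemetry API; the function names are hypothetical.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace: fresh trace ID and root span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, flags), or None if invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are not valid
    return trace_id, parent_id, flags


def outgoing_headers(incoming_headers):
    """Build headers for a downstream call, continuing the caller's trace."""
    parsed = parse_traceparent(incoming_headers.get("traceparent", ""))
    if parsed is None:
        return {"traceparent": new_traceparent()}
    trace_id, _parent_id, flags = parsed
    # Same trace ID, new span ID: this service's span becomes the parent.
    return {"traceparent": f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"}
```

Every hop keeps the trace ID and replaces the span ID, which is what lets the backend stitch spans from different services into one request path.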

Medium Senior Level Observability

How do you structure a Grafana dashboard for a production service?

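A well-structured production dashboard follows the USE methodology (Utilization, Saturation, Errors) for resources such as hosts and disks, or the RED methodology for request-driven services: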

RED (for services):

Rate: Requests per second
Errors: Error rate (%)
Duration: Latency (p50, p90, p99)

Top-level layout: Start with an SLO summary panel so on-call knows immediately whether the SLO is being violated. Then add drill-down panels: per-endpoint breakdown, error log links, infrastructure metrics (CPU, memory). Use variables for environment and service selection.
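To show what the RED panels actually compute, here is a small Python sketch that derives rate, error percentage, and latency percentiles from raw request records. In practice these values come from queries against the metrics backend; the RequestRecord type, the nearest-rank percentile method, and the choice to count only 5xx responses as errors are assumptions for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    duration_s: float  # request latency in seconds
    status: int        # HTTP status code


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (pct is 1..100)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def red_summary(requests, window_s):
    """Compute Rate, Errors, and Duration over one time window."""
    if not requests:
        return {"rate_rps": 0.0, "error_pct": 0.0,
                "p50_s": None, "p90_s": None, "p99_s": None}
    durations = sorted(r.duration_s for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_rps": len(requests) / window_s,
        "error_pct": 100 * errors / len(requests),
        "p50_s": percentile(durations, 50),
        "p90_s": percentile(durations, 90),
        "p99_s": percentile(durations, 99),
    }
```

Showing p50 alongside p90 and p99 matters because averages hide the slow tail that users actually notice.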

Hard Lead / Architect Level Observability

How do you avoid alert fatigue in a large-scale microservices environment?

Alert fatigue happens when teams receive too many alerts, many of which are noise. Engineers start ignoring them — including real critical ones.

Strategies to combat it:

Symptom-based alerting: Alert on user-facing symptoms (error rate, latency) not causes (CPU high). CPU high does not always mean users are impacted.
Actionable alerts only: Every alert must have a clear runbook. If there’s no action to take, it shouldn’t be an alert.
SLO-based alerting: Alert when you’re burning through your error budget too fast.
Regular alert audits: Review and delete alerts that consistently fire without requiring action.
Severity tiers: P1 wakes someone up. P3 creates a ticket. Many alerts should be P3.
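Alerting on error-budget burn, as listed above, can be sketched as a burn-rate check. Burn rate is the observed error ratio divided by the ratio the SLO allows; a burn rate of 1 spends the budget exactly over the SLO window. The sketch assumes a 99.9% SLO and a paging threshold of 14.4 (about 2% of a 30-day budget consumed in one hour), checked over a long and a short window; these are commonly cited starting values, not requirements.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'sustainable' the budget is being spent."""
    return error_ratio / (1.0 - slo)


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only if both windows exceed the burn-rate threshold.

    The long window (for example 1h) shows the burn is significant; the
    short window (for example 5m) shows it is still happening right now.
    """
    return (burn_rate(long_window_ratio, slo) >= threshold
            and burn_rate(short_window_ratio, slo) >= threshold)
```

Requiring both windows keeps the alert from paging for an incident that has already recovered, which removes a common source of noise.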

Troubleshooting Scenarios

Live system debugging, incident diagnostics, and latency resolution.

Hard Lead / Architect Level Observability

How do you implement on-call rotation and incident response in an SRE team?

A mature on-call process has these elements:

Schedules: PagerDuty or OpsGenie for rotating on-call assignments with escalations.
Runbooks: Every alert links to a runbook with investigation steps and common resolutions.
Severity levels: P1 (major outage, wake anyone up) → P4 (low impact, business hours only).
Incident channels: Dedicated Slack channel per incident. Assign Incident Commander, Communications Lead roles.
Postmortems: Blameless postmortem for every P1/P2. Focus on system improvements, not blaming individuals.
On-call health: Track toil. If engineers are getting paged more than 2-3 times per shift, the alert quality needs improvement.

Easy Associate Level Observability

What is an error budget and how do SRE teams use it?

An error budget is the allowable amount of unreliability in a service, derived from the SLO. If your SLO is 99.9% uptime, your error budget is 0.1%, which is about 43 minutes of downtime in a 30-day month.

How teams use it:

When error budget is healthy → deploy freely, take risks, ship features.
When error budget is low → slow down deployments, prioritize reliability work.
When budget is exhausted → freeze all non-critical deployments until reliability improves.

Error budgets create a shared language between product (wants to ship) and SRE (wants reliability). They make the trade-off objective rather than political.
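The downtime figure and the policy above can be expressed as a short Python sketch. The function names and the 25% "low budget" watermark are assumptions; each team sets its own thresholds in its error budget policy.

```python
def downtime_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an uptime SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def deployment_policy(budget_remaining: float, low_watermark: float = 0.25) -> str:
    """Map the unspent budget fraction (1.0 = untouched) to a release policy."""
    if budget_remaining <= 0:
        return "freeze non-critical deployments"
    if budget_remaining < low_watermark:
        return "slow down and prioritize reliability work"
    return "deploy freely"
```

For a 99.9% SLO over 30 days this yields 43.2 minutes, matching the figure quoted above.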
