How do you troubleshoot high memory usage causing OOMKilled events in production?
When a container exceeds its memory limit, the kernel OOM killer terminates it and Kubernetes reports the termination reason as OOMKilled. Steps to resolve:
- Identify: Run `kubectl describe pod <pod>` and look for `Reason: OOMKilled` in Last State.
- Profile: Use `kubectl top pod` or Prometheus/Grafana to understand actual memory usage patterns.
- Fix: Either increase limits if the app genuinely needs more memory, or find and fix the memory leak in the application code.
- Prevent: Set up PrometheusRule or Datadog alerts to notify before a pod hits its limit.
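
To make the "notify before a pod hits its limit" idea concrete, here is a minimal Python sketch of the comparison an alert would perform: usage divided by limit, checked against a threshold. The function names `parse_memory` and `near_limit` and the 90% threshold are assumptions chosen for illustration, not part of any Kubernetes or Prometheus API. The parser covers only the common binary and decimal suffixes (such as the `Mi` values printed by `kubectl top pod`), not the full Kubernetes quantity grammar.

```python
# Hypothetical helpers: flag a container whose memory usage is close to its limit.
# Only common suffixes are handled; this is not a full Kubernetes quantity parser.
_UNITS = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,  # binary suffixes
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,      # decimal suffixes
}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '512Mi' to a number of bytes."""
    quantity = quantity.strip()
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(float(quantity))  # no suffix: value is already in bytes


def near_limit(usage: str, limit: str, threshold: float = 0.9) -> bool:
    """Return True when usage is at or above `threshold` of the limit.

    The 0.9 default is an assumed value; tune it to the workload.
    """
    return parse_memory(usage) >= threshold * parse_memory(limit)


if __name__ == "__main__":
    # Example values of the kind shown by `kubectl top pod` and the pod spec.
    print(near_limit("480Mi", "512Mi"))
    print(near_limit("300Mi", "512Mi"))
```

In practice the same ratio would usually be expressed as an alerting rule in the monitoring system rather than in a script; the sketch only shows the arithmetic behind such a rule.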