Intermediate Questions
Infrastructure management, deployment strategies, and delivery flows.
What is a WAF and when should you use AWS WAF vs Cloudflare?
A Web Application Firewall (WAF) filters and monitors HTTP traffic to protect against common attacks: SQL injection, XSS, DDoS, bad bots.
AWS WAF: Tight integration with CloudFront, ALB, API Gateway. Managed rule groups for OWASP, AWS managed rules. Good if you’re AWS-native. Can use IP reputation lists and rate-limiting rules.
Cloudflare: Operates at the DNS/edge level before traffic reaches AWS. Better DDoS mitigation due to Cloudflare’s massive global network. Simpler setup. Bot management is more mature.
In practice: Use Cloudflare as the outer layer for DDoS and global edge, then AWS WAF at the ALB for application-layer filtering. Defense in depth.
What is a CVE, and how do you track and remediate vulnerabilities in your infrastructure?
A CVE (Common Vulnerabilities and Exposures) is a public identifier for a known security vulnerability. Most CVEs are also assigned a CVSS severity score from 0 to 10.
Tracking and remediation workflow:
- Discovery: Continuous scanning — Trivy/Snyk in CI for container images, Dependabot for code dependencies, AWS Inspector for EC2.
- Triage: Not all CVEs require immediate action. Prioritize by CVSS score, exploitability, and whether the vulnerable code path is actually used.
- Remediation: Update base image, update dependency, or apply vendor patch.
- Tracking: Log CVEs in your ticketing system with SLA (e.g., Critical = 24h, High = 7 days).
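As a rough illustration of the triage and tracking steps, the sketch below maps a CVSS v3 base score to its qualitative rating and then to a remediation deadline. The function names cvss_severity and remediation_sla are made up for this example. Only the Critical and High deadlines come from the SLA example above; the Medium and Low values are placeholder assumptions.

```shell
# cvss_severity: hypothetical helper mapping a CVSS v3 base score to its rating band.
cvss_severity() {
    awk -v score="$1" 'BEGIN {
        s = score + 0
        if      (s >= 9.0) print "Critical"
        else if (s >= 7.0) print "High"
        else if (s >= 4.0) print "Medium"
        else if (s >  0.0) print "Low"
        else               print "None"
    }'
}

# remediation_sla: hypothetical helper. Critical and High follow the example SLA;
# the Medium and Low deadlines are assumptions for illustration only.
remediation_sla() {
    case $(cvss_severity "$1") in
        Critical) echo "24h" ;;
        High)     echo "7 days" ;;
        Medium)   echo "30 days" ;;
        Low)      echo "90 days" ;;
        *)        echo "no action" ;;
    esac
}

remediation_sla 9.8
remediation_sla 7.5
```

A score alone is not a priority: exploitability and whether the vulnerable code path is reachable should still adjust the deadline.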
Explain file permissions in Linux (rwx, octal notation) and when to use sticky bit/setuid.
Linux file permissions have three sets: owner, group, others. Each can have: read (4), write (2), execute (1).
-rwxr-xr-- = 754
# Owner: rwx (7), Group: r-x (5), Others: r-- (4)
chmod 755 script.sh # Standard executable
chmod 644 config.yml # Standard config file
Special bits:
- Sticky bit (1xxx): On directories (e.g., /tmp), only the file owner can delete their own files: chmod +t /shared
- Setuid (4xxx): File executes with the owner’s permissions (used by /usr/bin/passwd to write /etc/shadow as root). Use with extreme caution.
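The octal mapping described above (read = 4, write = 2, execute = 1, summed per set) can be sketched as a small shell function. perm_to_octal is a hypothetical name used only for this illustration, and the sketch handles plain rwx strings only: it does not interpret the s or t characters that the special bits produce.

```shell
# perm_to_octal: hypothetical helper. Converts a 9-character symbolic
# permission string (owner, group, others) into its octal form.
perm_to_octal() {
    perms=$1
    result=""
    for start in 1 4 7; do
        # Take one rwx triplet at a time: characters 1-3, 4-6, 7-9.
        triplet=$(printf '%s' "$perms" | cut -c"$start"-"$((start + 2))")
        digit=0
        case $triplet in r??) digit=$((digit + 4)) ;; esac
        case $triplet in ?w?) digit=$((digit + 2)) ;; esac
        case $triplet in ??x) digit=$((digit + 1)) ;; esac
        result="$result$digit"
    done
    printf '%s\n' "$result"
}

perm_to_octal rwxr-xr--   # prints 754
```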
How do you write effective Prometheus alerting rules?
Effective Prometheus alerts follow these principles:
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        # Aggregate per service so both sides of the division share the same labels
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (service) (rate(http_requests_total[5m])) > 0.05
        for: 5m # Must be true for 5 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }}"
          runbook: "https://wiki.internal/runbooks/high-error-rate"
Key practices: Use for to avoid alerting on momentary spikes. Always include a runbook link. Use human-readable messages with $labels and $value.
What is Prometheus and how does its pull-based model differ from push-based monitoring?
Prometheus is an open-source metrics monitoring system with a time-series database.
Pull-based (Prometheus): Prometheus actively scrapes metrics from targets at regular intervals. Targets expose a /metrics HTTP endpoint. Benefits: Prometheus controls the scraping schedule, easy to detect if a target is down, no credentials needed on the target side.
Push-based (StatsD, Graphite): Applications push metrics to a central collector. Better for short-lived jobs (like batch scripts) that may end before Prometheus scrapes them. Use Prometheus Pushgateway for these use cases.
What is an SLO, SLA, and SLI, and how do they relate to each other?
SLI (Service Level Indicator): An actual measurement of service behavior. Example: the percentage of successful HTTP requests.
SLO (Service Level Objective): The target for your SLI. Example: 99.9% of requests should succeed over a rolling 30-day window.
SLA (Service Level Agreement): A contractual commitment to the SLO with defined consequences for missing it. Example: If availability drops below 99.9%, AWS credits customers.
In practice: define SLIs → set SLO targets → the SLA is what you promise externally. Your internal error budget is 100% - SLO.
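To make the error budget concrete, the sketch below converts an SLO into minutes of allowed full downtime over a window. It assumes a time-based reading of the budget (every request failing for that long), and error_budget_minutes is a hypothetical helper name.

```shell
# error_budget_minutes: hypothetical helper.
# $1 = SLO as a percentage (e.g. 99.9), $2 = window length in days.
error_budget_minutes() {
    awk -v slo="$1" -v days="$2" 'BEGIN {
        budget_fraction = (100 - slo) / 100
        printf "%.1f\n", budget_fraction * days * 24 * 60
    }'
}

error_budget_minutes 99.9 30   # about 43.2 minutes over 30 days
```

For a request-based SLI the same fraction applies to request counts instead: at 99.9%, one request in every thousand may fail.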
What is the difference between a Prometheus Gauge, Counter, and Histogram metric type?
Counter: A cumulative value that only increases (or resets to zero on restart). Use for: total requests, total errors, bytes sent. Never use for values that can go down.
Gauge: A value that can go up or down. Use for: current memory usage, active connections, queue depth, temperature.
Histogram: Samples observations and counts them in configurable buckets. Use for: request latency, response sizes. Lets you estimate percentiles (p50, p95, p99) from the bucket counts, which is critical for latency SLOs.
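The "only increases or resets" rule for counters is what lets a monitoring system recover the true increase across a restart. The sketch below is a simplified model of that idea, not Prometheus's actual implementation: any drop between consecutive samples is treated as a reset to zero. counter_increase is a hypothetical helper name.

```shell
# counter_increase: hypothetical helper. Takes successive raw counter samples
# as arguments and prints the total increase, treating any drop as a reset.
counter_increase() {
    printf '%s\n' "$@" | awk '
        NR > 1 {
            cur = $1 + 0
            if (cur >= prev) total += cur - prev   # normal growth
            else             total += cur          # reset: counter restarted from zero
        }
        { prev = $1 + 0 }
        END { print total + 0 }
    '
}

counter_increase 100 150 20 60   # 50 + 20 + 40 = 110
```

A gauge cannot be handled this way, because a drop in a gauge is a legitimate value rather than a restart.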
What is AWS ECS and when would you choose it over EKS?
ECS (Elastic Container Service) is AWS’s native container orchestrator. EKS (Elastic Kubernetes Service) is managed Kubernetes.
Choose ECS when:
- Your team is AWS-native and doesn’t have Kubernetes expertise
- You want lower operational overhead (no Kubernetes control plane concepts to manage)
- Tight AWS service integration is a priority (IAM roles per task, ALB integration is simpler)
Choose EKS when:
- You need Kubernetes-native features (CRDs, Operators, Helm ecosystem)
- You have multi-cloud or hybrid requirements
- Your team already has Kubernetes expertise
Explain AWS VPC and its core components (subnets, route tables, IGW, NAT).
A VPC (Virtual Private Cloud) is your isolated network within AWS.
- Subnets: Subdivisions of your VPC in a specific AZ. Public subnets have a route to the IGW; private subnets do not.
- Route Tables: Rules defining where traffic is directed. A public subnet’s route table has 0.0.0.0/0 → IGW.
- Internet Gateway (IGW): Allows public subnets to communicate with the internet.
- NAT Gateway: Allows private subnets to make outbound internet requests (e.g., pulling packages) without exposing them to inbound internet traffic.
What is the difference between an AWS Security Group and a Network ACL?
Security Groups (SGs): Stateful firewalls at the instance level. If you allow inbound traffic, the corresponding outbound response is automatically allowed. Rules are allow-only (no deny rules).
Network ACLs (NACLs): Stateless firewalls at the subnet level. You must explicitly allow both inbound and outbound traffic. Rules are evaluated in order (by rule number) and support both allow and deny.
In practice: Use Security Groups for most use cases. Use NACLs as an additional layer for blocking specific IP ranges (e.g., blocking a bad actor’s IP at the subnet boundary).
How do you handle sensitive values like passwords in Terraform without exposing them in state?
Terraform state files contain sensitive values in plaintext — this is a known limitation. Mitigations:
- Mark as sensitive: sensitive = true on variables and outputs prevents them from appearing in CLI output. The values are still written to state.
- Avoid storing in state: Use AWS Secrets Manager or Vault to generate and store secrets externally. Reference via data source or environment variable; note that a value read through a data source is still recorded in state.
- Encrypt state: S3 backend with server-side encryption (SSE-KMS).
- Restrict access: The S3 bucket containing state should have strict IAM policies — only CI/CD roles should have access.
How do Terraform modules work and what makes a good module?
A Terraform module is a reusable group of resource configurations. Every directory with .tf files is a module. You call modules from a root module to avoid repeating code.
What makes a good module:
- Single responsibility: One module for VPC, another for EKS, another for RDS.
- Parameterized: Accept variables to customize behavior per environment.
- Versioned: Pin module versions in the source attribute.
- Outputs: Expose useful outputs (VPC ID, subnet IDs) for other modules to consume.
What is Terraform state and why must it be stored remotely in a team environment?
Terraform state is a JSON file (terraform.tfstate) that maps your configuration to real-world resources. Terraform uses it to know what already exists before planning changes.
Storing it locally breaks team collaboration:
- Team members would each have different state files causing conflicts
- State file gets lost if the local machine breaks
- No locking mechanism: two engineers could run apply simultaneously and corrupt state
Remote backends (S3 + DynamoDB for locking, GCS, Terraform Cloud) solve all three problems.
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-lock"
  }
}
Explain the concept of a distroless image and its security benefits.
A distroless image contains only your application and its runtime dependencies — no shell, no package manager, no OS utilities. This comes from Google’s distroless project.
Security benefits: There is no shell to exec into, so an attacker who gains a foothold cannot simply run arbitrary commands. The attack surface is dramatically reduced because there are no standard Unix tools an attacker could use to move laterally.
# Distroless multi-stage example
FROM golang:1.22 AS builder
WORKDIR /app
COPY . .
RUN CGO_ENABLED=0 go build -o server .
FROM gcr.io/distroless/static-debian12
COPY --from=builder /app/server /server
CMD ["/server"]
How do you reduce Docker image size? Walk through your optimization strategy.
Image size directly affects pull times and attack surface. Key strategies:
- Use minimal base images: alpine or distroless instead of ubuntu.
- Multi-stage builds: Build in a full image, copy only the binary/artifact to a slim final image.
- Combine RUN commands: Each RUN creates a layer. Chain commands with && and clean up in the same layer.
- Use .dockerignore: Exclude node_modules, .git, test files from the build context.
# Multi-stage example
FROM node:20 AS builder
WORKDIR /app
COPY . .
RUN npm ci && npm run build
FROM node:20-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
CMD ["node", "dist/index.js"]
Explain the difference between a Liveness probe, Readiness probe, and Startup probe.
Liveness Probe: Checks if the container is alive. If it fails, Kubernetes restarts the container. Use this to recover from deadlocks.
Readiness Probe: Checks if the container is ready to serve traffic. If it fails, the pod is removed from Service endpoints (no traffic sent). Use this during slow startup or when temporarily overloaded.
Startup Probe: Only runs at startup. Allows slow-starting containers enough time to initialize before liveness checks begin. Prevents liveness probes from killing a pod that is simply starting up slowly.
What is a Kubernetes Ingress and how does it differ from a Service?
A Service exposes a set of pods internally or as a simple LoadBalancer. An Ingress is a Layer-7 (HTTP/HTTPS) routing rule that sits in front of multiple services and routes traffic based on hostname or path.
Example: Route api.example.com to the api-service and example.com to the frontend-service using a single load balancer IP. This is far more cost-effective than having a separate LoadBalancer service for each microservice.
Real Production Scenarios
Real-world architecture, system migration, and design challenges.
What is Zero Trust Architecture and how does it apply to DevOps?
Zero Trust is a security model based on “never trust, always verify.” Traditional networks trusted everything inside the perimeter. Zero trust assumes the network is already compromised.
Zero Trust principles in DevOps:
- Identity-based access: Every service authenticates. No implicit trust based on network location.
- Least privilege: Minimal permissions for every identity, re-evaluated regularly.
- Micro-segmentation: Kubernetes NetworkPolicies and service meshes with mTLS between every service.
- Device trust: Verify developer machines with fleet management (Jamf, Intune) before allowing access to internal systems.
- Continuous verification: Short-lived credentials. Re-authenticate frequently.
What is a bastion host (jump server) and what are the modern alternatives?
A bastion host is a dedicated, hardened server in a public subnet used as the only entry point for SSH/RDP into private subnet resources. All access is logged and audited.
Modern, better alternatives:
- AWS Systems Manager Session Manager: Shell access to EC2 over HTTPS through the AWS API. No open port 22 required. All sessions logged to CloudWatch/S3. IAM-controlled access.
- Teleport: Open-source access platform with MFA, session recording, and role-based access for SSH, Kubernetes, databases, and web applications.
- Tailscale / WireGuard: Zero-config VPN mesh that avoids exposing any servers publicly.
What is SAST vs DAST and where do they fit in a DevSecOps pipeline?
SAST (Static Application Security Testing): Analyzes source code without executing it. Runs early in CI (on every commit/PR). Tools: Semgrep, SonarQube, Bandit (Python), gosec (Go). Fast, no running application needed.
DAST (Dynamic Application Security Testing): Tests the running application by sending malicious inputs and analyzing responses. Runs against a deployed staging environment. Tools: OWASP ZAP, Burp Suite. Finds runtime vulnerabilities that SAST misses (SQL injection, auth bypass).
DevSecOps pipeline: SAST on PR → build image → Trivy scan → deploy to staging → DAST → promote to prod.
How do you use awk, sed, and grep together to parse log files?
These three tools form the backbone of Linux log analysis:
# grep: Filter lines containing "ERROR"
grep "ERROR" /var/log/app.log
# awk: Extract specific fields (e.g., column 3 of an NGINX access log)
awk '{print $3}' /var/log/nginx/access.log
# sed: Replace or transform text
sed 's/ERROR/CRITICAL/g' app.log
# Combined pipeline: Find ERROR lines, extract IP (field 1), count by IP
grep "ERROR" /var/log/nginx/access.log \
| awk '{print $1}' \
| sort \
| uniq -c \
| sort -rn \
| head -10
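To see what the combined pipeline produces, the sketch below runs it against a few made-up lines. The "<ip> <level> <message>" layout of the sample data is an assumption for the demo (a real NGINX access log records status codes rather than the word ERROR), and top_error_ips is a hypothetical wrapper name.

```shell
# top_error_ips: hypothetical wrapper around the combined pipeline.
top_error_ips() {
    grep "ERROR" "$1" | awk '{print $1}' | sort | uniq -c | sort -rn | head -10
}

# Sample data in an assumed "<ip> <level> <message>" layout.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
10.0.0.1 ERROR upstream timed out
10.0.0.2 INFO request completed
10.0.0.1 ERROR upstream timed out
10.0.0.3 ERROR connection refused
EOF

top_error_ips "$sample_log"   # 10.0.0.1 is listed first with a count of 2
rm -f "$sample_log"
```

The first sort is required because uniq -c only collapses adjacent duplicate lines.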
Write a Bash script to find and delete log files older than 30 days.
#!/bin/bash
# Delete log files older than 30 days in /var/log/myapp
LOG_DIR="/var/log/myapp"
DAYS=30
DRY_RUN=true # Set to false to actually delete
if [ ! -d "$LOG_DIR" ]; then
    echo "Directory $LOG_DIR does not exist" >&2
    exit 1
fi
if [ "$DRY_RUN" = true ]; then
    echo "Dry run: files that would be deleted:"
    find "$LOG_DIR" -type f -name "*.log" -mtime +"$DAYS" -print
else
    echo "Deleting log files older than $DAYS days..."
    find "$LOG_DIR" -type f -name "*.log" -mtime +"$DAYS" -delete
    echo "Done. Current disk usage:"
    df -h "$LOG_DIR"
fi
Always implement a dry run mode. Schedule this with cron or use logrotate for production systems.
What is log aggregation and how do you implement it with the ELK stack?
Log aggregation centralizes logs from all services into one searchable system. The ELK Stack:
- Elasticsearch: Distributed search and analytics engine that indexes and stores logs.
- Logstash: Data processing pipeline that ingests, transforms, and forwards logs.
- Kibana: Web UI for searching, visualizing, and creating dashboards from Elasticsearch data.
Modern replacement: The EFK Stack uses Fluent Bit (lightweight, lower memory than Logstash) as a DaemonSet in Kubernetes to collect container logs and forward to Elasticsearch. Or use Loki (from Grafana Labs) for a simpler, cost-effective log aggregation layer.
How do you structure a Grafana dashboard for a production service?
A well-structured production dashboard follows the USE method (Utilization, Saturation, Errors, for infrastructure resources) or the RED method:
RED (for services):
- Rate: Requests per second
- Errors: Error rate (%)
- Duration: Latency (p50, p90, p99)
Top-level layout: Start with an SLO summary panel so on-call knows immediately if SLO is being violated. Then drill-down panels: per-endpoint breakdown, error log links, infrastructure metrics (CPU, memory). Use variables for environment and service selection.
What is AWS CloudWatch and what are its main components?
CloudWatch is AWS’s native observability service with four main areas:
- Metrics: Time-series data from AWS services (CPU, NetworkIn, etc.) and custom metrics you publish.
- Logs: CloudWatch Logs for storing, searching, and analyzing log data from EC2, Lambda, ECS, etc.
- Alarms: Alerts triggered when metrics exceed thresholds. Can trigger SNS, Auto Scaling, Lambda.
- Dashboards: Visual widgets to display metrics across services in real-time.
For advanced analytics, ship logs to OpenSearch (ELK) or use CloudWatch Logs Insights for SQL-like queries.
How do you reduce AWS costs in a cloud environment? What are your go-to strategies?
Cloud cost optimization is an ongoing practice. High-impact strategies:
- Right-sizing: Use AWS Cost Explorer and Compute Optimizer to identify oversized EC2 instances.
- Reserved Instances/Savings Plans: Commit to 1 or 3 years for stable workloads, saving up to 72%.
- Spot Instances: Use for stateless, fault-tolerant, or batch workloads. Up to 90% savings.
- S3 Lifecycle policies: Auto-transition to cheaper storage tiers.
- Delete idle resources: Audit unused EIPs, old snapshots, unattached EBS volumes.
- Auto Scaling: Scale down to zero or minimum outside business hours.
Explain the Terraform resource lifecycle and meta-arguments like create_before_destroy.
The lifecycle block gives you fine-grained control over how Terraform manages resource replacement:
resource "aws_instance" "web" {
  ami           = "ami-12345"
  instance_type = "t3.medium"
  lifecycle {
    create_before_destroy = true  # New instance created before old one is destroyed
    ignore_changes        = [ami] # Ignore external AMI changes
    prevent_destroy       = true  # Fail any plan that would destroy this resource, including a replacement
  }
}
create_before_destroy is critical for zero-downtime replacements. Without it, Terraform destroys the old resource first, creating a gap in availability.
What are Terraform data sources and how do they differ from resources?
A resource creates, updates, or destroys infrastructure. A data source reads existing infrastructure that is managed outside of your current Terraform code — it is read-only.
# Data source — reads an existing VPC by tag, does not create it
data "aws_vpc" "main" {
  tags = {
    Environment = "production"
  }
}
# Use the data source output
resource "aws_subnet" "app" {
  vpc_id = data.aws_vpc.main.id
  ...
}
Data sources are essential for referencing shared infrastructure managed by a different team or Terraform root module.
What is the purpose of a staging environment and what tests should run there?
Staging is a production-mirror environment used to catch bugs that only appear with real data, full infrastructure, and realistic load — things unit tests can’t surface. Tests to run in staging:
- Integration tests: Real database connections, real API calls to third parties.
- E2E tests: Cypress, Playwright, or Selenium to simulate real user journeys.
- Smoke tests: Quick sanity checks that critical paths work after deployment.
- Performance tests: Load tests with k6 or Locust to catch regressions.
How do you handle database migrations in a CI/CD pipeline without downtime?
Database migrations are one of the riskiest parts of deployment. The golden rule: migrations must be backward-compatible because during a rolling deploy, old code and new code run simultaneously.
Safe migration checklist:
- Never: Rename or drop a column in the same deploy that uses the new name.
- Step 1: Add new column (nullable, backward-compatible).
- Step 2: Deploy code that writes to both old and new columns.
- Step 3: Migrate existing data.
- Step 4: Deploy code using only the new column.
- Step 5: Drop the old column.
How do you speed up slow CI pipelines?
Slow pipelines kill developer productivity. Key optimizations:
- Caching: Cache dependencies (node_modules, pip packages, Go modules) between runs.
- Parallelism: Split test suites and run jobs in parallel.
- Test selection: Only run tests affected by the changed code.
- Optimized Docker builds: Use layer caching and BuildKit.
- Self-hosted runners: Eliminate queue time and use faster hardware.
- Fail fast: Run linting and unit tests first; integration tests only if those pass.
What is GitOps and how does it differ from traditional CI/CD?
Traditional CI/CD: The pipeline has credentials and directly pushes deployments to environments (push-based).
GitOps: Git is the single source of truth for the desired state of your infrastructure and applications. An agent running in the cluster (like ArgoCD or Flux) continuously reconciles the actual state with the desired state in Git (pull-based).
Benefits of GitOps: Drift detection, audit trail in Git history, easy rollback (git revert), no cluster credentials needed in CI.
How do you implement automated rollback in a deployment pipeline?
Automated rollback is triggered when post-deployment health checks fail. A robust implementation:
- Health check gate: After deployment, poll the health endpoint for 2-3 minutes.
- Metric thresholds: Monitor error rate and p99 latency for 5 minutes post-deploy.
- Rollback trigger: If error rate exceeds a threshold, automatically re-deploy the previous image tag.
# Generic shell rollback logic
# deploy, health_check_passes, and alert_pagerduty are placeholders for your own tooling
NEW_VERSION="v2.0"
PREV_VERSION="v1.9"
deploy "$NEW_VERSION"
if ! health_check_passes; then
    echo "Rollback triggered"
    deploy "$PREV_VERSION"
    alert_pagerduty "Automatic rollback executed"
fi
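One way the health check gate could be written is as a polling helper that fails on the first bad check. This is a sketch under assumptions: stays_healthy is a hypothetical name, the gate requires every poll to pass (a service that needs warm-up time would want an initial delay first), and the URL in the usage comment is a placeholder.

```shell
# stays_healthy: hypothetical helper.
# $1 = number of polls, $2 = seconds between polls, remaining args = check command.
# Returns 0 only if the check command succeeds on every poll.
stays_healthy() {
    polls=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$polls" ]; do
        if ! "$@"; then
            return 1
        fi
        if [ "$i" -lt "$polls" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 0
}

# Usage sketch: 18 polls, 10 seconds apart, is roughly three minutes.
# stays_healthy 18 10 curl -fsS -o /dev/null https://example.internal/healthz
```

Passing the check as a command keeps the gate independent of how health is measured, so the same helper can wrap an HTTP probe or a metrics query.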
What is the difference between a Blue/Green deployment and a Canary deployment?
Blue/Green: You maintain two identical environments. “Blue” is live, “Green” has the new version. You switch all traffic from Blue to Green at once. Rollback is instant — just switch back. Downside: doubles infrastructure cost.
Canary: You gradually shift traffic from the old version to the new one — e.g., 5% → 25% → 50% → 100%. You analyze metrics and errors at each stage. Slower but safer for catching issues that only appear under real production load.
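The staged traffic shift of a canary rollout can be sketched as a loop that raises the weight and checks health at each step. Everything named here is hypothetical: run_canary, and the two commands passed to it, stand in for whatever your load balancer or service mesh and your metrics system provide.

```shell
# run_canary: hypothetical helper.
# $1 = command that sets the canary traffic percentage (called with the number)
# $2 = command that returns 0 while metrics look healthy
run_canary() {
    weight_cmd=$1
    check_cmd=$2
    for weight in 5 25 50 100; do
        "$weight_cmd" "$weight"
        if ! "$check_cmd"; then
            "$weight_cmd" 0   # send all traffic back to the old version
            return 1
        fi
    done
}

# Demo stubs so the sketch runs on its own; replace them with real tooling.
set_weight() { echo "canary weight: $1%"; }
metrics_healthy() { true; }

run_canary set_weight metrics_healthy
```

In a real rollout the health check would also wait out an observation period at each stage before returning.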
How do you implement secret management in a GitHub Actions pipeline?
Never hardcode secrets in your pipeline files. GitHub Actions provides an encrypted Secrets store:
- Go to Repository Settings → Secrets and Variables → Actions → New Repository Secret.
- Reference in your workflow: ${{ secrets.MY_SECRET }}
- name: Deploy to AWS
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  run: aws s3 sync ./dist s3://my-bucket
For more advanced use cases, use OIDC to get short-lived tokens from AWS/GCP instead of storing static credentials.
How do Docker volumes differ from bind mounts?
Docker Volumes are managed by Docker, stored in /var/lib/docker/volumes/, and are the recommended way to persist data. They are portable, easy to back up, and work well with Docker Compose.
Bind Mounts map a specific host path directly into the container. They are useful in development to sync source code in real-time but are host-dependent and harder to manage in production.
# Volume (recommended for production)
docker run -v mydata:/app/data myapp
# Bind mount (recommended for development)
docker run -v "$(pwd)/src:/app/src" myapp
Explain Docker layer caching and how it impacts build speed.
Docker builds images layer by layer. If a layer hasn’t changed since the last build, Docker reuses the cached version. The trick is layer ordering:
Bad: COPY all files first, then run npm install. Any code change invalidates the npm install cache.
Good: COPY package.json first, run npm install, then COPY the rest of the source. Dependency installation only re-runs when package.json changes.
# Optimized layer order
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build
What is the purpose of a PodDisruptionBudget (PDB) in Kubernetes?
A PodDisruptionBudget limits how many pods of a deployment can be unavailable simultaneously during voluntary disruptions like node drains, cluster upgrades, or cluster autoscaler scale-down.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api
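Without a PDB, a cluster upgrade could drain multiple nodes simultaneously and take down your entire service. With minAvailable: 2, Kubernetes refuses any voluntary eviction that would leave fewer than 2 pods available. It does not protect against involuntary disruptions such as a node crash.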
How does the Kubernetes Horizontal Pod Autoscaler (HPA) work?
HPA automatically scales the number of pod replicas based on observed metrics. The default metric is CPU utilization, but it also supports memory and custom metrics via the Metrics API.
kubectl autoscale deployment my-app --min=2 --max=10 --cpu-percent=60
The HPA controller checks metrics every 15 seconds (default) and adjusts replicas to maintain the target. For custom metrics, you can integrate tools like KEDA (Kubernetes Event-Driven Autoscaling) which can scale based on Kafka lag, SQS queue depth, and more.
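The core scaling calculation described in the Kubernetes documentation is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). The sketch below reproduces it with integer arithmetic. hpa_desired_replicas is a hypothetical helper, and it deliberately ignores the controller's tolerance band and the min/max replica bounds.

```shell
# hpa_desired_replicas: hypothetical helper.
# $1 = current replicas, $2 = current metric value, $3 = target metric value
# (metric values as whole numbers, e.g. CPU utilization percentages).
hpa_desired_replicas() {
    current_replicas=$1
    current_metric=$2
    target_metric=$3
    # Integer ceiling division: (a + b - 1) / b
    echo $(( (current_replicas * current_metric + target_metric - 1) / target_metric ))
}

hpa_desired_replicas 4 90 60   # 4 * 90 / 60 = 6 replicas
```

With the --cpu-percent=60 target above, 4 pods averaging 90% CPU would be scaled to 6, and 4 pods averaging 30% would be scaled down to 2.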
How do services in different namespaces communicate in Kubernetes?
All services in a Kubernetes cluster are reachable via DNS using the Fully Qualified Domain Name (FQDN):
<service-name>.<namespace>.svc.cluster.local
For example, a service named postgres in the production namespace is reachable at postgres.production.svc.cluster.local from any pod in any namespace. If NetworkPolicies are in place, you must explicitly allow cross-namespace traffic.
What is the difference between a StatefulSet and a Deployment?
Use a Deployment for stateless workloads (web servers, APIs) where any Pod is interchangeable. Use a StatefulSet for stateful workloads like databases that need:
- Stable, predictable network identities (pod-0, pod-1, etc.)
- Ordered, graceful deployment and scaling
- Stable persistent storage linked to each pod individually
Common examples: Kafka, ZooKeeper, Cassandra, PostgreSQL replicas.
How do you perform a zero-downtime rolling update in Kubernetes?
Kubernetes Deployments support RollingUpdate strategy by default. The key is configuring maxSurge and maxUnavailable correctly alongside working readiness probes.
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
With maxUnavailable: 0, Kubernetes will never take down an old Pod until the new one is ready (as determined by its readiness probe). Combined with graceful shutdown handling in the application, this avoids downtime during the rollout.
What is a ‘StatefulSet’ and when should you use it over a ‘Deployment’ in Kubernetes?
A StatefulSet is used for stateful applications that require unique, persistent identities and stable network identifiers. Unlike Deployments, which are for stateless pods, StatefulSets manage pods that are not interchangeable and have sticky identities.
Troubleshooting Scenarios
Live system debugging, incident diagnostics, and latency resolution.
How do you troubleshoot disk space issues on a Linux server?
Systematic disk investigation:
# Step 1: Check overall disk usage
df -h
# Step 2: Find which directory is consuming space
du -sh /* 2>/dev/null | sort -rh | head -20
# Step 3: Drill down into the problem directory
du -sh /var/* | sort -rh | head -10
# Step 4: Find specific large files
find / -type f -size +500M 2>/dev/null
# Step 5: Check for deleted-but-open files still consuming disk space
lsof | grep deleted
Common causes: application logs not rotating, large core dumps, MySQL binary logs or Postgres WAL piling up, old Docker images/volumes.
How do you troubleshoot high CPU usage on a Linux server?
Systematic CPU investigation:
- top / htop: Identify the process consuming CPU. Note: is it user space or kernel (%us vs %sy)?
- ps aux --sort=-%cpu: Snapshot of top CPU consumers.
- perf top: See which functions (kernel and user space) are hot.
- strace -p <PID>: Trace system calls to understand what a process is doing.
- vmstat 1: Observe context switches (cs) and interrupts (in).
Common causes: runaway application bug, CPU-intensive query (full table scan), kernel work from high I/O (softirqs), insufficient CPU for the workload.
What are dangling Docker images and how do you clean them up?
Dangling images are layers that have no associated tag — they appear as <none>:<none> in docker images. They accumulate over time from rebuilds and waste disk space.
# List dangling images
docker images -f dangling=true
# Remove all dangling images
docker image prune
# Nuclear option — remove all unused images, containers, networks, volumes
docker system prune -a --volumes
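In CI/CD pipelines with long-lived build agents, run docker system prune -f as a post-step or on a schedule to keep agents clean. Note that it also discards unused build cache, so the next build may start cold.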