All DevOps Interview Questions

Browse our comprehensive question bank. Updated regularly with real interview scenarios.

Advanced Questions

Enterprise orchestration, deep architectural concepts, and scaling issues.

Hard Lead / Architect Level System Design
Q:

Explain the OWASP Top 10 and which items are most relevant to DevOps engineers.

The OWASP Top 10 (2021 edition) lists the most critical web application security risks. Most relevant to DevOps:

  • A01: Broken Access Control. Enforce least privilege in IAM and K8s RBAC. Verify RBAC policies in code review.
  • A05: Security Misconfiguration. Public S3 buckets, default credentials, exposed management ports. Caught by infrastructure scanning tools like Checkov and tfsec.
  • A06: Vulnerable and Outdated Components. Use Dependabot and Trivy to catch outdated dependencies with known CVEs.
  • A09: Security Logging and Monitoring Failures. Ensure CloudTrail, K8s audit logs, and application audit logs are enabled and shipped to a SIEM.
Hard Lead / Architect Level Linux
Q:

What is a Load Average in Linux and how do you interpret it?

Load average in top or uptime shows three numbers: 1-minute, 5-minute, and 15-minute averages of the number of processes in a runnable or uninterruptible state.

Interpretation depends on the number of CPU cores. On a 4-core server:

  • Load average of 4.0 = 100% utilization — every CPU busy but nothing waiting
  • Load average of 8.0 = 200% utilization — 4 CPUs busy, 4 processes waiting in queue
  • Load average of 0.5 = 12.5% utilization — plenty of headroom

Key insight: High load average is NOT always CPU. Uninterruptible sleep (disk I/O wait) also counts. Check iostat to distinguish CPU saturation from I/O saturation.
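
The per-core arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a monitoring tool: the function names and the 0.7 "headroom" cutoff are assumptions chosen for the example, and os.getloadavg() is only available on Unix-like systems.

```python
import os


def load_per_core(load_avg: float, cores: int) -> float:
    """Normalize a load average by core count (1.0 means every core is busy)."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg / cores


def describe(load_avg: float, cores: int) -> str:
    """Classify a load average; the 0.7 cutoff is an assumed rule of thumb."""
    ratio = load_per_core(load_avg, cores)
    if ratio < 0.7:
        return "headroom"
    if ratio <= 1.0:
        return "busy"
    return "saturated"


def current_load_per_core() -> tuple:
    """Read the live 1/5/15-minute averages (Unix only) and normalize them."""
    cores = os.cpu_count() or 1
    return tuple(load_per_core(value, cores) for value in os.getloadavg())
```

A "saturated" result only says that work is queuing. It does not say whether the queue is waiting on CPU or on disk, which is why iostat is still needed.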

Hard Lead / Architect Level Linux
Q:

What are Linux namespaces and cgroups, and how do they enable container isolation?

Namespaces provide isolation for system resources so each container sees its own view of the system:

  • pid — isolated process tree (container sees its own PIDs starting at 1)
  • net — isolated network stack (own IP, routing table)
  • mnt — isolated filesystem mounts
  • uts — isolated hostname
  • user — isolated user/group IDs

cgroups (Control Groups) limit and account for resource usage (CPU, memory, I/O) per group of processes. This is how Docker enforces your CPU/memory limits.

Together: namespaces provide isolation (what can be seen), cgroups provide resource limits (how much can be used).

Hard Lead / Architect Level AWS
Q:

How do you implement least-privilege IAM policies and why is it critical?

Least-privilege means granting only the exact permissions needed to perform a task — no more. This limits blast radius if credentials are compromised.

Implementation steps:

  1. Start with deny-all, add allows: Begin with minimal permissions and add only what’s needed.
  2. IAM Access Analyzer: Use to identify unused permissions and generate least-privilege policies based on CloudTrail logs.
  3. Policy conditions: Add StringEquals conditions to restrict resources by tag, region, or account.
  4. Permission boundaries: Cap the maximum permissions a principal can have, even if attached policies are more permissive.
"Condition": {
  "StringEquals": {
    "aws:RequestedRegion": "us-east-1"
  }
}
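
One way to audit an existing policy against the steps above is to flag Allow statements that rely on wildcards. The sketch below is a minimal lint under stated assumptions: the function name find_overbroad_statements and its rules are illustrative and not an AWS API, and IAM Access Analyzer remains the authoritative tool.

```python
from typing import Any


def _as_list(value: Any) -> list:
    """IAM allows a single value or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_overbroad_statements(policy: dict) -> list:
    """Return Allow statements with wildcard actions or unconditioned '*' resources.

    Assumed rules for this sketch:
      - an action of "*" or "service:*" is always flagged
      - a resource of "*" is flagged unless the statement has a Condition
    """
    findings = []
    for stmt in _as_list(policy.get("Statement")):
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action"))
        resources = _as_list(stmt.get("Resource"))
        wildcard_action = any(a == "*" or a.endswith(":*") for a in actions)
        wildcard_resource = "*" in resources
        if wildcard_action or (wildcard_resource and "Condition" not in stmt):
            findings.append(stmt)
    return findings
```

Running a check like this in code review catches the most common regression, a policy widened to "*" to make an error go away.
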
Hard Lead / Architect Level Terraform
Q:

What are Terraform providers and how do you handle provider version pinning?

Providers are plugins that translate Terraform configuration into API calls to AWS, GCP, Azure, etc. Always pin provider versions to prevent unexpected changes from provider upgrades:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"  # Allows 5.x but not 6.x
    }
  }
  required_version = ">= 1.7.0"
}

provider "aws" {
  region = "us-east-1"
}

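Running terraform init records the selected provider versions and checksums in .terraform.lock.hcl. Use terraform providers lock to pre-populate checksums for additional platforms (for example, when developers use macOS and CI runs on Linux). Commit this file to Git.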
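
To make the ~> operator used in the version constraint concrete, here is a small Python model of how it bounds versions. It is a sketch of the documented semantics for plain numeric versions only (no pre-release tags), and the function names are illustrative rather than part of Terraform.

```python
def parse_version(text: str) -> tuple:
    """Turn '5.31.0' into (5, 31, 0). Pre-release suffixes are not handled."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies_pessimistic(version: str, constraint: str) -> bool:
    """Check `version` against '~> constraint'.

    '~> 5.0'   allows >= 5.0   and < 6.0
    '~> 5.1.2' allows >= 5.1.2 and < 5.2.0
    Only constraints with two or more components are handled in this sketch.
    """
    lower = parse_version(constraint)
    if len(lower) < 2:
        raise ValueError("constraint needs at least two components")
    # Upper bound: drop the last component and bump the one before it.
    upper = lower[:-2] + (lower[-2] + 1,)
    candidate = parse_version(version)
    # Pad short versions with zeros so tuple comparison behaves as expected.
    candidate = candidate + (0,) * (3 - len(candidate))
    return lower <= candidate and candidate[: len(upper)] < upper
```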

Hard Lead / Architect Level Terraform
Q:

How do you manage multiple environments (dev/staging/prod) in Terraform? Workspaces vs. directory structure.

Two main approaches:

Terraform Workspaces: Use the same code but switch workspace to change state. Simple, but every environment shares one configuration and backend, so per-environment values must be passed explicitly (for example with -var-file) and it is easy to apply to the wrong workspace. Suitable for simple differences.

Separate Directories (recommended): Each environment has its own directory with its own terraform.tfvars and remote state. This is explicit, auditable, and allows environments to diverge safely.

environments/
  dev/
    main.tf → calls shared module
    terraform.tfvars
  staging/
    main.tf
    terraform.tfvars
  prod/
    main.tf
    terraform.tfvars
modules/
  vpc/
  eks/
Hard Lead / Architect Level Docker
Q:

How do you implement health checks in Docker and why are they important for orchestration?

The HEALTHCHECK instruction tells Docker how to test if a container is working correctly. Without it, Docker only knows whether the main process is running, so a container is treated as up even if the app inside is hung or cannot serve requests.

HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1

Kubernetes ignores HEALTHCHECK and uses liveness and readiness probes instead. In Docker Compose, Docker Swarm, or standalone Docker, HEALTHCHECK is how tooling learns whether a container is ready, for example to gate depends_on with condition: service_healthy or to replace unhealthy Swarm tasks.
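
Slim base images often ship without curl. In that case the same probe can be written with the Python standard library, assuming the image already contains a Python interpreter. The sketch below reuses the port and /health path from the example; the function names are illustrative.

```python
import urllib.error
import urllib.request


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True only if the endpoint answers with a 2xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Covers refused connections, timeouts, HTTP errors, and malformed URLs.
        return False


def main() -> int:
    """Exit code for HEALTHCHECK: 0 means healthy, 1 means unhealthy."""
    return 0 if is_healthy("http://localhost:8080/health") else 1
```

To use it as the HEALTHCHECK command, end the script with sys.exit(main()) under an if __name__ == "__main__": guard and point CMD at the file.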

Hard Lead / Architect Level Docker
Q:

How would you run containers as a non-root user for security hardening?

Running containers as root is a significant security risk. If an attacker escapes the container, they have root on the host. Harden your images:

FROM node:20-alpine

# Create a non-root user
RUN addgroup -S appgroup && adduser -S appuser -G appgroup

# Set working directory and permissions
WORKDIR /app
COPY --chown=appuser:appgroup . .

# Switch to non-root user
USER appuser

CMD ["node", "index.js"]

Also enforce this at the Kubernetes level with a SecurityContext: runAsNonRoot: true.

Hard Lead / Architect Level Kubernetes
Q:

Explain Kubernetes RBAC and how you would give a service account read-only access to pods.

RBAC (Role-Based Access Control) is the authorization mechanism in Kubernetes. It uses three objects:

  • Role/ClusterRole: Defines what actions are allowed on which resources.
  • ServiceAccount: An identity for pods or external tools.
  • RoleBinding/ClusterRoleBinding: Links a ServiceAccount to a Role.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
subjects:
- kind: ServiceAccount
  name: my-service-account
  namespace: default  # required for ServiceAccount subjects; adjust to your namespace
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io

Real Production Scenarios

Real-world architecture, system migration, and design challenges.

Hard Lead / Architect Level System Design
Q:

How do you implement security scanning in a GitHub Actions CI/CD pipeline?

A comprehensive security scanning pipeline:

jobs:
  security:
    runs-on: ubuntu-latest
    steps:
      # SAST — Static code analysis
      - uses: actions/checkout@v4
      - name: Run Semgrep
        uses: returntocorp/semgrep-action@v1

      # Dependency scanning
      - name: Run Snyk
        uses: snyk/actions/node@master
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}

      # Container image scanning
      - name: Build image
        run: docker build -t myapp:${{ github.sha }} .
      - name: Run Trivy
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: myapp:${{ github.sha }}
          severity: CRITICAL,HIGH
          exit-code: 1

      # IaC scanning
      - name: Run tfsec
        uses: aquasecurity/tfsec-action@v1.0.0
Hard Lead / Architect Level System Design
Q:

How do you implement secrets rotation without downtime?

Secret rotation is a critical security practice. Zero-downtime rotation process:

  1. Generate new secret without invalidating the old one (e.g., create a new DB user, or generate a new API key that coexists with the old one).
  2. Update secret store (AWS Secrets Manager, Vault) with the new value.
  3. Rotate applications: Applications use External Secrets Operator or Vault Agent to pick up new values. Configure TTL on cached secrets so they refresh within minutes.
  4. Verify: Confirm all services are using the new secret.
  5. Revoke old secret.

AWS Secrets Manager has native rotation with Lambda functions for RDS passwords. This can be fully automated.
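
The ordering of the five steps is what makes rotation safe: the old value stays valid until every consumer has moved to the new one. The following Python sketch models that overlap window with an in-memory object. RotatingSecret is a hypothetical stand-in for a real backend such as Vault or Secrets Manager, not an API of either.

```python
import hmac
import secrets
from typing import Optional


class RotatingSecret:
    """Accepts both the current and the previous value during a rotation."""

    def __init__(self, initial: str) -> None:
        self.current = initial
        self.previous: Optional[str] = None

    def rotate(self) -> str:
        """Steps 1-2: issue a new value while the old one keeps working."""
        self.previous = self.current
        self.current = secrets.token_urlsafe(32)
        return self.current

    def is_valid(self, presented: str) -> bool:
        """Steps 3-4: clients on either value authenticate successfully."""
        candidates = [self.current]
        if self.previous is not None:
            candidates.append(self.previous)
        # compare_digest avoids leaking information through timing.
        return any(hmac.compare_digest(presented, c) for c in candidates)

    def revoke_previous(self) -> None:
        """Step 5: close the overlap window once all clients have switched."""
        self.previous = None
```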

Hard Lead / Architect Level System Design
Q:

How do you implement network segmentation for a microservices application?

Network segmentation limits the blast radius of a compromise. In a microservices context:

  1. AWS: Security Groups + VPC design: Place services in private subnets. Use security groups to only allow traffic between services that need to communicate (e.g., allow port 5432 only from the API service to the database SG).
  2. Kubernetes: NetworkPolicies: Default-deny all inter-pod traffic. Explicitly allow only required paths.
  3. Service Mesh (Istio/Linkerd): Mutual TLS (mTLS) between all services — all communication is encrypted and authenticated at the network level. Zero-trust networking.
Hard Lead / Architect Level Linux
Q:

Explain how the Linux kernel handles I/O with the page cache.

The Linux kernel uses the page cache to cache file data in RAM to speed up I/O. When you read a file, the kernel copies it into page cache. Subsequent reads are served from RAM (microseconds) instead of disk (milliseconds).

Writes are also cached: data is written to the page cache first and then persisted to disk asynchronously (write-back). This is why free -h shows very little “free” memory on a healthy server: the kernel aggressively caches, and that memory appears under buff/cache and is reclaimable. Look at the “available” column instead. This is not a memory leak.

Relevant commands: vmstat, iostat, /proc/meminfo (Cached, Buffers), echo 3 > /proc/sys/vm/drop_caches to flush cache (dangerous in production).
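
To see how much RAM the cache is holding, the Cached and Buffers fields of /proc/meminfo can be read directly. The parser below is a minimal sketch that works on the file's text, so it can be tried on any machine; the function names are illustrative.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def cache_kb(meminfo: dict) -> int:
    """Memory held by the page cache and buffers, in kB."""
    return meminfo.get("Cached", 0) + meminfo.get("Buffers", 0)
```

On a Linux host, pass in the contents of /proc/meminfo, for example parse_meminfo(open("/proc/meminfo").read()).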

Hard Lead / Architect Level Observability
Q:

What is distributed tracing and how do you implement it with OpenTelemetry?

In a microservices architecture, a single user request touches dozens of services. Distributed tracing follows that request across all services, recording timing and metadata at each step.

OpenTelemetry (OTel) is the vendor-neutral standard for instrumentation. Implementation:

  1. Add the OTel SDK to each service.
  2. Instrumented HTTP clients and servers propagate the W3C traceparent header on each call, linking all spans into one trace.
  3. A collector (OTel Collector) receives spans and routes them to your backend (Jaeger, Zipkin, Tempo, Datadog).
  4. You can now visualize the full request path, identify slow spans, and pinpoint errors.
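
The traceparent header in step 2 follows the W3C Trace Context format: version, trace ID, parent span ID, and flags, separated by hyphens. The OTel SDK handles this for you. The sketch below only illustrates the mechanics with the standard library, and its function names are not part of OpenTelemetry.

```python
import re
import secrets
from typing import NamedTuple

# Version 00: 2 hex digits, 32 for the trace ID, 16 for the span ID, 2 for flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    flags: str


def parse_traceparent(header: str) -> TraceContext:
    """Extract the trace context from an incoming traceparent header."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        raise ValueError("malformed traceparent header")
    trace_id, span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        raise ValueError("all-zero IDs are invalid")
    return TraceContext(trace_id, span_id, flags)


def child_header(parent: TraceContext) -> str:
    """Build the header for an outgoing call: same trace ID, new span ID."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex characters
    return f"00-{parent.trace_id}-{new_span_id}-{parent.flags}"
```

Because every hop keeps the trace ID and replaces only the span ID, the backend can stitch spans from different services into a single request path.
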
Hard Lead / Architect Level Observability
Q:

How do you avoid alert fatigue in a large-scale microservices environment?

Alert fatigue happens when teams receive too many alerts, many of which are noise. Engineers start ignoring them — including real critical ones.

Strategies to combat it:

  • Symptom-based alerting: Alert on user-facing symptoms (error rate, latency) not causes (CPU high). CPU high does not always mean users are impacted.
  • Actionable alerts only: Every alert must have a clear runbook. If there’s no action to take, it shouldn’t be an alert.
  • SLO-based alerting: Alert when you’re burning through your error budget (the failure allowance implied by your SLO) too fast.
  • Regular alert audits: Review and delete alerts that consistently fire without requiring action.
  • Severity tiers: P1 wakes someone up. P3 creates a ticket. Many alerts should be P3.
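
Error-budget alerting is usually expressed as a burn rate: the observed error ratio divided by the ratio the objective allows. The sketch below shows the arithmetic. The 14.4 paging threshold and the two-window check are assumptions borrowed from common SRE practice, not values from this page.

```python
def burn_rate(errors: int, requests: int, slo: float) -> float:
    """How many times faster than allowed the error budget is being spent.

    slo is the target success ratio, for example 0.999 for 99.9%.
    A burn rate of 1.0 spends the budget exactly over the SLO period.
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be between 0 and 1")
    if requests <= 0:
        return 0.0
    budget = 1.0 - slo
    return (errors / requests) / budget


def should_page(short_window: float, long_window: float,
                threshold: float = 14.4) -> bool:
    """Page only when both windows burn fast, which filters short spikes."""
    return short_window >= threshold and long_window >= threshold
```

For example, 200 errors in 10,000 requests against a 99.9% objective is a burn rate of about 20, which would page if the longer window agrees.
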
Hard Lead / Architect Level AWS
Q:

Explain AWS Lambda cold starts and how to mitigate them in production.

A cold start occurs when Lambda needs to initialize a new execution environment — download the code, start the runtime, run your initialization code. This adds 100ms-1s+ of latency on the first request.

Mitigation strategies:

  • Provisioned Concurrency: Pre-warm a set number of Lambda execution environments. Eliminates cold starts for warmed instances (at extra cost).
  • Minimize package size: Smaller deployment packages initialize faster.
  • Use faster runtimes: Node.js and Python cold start faster than Java/C#.
  • Move init code outside the handler: DB connections and SDK clients initialized at module level persist across invocations.
  • Lambda SnapStart (Java): AWS-managed snapshot of initialized execution environment.
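
The "move init code outside the handler" advice looks like this in a Python function. The client here is a hypothetical stand-in for something expensive such as a database connection or an SDK client; INIT_COUNT exists only to make the reuse visible.

```python
import time

INIT_COUNT = 0


def _create_client() -> dict:
    """Hypothetical expensive setup (DB connection, SDK client, config load)."""
    global INIT_COUNT
    INIT_COUNT += 1
    return {"created_at": time.time()}


# Module level: runs once per execution environment, during the cold start.
CLIENT = _create_client()


def handler(event, context):
    """Warm invocations reuse CLIENT instead of rebuilding it on every call."""
    return {"statusCode": 200, "client_created_at": CLIENT["created_at"]}
```

The cold start still pays for _create_client() once, but every later invocation in the same environment skips it.
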
Hard Lead / Architect Level AWS
Q:

How does IAM assume-role work and how do you implement cross-account access securely?

Cross-account access uses the sts:AssumeRole API. A role in Account B has a trust policy that allows Account A to assume it:

# Trust policy on role in Account B
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::ACCOUNT_A_ID:root"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Account A’s entity calls aws sts assume-role to get temporary credentials for Account B (1 hour by default, up to 12 hours if the role’s maximum session duration allows it). Security controls:

  • Add ExternalId condition for third-party access (prevents confused deputy attacks)
  • Add MFA condition for sensitive roles
  • Use SCPs at the AWS Organization level to restrict what can be assumed
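
The ExternalId and MFA controls are expressed as conditions on the trust policy. The helper below is a sketch that assembles such a policy as a Python dict. The function name and the example account ID are placeholders, while sts:ExternalId and aws:MultiFactorAuthPresent are the standard condition keys.

```python
from typing import Optional


def build_trust_policy(trusted_account_id: str,
                       external_id: Optional[str] = None,
                       require_mfa: bool = False) -> dict:
    """Return a trust policy allowing another account to assume the role."""
    condition = {}
    if external_id:
        # Guards against confused-deputy attacks in third-party integrations.
        condition["StringEquals"] = {"sts:ExternalId": external_id}
    if require_mfa:
        condition["Bool"] = {"aws:MultiFactorAuthPresent": "true"}

    statement = {
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
        "Action": "sts:AssumeRole",
    }
    if condition:
        statement["Condition"] = condition
    return {"Version": "2012-10-17", "Statement": [statement]}
```

Serialize the result with json.dumps before passing it to your IaC tool or the AWS CLI.
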
Hard Lead / Architect Level AWS
Q:

How would you architect a highly available, multi-region AWS deployment?

Multi-region HA involves several layers:

  1. DNS: Route 53 with health checks and latency/failover routing policies to direct users to the nearest healthy region.
  2. Data replication: RDS cross-Region read replicas with promotion capability. DynamoDB Global Tables for active-active.
  3. Edge: CloudFront CDN with origins in multiple regions.
  4. Infrastructure: Identical infrastructure in each region managed by Terraform.
  5. DR strategy: Define RTO (Recovery Time Objective) and RPO (Recovery Point Objective) to determine your architecture (Pilot Light, Warm Standby, or Active-Active).
Hard Lead / Architect Level Terraform
Q:

How do you implement Terraform in a CI/CD pipeline safely?

Running Terraform in CI/CD requires careful guardrails:

  1. PR triggers plan: On every pull request, run terraform plan and post the output as a PR comment (using tools like Atlantis or terraform-pr-commenter).
  2. Merge triggers apply: Only apply after PR is merged to main. Require manual approval for production.
  3. State locking: Ensure backend state locking is configured (for the S3 backend, traditionally a DynamoDB lock table) to prevent concurrent applies.
  4. OIDC credentials: Use OIDC to get short-lived tokens from AWS instead of storing long-lived access keys.
  5. Plan artifacts: Save the plan file and apply that exact file — never re-plan at apply time.
Hard Lead / Architect Level Terraform
Q:

What is Terraform state drift and how do you handle it?

State drift occurs when the real infrastructure differs from what Terraform state believes it to be — typically due to manual changes made in the AWS console or another tool.

Detection: terraform plan will show changes that seem unexpected.

Resolution options:

  1. Import: terraform import to import manually created resources into state.
  2. Refresh: terraform refresh to update state to match reality (deprecated in favor of terraform apply -refresh-only; use terraform plan -refresh-only to preview).
  3. Accept drift: Use lifecycle { ignore_changes = [...] } for intentionally externally-managed attributes.

Prevention: Give humans read-only access to production and block out-of-band changes with AWS Organizations SCPs, so that all changes flow through Terraform.
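
Drift detection can be automated by running a plan on a schedule and inspecting its exit status. With -detailed-exitcode, terraform plan exits 0 when nothing would change, 2 when changes are pending, and 1 on error. The wrapper below is a minimal sketch: the function names are illustrative, and it assumes terraform is on PATH and the directory is already initialized.

```python
import subprocess


def classify_plan_exit(code: int) -> str:
    """Map the -detailed-exitcode status to a readable result."""
    if code == 0:
        return "in-sync"
    if code == 2:
        return "changes-pending"
    return "error"


def check_drift(workdir: str) -> str:
    """Run a plan in workdir and report whether anything would change."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

On a branch with no unapplied code changes, "changes-pending" means the real infrastructure has drifted from the configuration.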

Hard Lead / Architect Level CI/CD
Q:

How do you implement a multi-environment deployment pipeline (dev → staging → prod)?

A professional multi-environment pipeline uses gates between stages:

  1. Build once: A single immutable artifact (Docker image with SHA tag) is promoted — never rebuilt.
  2. Deploy to Dev: Automatic on every merge to main.
  3. Deploy to Staging: Automatic after dev health checks pass. Run integration and smoke tests.
  4. Deploy to Prod: Manual approval gate + scheduled deployment window.

The key is that the same image moves through all environments. This ensures what you tested in staging is exactly what runs in production.

Hard Lead / Architect Level CI/CD
Q:

How do you structure a monorepo CI/CD pipeline to avoid unnecessary builds?

In a monorepo with 20+ services, you must only trigger builds for services that actually changed. Strategies:

  • Path filters: GitHub Actions paths: filter to trigger workflows only when specific directories change.
  • Nx / Turborepo: Task runners with build graph awareness that skip unchanged services.
  • git diff: Compare changed files against the base branch and only build affected services.
# GitHub Actions path filter
on:
  push:
    paths:
      - "services/api/**"
      - "shared/lib/**"
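
The git diff strategy comes down to mapping changed paths to services. The sketch below assumes the layout implied by the path filter above (services/<name>/ plus a shared/ directory whose changes affect everything). The function name and the layout are assumptions to adapt to your repository.

```python
from pathlib import PurePosixPath

# Assumption: a change under any of these prefixes rebuilds every service.
SHARED_PREFIXES = ("shared/",)


def affected_services(changed_files: list, all_services: set) -> set:
    """Return the services that need a build for this set of changed files."""
    affected = set()
    for path in changed_files:
        if path.startswith(SHARED_PREFIXES):
            return set(all_services)
        parts = PurePosixPath(path).parts
        if len(parts) >= 2 and parts[0] == "services" and parts[1] in all_services:
            affected.add(parts[1])
    return affected
```

Feed it the output of git diff --name-only against the base branch, split into lines.
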
Hard Lead / Architect Level CI/CD
Q:

How do you secure a CI/CD pipeline from supply chain attacks?

Supply chain attacks (like SolarWinds, XZ Utils) target the build pipeline itself. Defense layers:

  1. Pin action versions: Use the full commit SHA, not floating tags like @v2. uses: actions/checkout@abc123 (shortened here; pin the full 40-character SHA)
  2. SBOM generation: Generate a Software Bill of Materials at build time using Syft.
  3. Image signing: Sign images with Cosign (Sigstore). Verify signatures before deployment.
  4. Least privilege: GitHub Actions tokens should have minimal permissions. Set a restrictive default such as permissions: contents: read and grant additional scopes per job.
  5. Dependency review: Use Dependabot or Renovate for automated dependency updates.
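
Pinning is easy to enforce mechanically. The sketch below scans workflow text for uses: references that are not pinned to a full 40-character commit SHA. It is a line-based check rather than a YAML parser, and the function name is illustrative.

```python
import re

_USES = re.compile(r"^\s*-?\s*uses:\s*([^\s#]+)")
_FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list:
    """List action references that use a tag or branch instead of a full SHA."""
    unpinned = []
    for line in workflow_text.splitlines():
        match = _USES.match(line)
        if not match:
            continue
        ref = match.group(1).strip("'\"")
        # Local actions and container references have no Git ref to pin.
        if ref.startswith("./") or ref.startswith("docker://"):
            continue
        _, _, version = ref.partition("@")
        if not _FULL_SHA.match(version):
            unpinned.append(ref)
    return unpinned
```

Failing the build when this list is non-empty keeps floating tags from creeping back in.
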
Hard Lead / Architect Level Docker
Q:

How do you scan Docker images for vulnerabilities in a CI/CD pipeline?

Image scanning should be a mandatory gate before pushing to production. Tools and integration steps:

  • Trivy (Aqua): Fast, comprehensive, easy CI integration. trivy image myapp:latest
  • Snyk: Deep dependency scanning with developer-friendly output.
  • Docker Scout: Built into Docker Hub.
  • Grype: From Anchore, works well with SBOM workflows.
# GitHub Actions example
- name: Scan image with Trivy
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: myapp:${{ github.sha }}
    severity: CRITICAL,HIGH
    exit-code: 1  # Fail the pipeline on CRITICAL or HIGH vulnerabilities
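
When the gate needs custom logic, such as different thresholds per environment, the scanner's JSON report can be evaluated in a script instead. The sketch below assumes the report shape produced by trivy image --format json, where findings sit under Results[].Vulnerabilities[].Severity. The function names are illustrative.

```python
from collections import Counter


def count_by_severity(report: dict) -> Counter:
    """Tally findings per severity across all scan targets in the report."""
    counts = Counter()
    for result in report.get("Results") or []:
        # The key can be missing or null when a target has no findings.
        for vuln in result.get("Vulnerabilities") or []:
            counts[vuln.get("Severity", "UNKNOWN")] += 1
    return counts


def passes_gate(report: dict, blocking=("CRITICAL", "HIGH")) -> bool:
    """True when the image has no findings at a blocking severity."""
    counts = count_by_severity(report)
    return all(counts[severity] == 0 for severity in blocking)
```

Load the report with json.load and exit non-zero when passes_gate returns False.
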
Hard Lead / Architect Level Kubernetes
Q:

Explain Kubernetes network policies and how you would isolate a production namespace.

By default, all pods in a Kubernetes cluster can communicate with each other freely. NetworkPolicies are namespace-scoped firewall rules that control which pods can talk to which. They are only enforced if the cluster’s CNI plugin supports them (for example Calico or Cilium).

To enforce full isolation on a namespace, start by denying all ingress and egress, then selectively allow only what’s needed:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

Then add specific allow rules for your database, monitoring agents, and DNS (port 53).

Hard Lead / Architect Level Kubernetes
Q:

How do you manage secrets securely in Kubernetes? What are the alternatives to plain Kubernetes Secrets?

Kubernetes Secrets are base64-encoded, not encrypted by default. For production, consider these approaches:

  • Encryption at Rest: Enable EncryptionConfiguration to encrypt secrets in etcd.
  • External Secrets Operator: Syncs secrets from AWS Secrets Manager, GCP Secret Manager, or HashiCorp Vault into Kubernetes Secrets automatically.
  • HashiCorp Vault Agent Injector: Injects secrets directly into Pod filesystems without storing them in Kubernetes at all.
  • Sealed Secrets: Encrypts secrets client-side so they are safe to commit to Git.
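
The point about base64 is easy to demonstrate: anyone who can read the Secret object can recover the value with one call. A minimal sketch using the standard library, with illustrative function names:

```python
import base64


def encode_secret_value(plaintext: str) -> str:
    """Encode a value the way it appears in a Secret's data field."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Reverse the encoding. No key is required."""
    return base64.b64decode(encoded).decode("utf-8")
```

Base64 is only an encoding that lets binary data travel in YAML and JSON, which is why encryption at rest or an external secret store is still needed.
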
Hard Lead / Architect Level Kubernetes
Q:

How do you implement Zero-Downtime deployments with Kubernetes Service objects?

Discuss RollingUpdate strategies, readiness probes, and the role of Service selectors in traffic routing during a rollout.

Troubleshooting Scenarios

Live system debugging, incident diagnostics, and latency resolution.

Hard Lead / Architect Level Observability
Q:

How do you implement on-call rotation and incident response in an SRE team?

A mature on-call process has these elements:

  • Schedules: PagerDuty or OpsGenie for rotating on-call assignments with escalations.
  • Runbooks: Every alert links to a runbook with investigation steps and common resolutions.
  • Severity levels: P1 (major outage, wake anyone up) → P4 (low impact, business hours only).
  • Incident channels: Dedicated Slack channel per incident. Assign Incident Commander, Communications Lead roles.
  • Postmortems: Blameless postmortem for every P1/P2. Focus on system improvements, not blaming individuals.
  • On-call health: Track toil. If engineers are getting paged more than 2-3 times per shift, the alert quality needs improvement.
Hard Lead / Architect Level Kubernetes
Q:

How do you troubleshoot high memory usage causing OOMKilled events in production?

When a container exceeds its memory limit, the kernel OOM killer terminates it and Kubernetes logs OOMKilled. Steps to resolve:

  1. Identify: kubectl describe pod <pod> — look for Reason: OOMKilled in Last State.
  2. Profile: Use kubectl top pod or Prometheus/Grafana to understand actual memory usage patterns.
  3. Fix: Either increase limits if the app genuinely needs more memory, or find and fix the memory leak in the application code.
  4. Prevent: Set up PrometheusRule or Datadog alerts to notify before a pod hits its limit.
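
The "notify before a pod hits its limit" check is a ratio of usage to limit. The sketch below shows that comparison for Kubernetes-style memory quantities. It handles only the common suffixes, and the 90% threshold is an assumed default rather than a recommendation from this page.

```python
_FACTORS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,  # binary suffixes
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,     # decimal suffixes
}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '512Mi' or '1G' to bytes."""
    quantity = quantity.strip()
    for suffix, factor in _FACTORS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # a plain number is already in bytes


def near_limit(usage: str, limit: str, threshold: float = 0.9) -> bool:
    """True when usage has reached the given fraction of the memory limit."""
    return parse_memory(usage) >= threshold * parse_memory(limit)
```

In practice the same ratio is usually written as an alerting rule in the monitoring system, but the arithmetic is identical.
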
Hard Lead / Architect Level Kubernetes
Q:

How do you debug a pod stuck in CrashLoopBackOff?

CrashLoopBackOff means the container starts but repeatedly crashes. Use this systematic approach:

  1. Check logs: kubectl logs <pod> --previous to see the crash output.
  2. Describe the pod: kubectl describe pod <pod> to inspect Events, resource limits, and probe failures.
  3. Check OOM: If you see OOMKilled, the container exceeded its memory limit.
  4. Shell override: Override the entrypoint to keep the container alive for inspection: command: ["sleep", "3600"]