All DevOps Interview Questions

Browse our comprehensive question bank. Updated regularly with real interview scenarios.

Beginner Questions

Core concepts, syntax, and foundational command-line knowledge.

Easy Associate Level System Design
Q:

What is the difference between authentication and authorization?

Authentication (AuthN): Verifying the identity of a user or service. “Who are you?” Authentication happens first — you prove your identity with a password, token, certificate, or biometric.

Authorization (AuthZ): Determining what an authenticated identity is allowed to do. “What can you do?” Authorization happens after authentication — once we know who you are, we check your permissions.

Example in AWS: You authenticate to AWS with your access key (AuthN). Then AWS checks your IAM policies to see if you’re authorized to call s3:PutObject (AuthZ). Both can fail independently.
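
The two checks can be sketched as separate functions. This is a minimal illustration, not how AWS implements it: the user store, the permission table, the salt, and all function names are hypothetical.

```python
import hashlib
import hmac

# Hypothetical in-memory stores, for illustration only.
_SALT = b"demo-salt"
USERS = {"alice": hashlib.pbkdf2_hmac("sha256", b"s3cret", _SALT, 100_000)}
PERMISSIONS = {"alice": {"s3:GetObject"}}


def authenticate(username: str, password: str) -> bool:
    """AuthN: does the caller prove they are who they claim to be?"""
    stored = USERS.get(username)
    if stored is None:
        return False
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), _SALT, 100_000)
    return hmac.compare_digest(stored, candidate)


def authorize(username: str, action: str) -> bool:
    """AuthZ: is this already-identified caller allowed to perform the action?"""
    return action in PERMISSIONS.get(username, set())


def handle_request(username: str, password: str, action: str) -> int:
    """Return an HTTP-style status: 401 on AuthN failure, 403 on AuthZ failure."""
    if not authenticate(username, password):
        return 401
    if not authorize(username, action):
        return 403
    return 200
```

A wrong password yields 401 regardless of permissions, while a correct password with a missing permission yields 403, which is the "both can fail independently" point.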

Easy Associate Level System Design
Q:

What is multi-factor authentication (MFA) and why should it be enforced for cloud accounts?

MFA requires two or more verification factors from different categories: something you know (password), something you have (TOTP app, hardware key), or something you are (biometric). Even if a password is compromised, an attacker still needs the second factor to get in.

For AWS/cloud accounts:

  • Enable MFA on the root account immediately, and don’t use the root account for routine work
  • Require MFA for IAM users with an IAM policy condition (aws:MultiFactorAuthPresent)
  • Use hardware MFA keys (YubiKey) for privileged accounts
  • Use AWS Organizations SCPs to deny API calls across accounts unless MFA is present
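
To make the "something you have" factor concrete, here is a minimal sketch of how a TOTP app derives its code, following RFC 6238 with HMAC-SHA1. The function name and defaults are illustrative; real deployments should rely on a maintained library and a securely provisioned secret.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Compute a time-based one-time password (RFC 6238, HMAC-SHA1)."""
    # The moving factor is the number of time steps since the Unix epoch.
    counter = unix_time // step
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226): the low nibble of the last byte picks an offset.
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)
```

Because the code depends on both a shared secret and the current time window, a stolen password alone is not enough to reproduce it.
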
Easy Associate Level System Design
Q:

What is TLS/SSL and why is it important for DevOps engineers to understand it?

TLS (Transport Layer Security) encrypts communication between clients and servers, preventing eavesdropping and man-in-the-middle attacks. It replaced the deprecated SSL protocol.

DevOps engineers encounter TLS in:

  • Configuring HTTPS for web services (Let’s Encrypt, ACM in AWS)
  • Kubernetes Ingress TLS termination
  • mTLS between microservices (Istio, Linkerd)
  • Certificate rotation — expired certs cause outages
  • Internal PKI for service-to-service auth

Automate certificate renewal with cert-manager in Kubernetes or AWS Certificate Manager. Never rely on manual renewal.
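
A renewal check can be sketched as follows. It assumes the expiry date arrives as a notAfter string in the format returned by Python's ssl.SSLSocket.getpeercert(); the function names and the 30-day threshold are illustrative assumptions.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter timestamp.

    not_after uses the getpeercert() format, e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def needs_renewal(not_after: str, threshold_days: int = 30,
                  now: Optional[float] = None) -> bool:
    """True when the certificate is inside the renewal window (or already expired)."""
    return days_until_expiry(not_after, now) < threshold_days
```

A check like this could feed an alert, so that a failed automated renewal is noticed well before the certificate actually expires.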

Easy Associate Level Linux
Q:

What is the purpose of /etc/hosts and how does DNS resolution work in Linux?

Name resolution order in Linux is set by the hosts line in /etc/nsswitch.conf. With the common default (files dns):

  1. /etc/hosts: Local overrides. Checked first. Maps hostnames to IPs without a DNS lookup.
  2. DNS servers (/etc/resolv.conf): The configured nameservers are queried, normally over UDP port 53, with TCP used for large responses.

Common use cases for /etc/hosts: local development overrides, blocking domains by pointing to 127.0.0.1, testing service connectivity using a service name before DNS is configured. In Kubernetes, the kubelet manages each pod’s /etc/hosts, while the pod’s /etc/resolv.conf points at the cluster DNS service (typically CoreDNS).
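
The lookup order can be modeled with a short sketch. It is a simplification of what the C library does: parse_hosts and resolve are hypothetical names, and dns_lookup stands in for a real DNS query.

```python
from typing import Callable, Dict


def parse_hosts(text: str) -> Dict[str, str]:
    """Parse /etc/hosts-style text into a hostname -> IP mapping."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            # In this sketch the first entry for a name wins.
            mapping.setdefault(name.lower(), ip)
    return mapping


def resolve(name: str, hosts: Dict[str, str],
            dns_lookup: Callable[[str], str]) -> str:
    """Check the hosts mapping first, then fall back to DNS ("files dns" order)."""
    ip = hosts.get(name.lower())
    if ip is not None:
        return ip
    return dns_lookup(name)
```

An entry in the hosts mapping short-circuits the DNS call entirely, which is why /etc/hosts overrides work even when no nameserver is reachable.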

Easy Associate Level Linux
Q:

What is the difference between processes and threads in Linux?

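A process is an independent program in execution with its own memory space, file descriptors, and system resources. Creating one with fork() costs more than creating a thread, because the kernel has to set up a separate address space, even though memory pages are copied lazily (copy-on-write).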

A thread is a unit of execution within a process. Threads within the same process share the same memory space and open file descriptors, making communication between them fast. Thread creation is lighter than process creation.

In Linux, threads are implemented as “lightweight processes” created with the clone() system call. Tools like htop can show threads per process.
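
The memory-sharing difference can be observed directly. This sketch assumes a Unix-like system (it uses os.fork); the counter and function names are illustrative.

```python
import os
import threading

# Shared state that a thread and a child process will each try to modify.
counter = {"value": 0}


def bump() -> None:
    counter["value"] += 1


def run_in_thread() -> int:
    """A thread shares the parent's memory, so its change is visible afterwards."""
    worker = threading.Thread(target=bump)
    worker.start()
    worker.join()
    return counter["value"]


def run_in_child_process() -> int:
    """A forked child modifies its own copy; the parent's value is unchanged."""
    pid = os.fork()
    if pid == 0:
        bump()        # changes only the child's copy-on-write copy
        os._exit(0)   # leave the child immediately, skipping cleanup handlers
    os.waitpid(pid, 0)
    return counter["value"]
```

Starting from zero, the thread leaves the counter at 1, and the forked child's increment never reaches the parent. Processes therefore need explicit IPC (pipes, sockets, shared memory) where threads can simply share variables.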

Easy Associate Level Linux
Q:

What is the difference between a hard link and a symbolic (soft) link in Linux?

Hard Link: A directory entry that points directly to the same inode as the original file. The original name and the hard link are indistinguishable. Deleting one name doesn’t affect the other, and the data is freed only when the last link is removed. Hard links cannot span filesystems and generally cannot point to directories.

Symbolic (Soft) Link: A pointer to another file’s path. If the original is deleted, the symlink becomes a broken “dangling” link. Symlinks can cross filesystems and point to directories.

# Hard link
ln original.txt hardlink.txt

# Symbolic link
ln -s /etc/nginx/sites-available/mysite /etc/nginx/sites-enabled/mysite
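
The behavior described above can be checked with a small script. It is a sketch that works in a temporary directory on a Unix-like filesystem; the file names and the demo_links function are illustrative.

```python
import os
import tempfile


def demo_links() -> dict:
    """Create a hard link and a symlink, delete the original, and report what survives."""
    with tempfile.TemporaryDirectory() as workdir:
        original = os.path.join(workdir, "original.txt")
        hard = os.path.join(workdir, "hardlink.txt")
        soft = os.path.join(workdir, "symlink.txt")

        with open(original, "w") as handle:
            handle.write("hello")
        os.link(original, hard)      # same inode, second directory entry
        os.symlink(original, soft)   # separate file that stores a path

        same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
        link_count = os.stat(original).st_nlink

        os.remove(original)          # removes one name, not the data
        with open(hard) as handle:
            hard_content = handle.read()

        return {
            "same_inode": same_inode,
            "link_count": link_count,
            "hard_link_still_readable": hard_content == "hello",
            # islink() inspects the link itself; exists() follows it to the target.
            "symlink_dangling": os.path.islink(soft) and not os.path.exists(soft),
        }
```

After the original name is removed, the hard link still reads the data while the symlink points at a path that no longer exists.
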
Easy Associate Level Observability
Q:

What is the difference between monitoring and observability?

Monitoring is about tracking known failure modes. You define metrics and alerts for things you know can go wrong. It answers: “Is this thing I’m watching broken?”

Observability is about understanding system behavior from its outputs. It allows you to answer questions you didn’t think to ask beforehand — debugging novel failures you’ve never seen before.

Monitoring tells you something is wrong. Observability tells you why. You need both, but as systems grow more complex, observability becomes more critical for understanding emergent failures.

Easy Associate Level Observability
Q:

What are the three pillars of observability?

The three pillars of observability are:

  1. Metrics: Numerical measurements aggregated over time (CPU usage, request rate, error rate). Good for dashboards and alerting on trends.
  2. Logs: Timestamped records of discrete events. Good for debugging specific incidents and understanding what happened.
  3. Traces: Records of a request’s journey through a distributed system. Essential for finding bottlenecks and understanding service dependencies in microservices.

Together they answer: Is something wrong? (metrics), What is wrong? (logs), Where and why is it wrong? (traces).

Easy Associate Level AWS
Q:

What is the AWS Shared Responsibility Model?

AWS and customers share security responsibilities — the line depends on the service type:

AWS is responsible for: Security “of” the cloud — physical data centers, hypervisors, networking hardware, managed service infrastructure.

You are responsible for: Security “in” the cloud — your operating systems, your application code, IAM configurations, data encryption, network configuration (VPC, security groups), and patching guest OS on EC2.

For managed services like RDS or Lambda, AWS takes on more responsibility (OS patching), but you still own IAM, data, and network controls.

Easy Associate Level AWS
Q:

What is the difference between S3 Standard, S3 Infrequent Access, and S3 Glacier?

AWS S3 offers storage classes with different cost/access tradeoffs:

  • Standard: High durability, low latency, high throughput. For frequently accessed data.
  • Standard-IA (Infrequent Access): Same latency as Standard but cheaper storage cost. Higher per-retrieval cost. Use for data accessed less than once a month.
  • Glacier Instant Retrieval: For archive data accessed a few times per year. Millisecond retrieval.
  • Glacier Deep Archive: Lowest cost. Standard retrievals complete within 12 hours. Use for compliance/regulatory long-term retention.

Use S3 Lifecycle Policies to automatically transition objects between classes based on age.
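
A lifecycle rule of this kind might look like the sketch below, which builds the configuration as a plain dictionary in the shape used by the S3 PutBucketLifecycleConfiguration API. The rule ID, the logs/ prefix, and the 30/90/365-day thresholds are assumptions chosen for illustration.

```python
import json


def lifecycle_configuration(prefix: str = "logs/") -> dict:
    """Build a lifecycle configuration that tiers objects down as they age."""
    return {
        "Rules": [
            {
                "ID": "tier-down-by-age",      # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # only objects under this prefix
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER_IR"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(lifecycle_configuration(), indent=2))
```

With boto3 installed, a dictionary like this could be passed as the LifecycleConfiguration argument of put_bucket_lifecycle_configuration. The thresholds should follow real access patterns, since retrieval fees can outweigh the storage savings.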

Easy Associate Level AWS
Q:

What is the difference between IAM users, groups, roles, and policies in AWS?

Users: Individual identities for people or applications with long-term credentials (access key + secret).

Groups: Collections of users that share the same permissions. Manage permissions at group level, not individually.

Roles: Identities assumed temporarily by AWS services (EC2, Lambda), federated users, or cross-account access. No long-term credentials — they use short-lived tokens. This is the preferred approach.

Policies: JSON documents that define permissions. Attached to users, groups, or roles.

Best practice: Always use roles over users for AWS service authentication.
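
What makes a role different from a user is its trust policy, which states who may assume it. The sketch below builds a trust policy that lets the EC2 service assume a role; the function name is illustrative, and permissions would be attached to the role through separate policies.

```python
import json


def ec2_trust_policy() -> dict:
    """Trust policy allowing EC2 instances to assume the role via STS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The principal is an AWS service, not a person with a password.
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(ec2_trust_policy(), indent=2))
```

An instance launched with such a role receives short-lived credentials automatically, so no access key has to be stored on the machine.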

Easy Associate Level Terraform
Q:

What is Infrastructure as Code (IaC) and what are its main benefits?

Infrastructure as Code means managing and provisioning infrastructure through machine-readable configuration files instead of manual processes.

Key benefits:

  • Reproducibility: Spin up identical environments on demand.
  • Version control: Track all infrastructure changes in Git. Know who changed what and when.
  • Auditability: Compliance teams can review what infrastructure is being provisioned.
  • Self-documentation: The code is the documentation.
  • Disaster recovery: Re-create an entire environment from code instead of rebuilding it by hand.
Easy Associate Level Docker
Q:

What is the difference between Docker COPY and ADD instructions?

Both copy files into the image, but ADD has extra behavior that makes it less predictable:

  • ADD can fetch files from a remote URL
  • ADD auto-extracts local tar archives into the destination (archives fetched from a URL are not extracted)

Best practice: Prefer COPY unless you specifically need the URL or auto-extraction features. COPY is explicit and predictable, which is better for reproducible builds.

Easy Associate Level Docker
Q:

What is the purpose of ENTRYPOINT vs CMD in a Dockerfile?

CMD provides the default command, or the default arguments when an ENTRYPOINT is set. It is overridden by any arguments passed to docker run after the image name.

ENTRYPOINT defines the fixed command that always runs. It cannot be overridden without the --entrypoint flag.

Best practice: Use ENTRYPOINT for the executable and CMD for default arguments, making the container behave like a command-line tool:

ENTRYPOINT ["python", "app.py"]
CMD ["--port", "8080"]
# docker run myapp --port 9090  ← overrides CMD only
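
How the final command line is assembled can be modeled in a few lines. This sketch covers only the exec (JSON array) form and ignores the --entrypoint flag; effective_command is a hypothetical helper, not part of Docker.

```python
from typing import List


def effective_command(entrypoint: List[str], cmd: List[str],
                      run_args: List[str]) -> List[str]:
    """Model the process argv for exec-form ENTRYPOINT and CMD.

    Arguments passed to `docker run` after the image name replace CMD;
    ENTRYPOINT stays in place.
    """
    args = run_args if run_args else cmd
    return list(entrypoint) + list(args)


if __name__ == "__main__":
    entrypoint = ["python", "app.py"]
    cmd = ["--port", "8080"]
    print(effective_command(entrypoint, cmd, []))                  # defaults from CMD
    print(effective_command(entrypoint, cmd, ["--port", "9090"]))  # CMD replaced
```

The first call models a plain docker run myapp, and the second models docker run myapp --port 9090.
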
Easy Associate Level Docker
Q:

What is the difference between a Docker image and a Docker container?

A Docker image is a read-only template built from a Dockerfile. Think of it as a class definition. A container is a running instance of that image — a class instantiation. You can run many containers from the same image, each isolated from the others.

# Build an image
docker build -t my-app:1.0 .

# Run a container from that image
docker run -d -p 8080:80 my-app:1.0
Easy Associate Level Kubernetes
Q:

What is a ConfigMap and when would you use it over an environment variable?

A ConfigMap stores non-sensitive configuration data as key-value pairs. It decouples your configuration from your container image.

Use ConfigMaps over hardcoded env vars when:

  • Config needs to differ between environments (dev/staging/prod)
  • Multiple pods share the same configuration
  • You need to mount config as a file (e.g., nginx.conf, prometheus.yml)

For sensitive data like passwords, use a Secret instead of a ConfigMap.

Real Production Scenarios

Real-world architecture, system migration, and design challenges.

Easy Associate Level System Design
Q:

What is the principle of least privilege and why is it critical in DevOps?

The principle of least privilege (PoLP) states that any user, process, or service should only have the minimum permissions necessary to perform its function — nothing more.

In DevOps this applies to:

  • IAM roles: A Lambda function that reads from S3 should only have s3:GetObject on that specific bucket, not full S3 access.
  • Kubernetes RBAC: A deployment automation service account only needs update permissions on Deployments, not cluster-admin.
  • CI/CD tokens: A build token should be able to push to a registry but not manage IAM users.

Blast radius reduction: if credentials are compromised, least privilege limits what an attacker can do.
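
A simple review aid can flag policies that grant more than they need. The sketch below is a rough heuristic, not a replacement for tools such as IAM Access Analyzer; the function names, the example bucket name, and both sample policies are hypothetical.

```python
from typing import Any, Dict, List


def _as_list(value: Any) -> List[Any]:
    """IAM allows a single string or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def overbroad_statements(policy: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return Allow statements that use a wildcard action or resource."""
    flagged = []
    for statement in _as_list(policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        actions = _as_list(statement.get("Action", []))
        resources = _as_list(statement.get("Resource", []))
        wildcard_action = any(a == "*" or a.endswith(":*") for a in actions)
        if wildcard_action or "*" in resources:
            flagged.append(statement)
    return flagged


# Scoped to one action on one (hypothetical) bucket.
LEAST_PRIVILEGE = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "s3:GetObject",
                   "Resource": "arn:aws:s3:::example-bucket/*"}],
}

# Full S3 access on every resource: far more than a read-only function needs.
TOO_BROAD = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}],
}
```

Running overbroad_statements in CI against policy files is one possible way to catch wildcard grants before they are deployed.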

Easy Associate Level Linux
Q:

What is the difference between SSH key authentication and password authentication?

Password authentication: User provides a password. Vulnerable to brute-force attacks, password spraying, and phishing. Should be disabled for SSH in production.

SSH Key authentication: The client proves ownership of a private key without ever transmitting it. The server holds the public key in ~/.ssh/authorized_keys. Private key never leaves the client.

# Generate key pair
ssh-keygen -t ed25519 -C "anmol@devopsinterview.com"

# Copy public key to server
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@server

# Disable password auth in /etc/ssh/sshd_config
PasswordAuthentication no

Use ed25519 keys — they are faster and more secure than RSA 2048.

Easy Associate Level AWS
Q:

What is the difference between horizontal and vertical scaling in AWS?

Vertical Scaling (Scale Up): Increase the size of an existing instance (e.g., t3.medium → c5.4xlarge). Simple, but it has a ceiling (there’s a maximum instance size), and resizing an EC2 instance requires stopping it, which means downtime.

Horizontal Scaling (Scale Out): Add more instances behind a load balancer. No theoretical ceiling. Enables high availability and fault tolerance because traffic is spread across multiple instances in multiple AZs.

AWS Auto Scaling Groups with Application Load Balancers enable fully automated horizontal scaling based on metrics like CPU or custom CloudWatch metrics.

Easy Associate Level Terraform
Q:

What is the purpose of terraform.tfvars files?

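terraform.tfvars files provide values for your declared variables, keeping the values separate from the variable definitions. This allows you to have different values per environment without modifying the core modules. Terraform loads terraform.tfvars and *.auto.tfvars automatically; other files, such as production.tfvars below, must be passed with -var-file.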

# variables.tf — defines the variable
variable "instance_type" {
  description = "EC2 instance type"
  type        = string
}

# production.tfvars — provides the value
instance_type = "c5.2xlarge"

# development.tfvars
instance_type = "t3.micro"

Never commit .tfvars files containing sensitive values to Git. Use .gitignore and pass sensitive values via environment variables (TF_VAR_*) in CI/CD.

Easy Associate Level Terraform
Q:

What does terraform plan do and why should you always review it before applying?

terraform plan creates an execution plan — a preview of what Terraform will do before it actually makes changes. It shows additions, modifications, and destructions.

Always review the plan because:

  • It may show unexpected destructions (e.g., a stateful database being replaced instead of modified)
  • It catches misconfiguration before real infrastructure is affected
  • In a CI/CD pipeline, save the plan output and apply that exact plan in the next step to ensure consistency

terraform plan -out=tfplan
terraform apply tfplan

Easy Associate Level CI/CD
Q:

What is a pipeline artifact and what are common examples?

A pipeline artifact is any file produced by a CI/CD job that needs to be passed to downstream jobs or stored for later use.

Common examples:

  • Compiled binary or JAR file (Java/Go)
  • Built Docker image pushed to a registry
  • Frontend build output (dist/ or build/ folder)
  • Test reports and coverage reports
  • SBOM (Software Bill of Materials) files
  • Terraform plan output
Easy Associate Level CI/CD
Q:

Why do you use branch protection rules in a CI/CD workflow?

Branch protection rules on the main or production branch enforce quality gates before any code is merged:

  • Require pull request reviews (at least 1-2 approvals)
  • Require status checks to pass (CI build, tests, linting)
  • Require branches to be up to date before merging
  • Prevent force pushes and branch deletion

This ensures no untested or unreviewed code ever reaches production, which is the foundation of a trustworthy deployment pipeline.

Easy Associate Level CI/CD
Q:

What is the difference between Continuous Integration, Continuous Delivery, and Continuous Deployment?

Continuous Integration (CI): Developers merge code frequently (multiple times a day). Every merge triggers an automated build and test run to catch integration issues early.

Continuous Delivery (CD): Every passing build is automatically prepared for release to production. A human approves the final deployment step.

Continuous Deployment: Extends Delivery — every passing build is automatically deployed to production with no human intervention.

Easy Associate Level Docker
Q:

What is Docker Compose and when would you use it?

Docker Compose is a tool for defining and running multi-container applications using a YAML file. It is ideal for local development and testing where you need to spin up interdependent services (app + database + cache) with a single command.

docker compose up -d

It handles networking (all services in the same file can reach each other by service name), volume management, and environment variables. For production workloads, an orchestrator such as Kubernetes is the more common choice.

Easy Associate Level Kubernetes
Q:

What is the difference between a Pod and a Deployment in Kubernetes?

A Pod is the smallest deployable unit in Kubernetes. It wraps one or more containers that share a network namespace and can share storage volumes. However, Pods on their own are ephemeral: a bare Pod that dies or loses its node is not recreated.

A Deployment is a higher-level abstraction that manages Pods. It ensures a specified number of Pod replicas are running at all times, handles rolling updates, and allows rollbacks. You almost never create bare Pods in production; you use Deployments instead.

kubectl create deployment nginx --image=nginx:1.25 --replicas=3
Easy Associate Level Kubernetes
Q:

Explain the role of ‘Sidecar’ containers in Kubernetes pod architecture.

A sidecar container is a secondary container that runs along with the main application container within the same pod. It is used to extend and enhance the functionality of the main container, such as by providing logging, monitoring, or proxy services.

Troubleshooting Scenarios

Live system debugging, incident diagnostics, and latency resolution.

Easy Associate Level Observability
Q:

What is an error budget and how do SRE teams use it?

An error budget is the allowable amount of unreliability in a service, derived from the SLO. If your SLO is 99.9% uptime, your error budget is 0.1% — about 43 minutes of downtime per month.

How teams use it:

  • When error budget is healthy → deploy freely, take risks, ship features.
  • When error budget is low → slow down deployments, prioritize reliability work.
  • When budget is exhausted → freeze all non-critical deployments until reliability improves.

Error budgets create a shared language between product (wants to ship) and SRE (wants reliability). It’s objective, not political.
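
The arithmetic behind the 43-minute figure can be written out as a small sketch. The function names and the 30-day window are assumptions; teams often use a rolling 28-day or calendar-month window instead.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once exhausted)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # 30 days = 43,200 minutes; 0.1% of that is roughly 43.2 minutes.
    print(round(error_budget_minutes(99.9), 1))
    # About 21.6 minutes of downtime spends half of the budget.
    print(round(budget_remaining(99.9, 21.6), 2))
```

A remaining fraction near zero is the signal to slow down deployments, and a negative value corresponds to the freeze described above.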

Easy Associate Level Kubernetes
Q:

What are resource requests and limits in Kubernetes, and why are they important?

Requests tell the Kubernetes scheduler how much CPU/memory to reserve for a pod when scheduling it onto a node. Limits are the hard caps: a container that exceeds its CPU limit is throttled, and one that exceeds its memory limit is OOM-killed.

resources:
  requests:
    memory: "128Mi"
    cpu: "250m"
  limits:
    memory: "256Mi"
    cpu: "500m"

Always set both. Without requests, the scheduler cannot make good placement decisions. Without limits, a runaway container can starve other workloads on the same node (the “noisy neighbor” problem).
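
The scheduler's use of requests can be approximated with a short sketch. It is a simplified model, not the real scheduler: the parsers handle only the "m", "Ki", "Mi", and "Gi" suffixes, and all function names are hypothetical.

```python
from typing import Dict


def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity such as "250m" or "2" into cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000  # millicores to cores
    return float(quantity)


_MEMORY_SUFFIXES = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as "128Mi" into bytes (subset of suffixes)."""
    for suffix, factor in _MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain bytes


def fits_on_node(pod_requests: Dict[str, str], node_free: Dict[str, str]) -> bool:
    """Placement check: requests, not limits, must fit in the node's free capacity."""
    return (
        parse_cpu(pod_requests["cpu"]) <= parse_cpu(node_free["cpu"])
        and parse_memory(pod_requests["memory"]) <= parse_memory(node_free["memory"])
    )
```

Note that limits do not appear in the placement check at all, which is why a pod without requests gives the scheduler nothing to reason about.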