Beginner Questions
Core concepts, syntax, and foundational command-line knowledge.
What is the difference between authentication and authorization?
Authentication (AuthN): Verifying the identity of a user or service. “Who are you?” Authentication happens first — you prove your identity with a password, token, certificate, or biometric.
Authorization (AuthZ): Determining what an authenticated identity is allowed to do. “What can you do?” Authorization happens after authentication — once we know who you are, we check your permissions.
Example in AWS: You authenticate to AWS with your access key (AuthN). Then AWS checks your IAM policies to see if you’re authorized to call s3:PutObject (AuthZ). Both can fail independently.
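The two checks can be sketched as separate functions. This is a minimal illustration, not how AWS implements IAM; the `API_KEYS` and `PERMISSIONS` tables and the `handle_request` helper are hypothetical names made up for the example.

```python
import hmac

# Hypothetical in-memory stores, used only for illustration.
API_KEYS = {"alice": "key-alice-123"}        # identity -> secret
PERMISSIONS = {"alice": {"s3:GetObject"}}    # identity -> allowed actions


def authenticate(identity: str, secret: str) -> bool:
    """AuthN: does the caller hold the secret registered for this identity?"""
    expected = API_KEYS.get(identity)
    return expected is not None and hmac.compare_digest(expected, secret)


def authorize(identity: str, action: str) -> bool:
    """AuthZ: is this (already authenticated) identity allowed to do this?"""
    return action in PERMISSIONS.get(identity, set())


def handle_request(identity: str, secret: str, action: str) -> int:
    if not authenticate(identity, secret):
        return 401  # identity not proven
    if not authorize(identity, action):
        return 403  # identity proven, permission missing
    return 200
```

The distinct 401 and 403 results mirror the point that the two steps fail independently.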
What is multi-factor authentication (MFA) and why should it be enforced for cloud accounts?
MFA requires two or more verification factors from different categories: something you know (password), something you have (TOTP app, hardware key), or something you are (biometric). Even if a password is compromised, an attacker still needs the second factor, which blocks most password-reuse and credential-stuffing attacks.
For AWS/cloud accounts:
- Enforce MFA on the root account immediately and don’t use it routinely
- Require MFA for IAM users via SCP or IAM policy condition
- Use hardware MFA keys (YubiKey) for privileged accounts
- Enable AWS Organizations SCPs to deny API calls unless MFA is present
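To make the "something you have" factor concrete, here is a sketch of how a TOTP app derives its code, following RFC 4226 (HOTP) and RFC 6238 (TOTP) with HMAC-SHA1. The function names are chosen for this example; production systems should use a maintained library rather than this sketch.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password (RFC 4226)."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation: low nibble of the last byte
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, at: Optional[float] = None,
         step: int = 30, digits: int = 6) -> str:
    """Time-based OTP (RFC 6238): HOTP over the current 30-second window."""
    now = time.time() if at is None else at
    return hotp(secret, int(now // step), digits)
```

Server and device share the secret once at enrollment; after that, each side computes the code independently from the clock, so a stolen password alone is not enough.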
What is TLS/SSL and why is it important for DevOps engineers to understand it?
TLS (Transport Layer Security) encrypts communication between clients and servers, preventing eavesdropping and man-in-the-middle attacks. It replaced the deprecated SSL protocol.
DevOps engineers encounter TLS in:
- Configuring HTTPS for web services (Let’s Encrypt, ACM in AWS)
- Kubernetes Ingress TLS termination
- mTLS between microservices (Istio, Linkerd)
- Certificate rotation — expired certs cause outages
- Internal PKI for service-to-service auth
Automate certificate issuance and renewal with cert-manager in Kubernetes or AWS Certificate Manager. Never depend on manual renewal.
What is the purpose of /etc/hosts and how does DNS resolution work in Linux?
DNS resolution order in Linux is set by the hosts line in /etc/nsswitch.conf (typically "files dns"):
- /etc/hosts: Local overrides. Checked first under the typical order. Maps hostnames to IPs without a DNS lookup.
- DNS servers (/etc/resolv.conf): The configured nameservers are queried on port 53, over UDP by default with TCP used for large responses.
Common use cases for /etc/hosts: local development overrides, blocking domains by pointing to 127.0.0.1, testing service connectivity using a service name before DNS is configured. In Kubernetes pods, the kubelet manages /etc/hosts, while cluster DNS (CoreDNS) is reached through the nameserver written into the pod's /etc/resolv.conf.
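The "hosts file first, then DNS" order can be sketched as follows. This is a simplified model, not the glibc resolver: `parse_hosts` and `resolve` are illustrative names, and the DNS step is passed in as a callable so the example stays self-contained.

```python
from typing import Callable, Dict


def parse_hosts(text: str) -> Dict[str, str]:
    """Parse /etc/hosts-style text into {hostname: ip}; the first entry wins."""
    table: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            table.setdefault(name.lower(), ip)
    return table


def resolve(name: str, hosts: Dict[str, str],
            dns_lookup: Callable[[str], str]) -> str:
    """Check the hosts table first, then fall back to the DNS lookup."""
    ip = hosts.get(name.lower())
    return ip if ip is not None else dns_lookup(name)
```

In real use the fallback could be `socket.gethostbyname`, although that call itself already consults /etc/hosts through the system resolver.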
What is the difference between processes and threads in Linux?
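A process is an independent program in execution with its own memory space, file descriptors, and system resources. Creating a new process (fork()) is more expensive than creating a thread, because the kernel must set up a separate address space (even with copy-on-write).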
A thread is a unit of execution within a process. Threads within the same process share the same memory space and open file descriptors, making communication between them fast. Thread creation is lighter than process creation.
In Linux, threads are implemented as “lightweight processes” created with the clone() system call. Tools like htop can show threads per process.
What is the difference between a hard link and a symbolic (soft) link in Linux?
Hard Link: A directory entry that points directly to the same inode as the original file. Both the original and the hard link are indistinguishable — deleting one doesn’t affect the other. Hard links cannot span filesystems or link to directories.
Symbolic (Soft) Link: A pointer to another file’s path. If the original is deleted, the symlink becomes a broken “dangling” link. Symlinks can cross filesystems and point to directories.
# Hard link
ln original.txt hardlink.txt
# Symbolic link
ln -s /etc/nginx/sites-available/mysite /etc/nginx/sites-enabled/mysite
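The behavioral difference is easy to observe in a scratch directory. The sketch below assumes a Linux filesystem that supports both link types; `link_demo` is a name chosen for this example.

```python
import os
import tempfile


def link_demo() -> dict:
    """Create a file, a hard link and a symlink, then delete the original."""
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "original.txt")
        hard = os.path.join(d, "hardlink.txt")
        soft = os.path.join(d, "symlink.txt")
        with open(original, "w") as f:
            f.write("hello")
        os.link(original, hard)      # same as: ln original.txt hardlink.txt
        os.symlink(original, soft)   # same as: ln -s original.txt symlink.txt

        same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
        link_count = os.stat(original).st_nlink

        os.remove(original)          # removes one name, not the data
        with open(hard) as f:
            hard_content = f.read()
        # islink() inspects the link itself; exists() follows it to the target.
        dangling = os.path.islink(soft) and not os.path.exists(soft)

        return {
            "same_inode": same_inode,
            "link_count": link_count,
            "hard_content": hard_content,
            "symlink_dangling": dangling,
        }
```

After the original name is removed, the hard link still reads the data because the inode survives, while the symlink points at a path that no longer exists.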
What is the difference between monitoring and observability?
Monitoring is about tracking known failure modes. You define metrics and alerts for things you know can go wrong. It answers: “Is this thing I’m watching broken?”
Observability is about understanding system behavior from its outputs. It allows you to answer questions you didn’t think to ask beforehand — debugging novel failures you’ve never seen before.
Monitoring tells you something is wrong. Observability tells you why. You need both, but as systems grow more complex, observability becomes more critical for understanding emergent failures.
What are the three pillars of observability?
The three pillars of observability are:
- Metrics: Numerical measurements aggregated over time (CPU usage, request rate, error rate). Good for dashboards and alerting on trends.
- Logs: Timestamped records of discrete events. Good for debugging specific incidents and understanding what happened.
- Traces: Records of a request’s journey through a distributed system. Essential for finding bottlenecks and understanding service dependencies in microservices.
Together they answer: Is something wrong? (metrics), What is wrong? (logs), Where and why is it wrong? (traces).
What is the AWS Shared Responsibility Model?
AWS and customers share security responsibilities — the line depends on the service type:
AWS is responsible for: Security “of” the cloud — physical data centers, hypervisors, networking hardware, managed service infrastructure.
You are responsible for: Security “in” the cloud — your operating systems, your application code, IAM configurations, data encryption, network configuration (VPC, security groups), and patching guest OS on EC2.
For managed services like RDS or Lambda, AWS takes on more responsibility (OS patching), but you still own IAM, data, and network controls.
What is the difference between S3 Standard, S3 Infrequent Access, and S3 Glacier?
AWS S3 offers storage classes with different cost/access tradeoffs:
- Standard: High durability, low latency, high throughput. For frequently accessed data.
- Standard-IA (Infrequent Access): Same latency as Standard but cheaper storage cost. Higher per-retrieval cost. Use for data accessed less than once a month.
- Glacier Instant Retrieval: For archive data accessed a few times per year. Millisecond retrieval.
- Glacier Deep Archive: Lowest cost. Standard retrieval completes within 12 hours (bulk retrieval within 48 hours). Use for compliance/regulatory long-term retention.
Use S3 Lifecycle Policies to automatically transition objects between classes based on age.
What is the difference between IAM users, groups, roles, and policies in AWS?
Users: Individual identities for people or applications with long-term credentials (access key + secret).
Groups: Collections of users that share the same permissions. Manage permissions at group level, not individually.
Roles: Identities assumed temporarily by AWS services (EC2, Lambda), federated users, or cross-account access. No long-term credentials — they use short-lived tokens. This is the preferred approach.
Policies: JSON documents that define permissions. Attached to users, groups, or roles.
Best practice: Always use roles over users for AWS service authentication.
What is Infrastructure as Code (IaC) and what are its main benefits?
Infrastructure as Code means managing and provisioning infrastructure through machine-readable configuration files instead of manual processes.
Key benefits:
- Reproducibility: Spin up identical environments on demand.
- Version control: Track all infrastructure changes in Git. Know who changed what and when.
- Auditability: Compliance teams can review what infrastructure is being provisioned.
- Self-documentation: The code is the documentation.
- Disaster recovery: Re-create an entire environment from scratch in minutes.
What is the difference between Docker COPY and ADD instructions?
Both copy files into the image, but ADD has extra functionality that makes it unpredictable:
- ADD can fetch files from a URL
- ADD auto-extracts tar archives into the destination
Best practice: Always use COPY unless you specifically need the URL or auto-extraction features. COPY is explicit and predictable, which is better for reproducible builds.
What is the purpose of ENTRYPOINT vs CMD in a Dockerfile?
CMD provides default arguments for the container. It can be overridden by passing arguments to docker run.
ENTRYPOINT defines the fixed command that always runs. It cannot be overridden without the --entrypoint flag.
Best practice: Use ENTRYPOINT for the executable and CMD for default arguments, making the container behave like a command-line tool:
ENTRYPOINT ["python", "app.py"]
CMD ["--port", "8080"]
# docker run myapp --port 9090 ← overrides CMD only
What is the difference between a Docker image and a Docker container?
A Docker image is a read-only template built from a Dockerfile. Think of it as a class definition. A container is a running instance of that image — a class instantiation. You can run many containers from the same image, each isolated from the others.
# Build an image
docker build -t my-app:1.0 .
# Run a container from that image
docker run -d -p 8080:80 my-app:1.0
What is a ConfigMap and when would you use it over an environment variable?
A ConfigMap stores non-sensitive configuration data as key-value pairs. It decouples your configuration from your container image.
Use ConfigMaps over hardcoded env vars when:
- Config needs to differ between environments (dev/staging/prod)
- Multiple pods share the same configuration
- You need to mount config as a file (e.g., nginx.conf, prometheus.yml)
For sensitive data like passwords, use a Secret instead of a ConfigMap.
Intermediate Questions
Infrastructure management, deployment strategies, and delivery flows.
What is a WAF and when should you use AWS WAF vs Cloudflare?
A Web Application Firewall (WAF) filters and monitors HTTP traffic to protect against common attacks: SQL injection, XSS, DDoS, bad bots.
AWS WAF: Tight integration with CloudFront, ALB, API Gateway. Managed rule groups for OWASP, AWS managed rules. Good if you’re AWS-native. Can use IP reputation lists and rate-limiting rules.
Cloudflare: Operates at the DNS/edge level before traffic reaches AWS. Better DDoS mitigation due to Cloudflare’s massive global network. Simpler setup. Bot management is more mature.
In practice: Use Cloudflare as the outer layer for DDoS and global edge, then AWS WAF at the ALB for application-layer filtering. Defense in depth.
What is a CVE, and how do you track and remediate vulnerabilities in your infrastructure?
A CVE (Common Vulnerabilities and Exposures) is a public identifier for a known security vulnerability. Each CVE has a severity score (CVSS 0-10).
Tracking and remediation workflow:
- Discovery: Continuous scanning — Trivy/Snyk in CI for container images, Dependabot for code dependencies, AWS Inspector for EC2.
- Triage: Not all CVEs require immediate action. Prioritize by CVSS score, exploitability, and whether the vulnerable code path is actually used.
- Remediation: Update base image, update dependency, or apply vendor patch.
- Tracking: Log CVEs in your ticketing system with SLA (e.g., Critical = 24h, High = 7 days).
Explain file permissions in Linux (rwx, octal notation) and when to use sticky bit/setuid.
Linux file permissions have three sets: owner, group, others. Each can have: read (4), write (2), execute (1).
-rwxr-xr-- = 754
# Owner: rwx (7), Group: r-x (5), Others: r-- (4)
chmod 755 script.sh # Standard executable
chmod 644 config.yml # Standard config file
Special bits:
- Sticky bit (1xxx): On directories (e.g., /tmp), only the file owner can delete their own files: chmod +t /shared
- Setuid (4xxx): File executes with the owner’s permissions (used by /usr/bin/passwd to write /etc/shadow as root). Use with extreme caution.
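The octal arithmetic can be checked with a few lines of code. This sketch handles only the nine basic rwx characters (no setuid, setgid, or sticky letters); `symbolic_to_octal` is a name invented for the example.

```python
def symbolic_to_octal(perms: str) -> str:
    """Convert nine permission characters such as 'rwxr-xr--' to '754'."""
    if len(perms) != 9:
        raise ValueError("expected 9 characters, e.g. 'rwxr-xr--'")
    weights = {"r": 4, "w": 2, "x": 1, "-": 0}
    digits = []
    for i in range(0, 9, 3):  # owner, group, others
        digits.append(str(sum(weights[c] for c in perms[i:i + 3])))
    return "".join(digits)
```

For example, `symbolic_to_octal("rw-r--r--")` gives the `644` used for config files above.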
How do you write effective Prometheus alerting rules?
An effective Prometheus alerting rule looks like this:
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        # Aggregate before dividing so the 5xx rate is compared with the total rate per service
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (service) (rate(http_requests_total[5m])) > 0.05
        for: 5m  # Must be true for 5 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }}"
          runbook: "https://wiki.internal/runbooks/high-error-rate"
Key practices: Use the for clause to avoid alerting on momentary spikes. Aggregate with sum by (...) before dividing, otherwise the division only matches series with identical labels. Always include a runbook link. Use human-readable messages with $labels and $value.
What is Prometheus and how does its pull-based model differ from push-based monitoring?
Prometheus is an open-source metrics monitoring system with a time-series database.
Pull-based (Prometheus): Prometheus actively scrapes metrics from targets at regular intervals. Targets expose a /metrics HTTP endpoint. Benefits: Prometheus controls the scraping schedule, easy to detect if a target is down, no credentials needed on the target side.
Push-based (StatsD, Graphite): Applications push metrics to a central collector. Better for short-lived jobs (like batch scripts) that may end before Prometheus scrapes them. Use Prometheus Pushgateway for these use cases.
What is an SLO, SLA, and SLI, and how do they relate to each other?
SLI (Service Level Indicator): An actual measurement of service behavior. Example: the percentage of successful HTTP requests.
SLO (Service Level Objective): The target for your SLI. Example: 99.9% of requests should succeed over a rolling 30-day window.
SLA (Service Level Agreement): A contractual commitment, usually set looser than the internal SLO, with defined consequences for missing it. Example: If availability drops below 99.9%, AWS credits customers.
In practice: define SLIs → set SLO targets → the SLA is what you promise externally. Your internal error budget is 100% - SLO.
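As a worked example of the error budget arithmetic, the sketch below converts an availability SLO into allowed downtime. It assumes a time-based SLI and a 30-day window; the function name is chosen for illustration.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability: window * (100% - SLO)."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.99):
        print(f"{slo}% over 30 days -> {error_budget_minutes(slo):.2f} minutes")
```

A 99.9% SLO over 30 days leaves roughly 43 minutes of budget, and each extra nine cuts that by a factor of ten.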
What is the difference between a Prometheus Gauge, Counter, and Histogram metric type?
Counter: A cumulative value that only increases (or resets to zero on restart). Use for: total requests, total errors, bytes sent. Never use for values that can go down.
Gauge: A value that can go up or down. Use for: current memory usage, active connections, queue depth, temperature.
Histogram: Samples observations and counts them in configurable buckets. Use for: request latency, response sizes. Lets you estimate percentiles (p50, p95, p99) with histogram_quantile(), which is critical for SLOs.
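Percentiles from a histogram are estimates, because only bucket counts are stored. The sketch below shows the idea with linear interpolation inside the matching bucket, in the spirit of Prometheus's histogram_quantile(). It is a simplified model that assumes cumulative counts, a final +Inf bucket, and a lower bound of 0 for the first bucket.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into +Inf
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

The accuracy depends entirely on bucket boundaries, so choose buckets that bracket your SLO threshold closely.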
What is AWS ECS and when would you choose it over EKS?
ECS (Elastic Container Service) is AWS’s native container orchestrator. EKS (Elastic Kubernetes Service) is managed Kubernetes.
Choose ECS when:
- Your team is AWS-native and doesn’t have Kubernetes expertise
- You want lower operational overhead (no Kubernetes control plane concepts to manage)
- Tight AWS service integration is a priority (IAM roles per task, ALB integration is simpler)
Choose EKS when:
- You need Kubernetes-native features (CRDs, Operators, Helm ecosystem)
- You have multi-cloud or hybrid requirements
- Your team already has Kubernetes expertise
Explain AWS VPC and its core components (subnets, route tables, IGW, NAT).
A VPC (Virtual Private Cloud) is your isolated network within AWS.
- Subnets: Subdivisions of your VPC in a specific AZ. Public subnets have a route to the IGW; private subnets do not.
- Route Tables: Rules defining where traffic is directed. A public subnet’s route table has 0.0.0.0/0 → IGW.
- Internet Gateway (IGW): Allows public subnets to communicate with the internet.
- NAT Gateway: Allows private subnets to make outbound internet requests (e.g., pulling packages) without exposing them to inbound internet traffic.
What is the difference between an AWS Security Group and a Network ACL?
Security Groups (SGs): Stateful firewalls at the instance level. If you allow inbound traffic, the corresponding outbound response is automatically allowed. Rules are allow-only (no deny rules).
Network ACLs (NACLs): Stateless firewalls at the subnet level. You must explicitly allow both inbound and outbound traffic. Rules are evaluated in order (by rule number) and support both allow and deny.
In practice: Use Security Groups for most use cases. Use NACLs as an additional layer for blocking specific IP ranges (e.g., blocking a bad actor’s IP at the subnet boundary).
How do you handle sensitive values like passwords in Terraform without exposing them in state?
Terraform state files contain sensitive values in plaintext — this is a known limitation. Mitigations:
- Mark as sensitive: sensitive = true on variables and outputs prevents them from appearing in CLI output. The values are still written to state.
- Avoid storing in state: Use AWS Secrets Manager or Vault to generate and store secrets externally. Reference via data source or environment variable.
- Encrypt state: S3 backend with server-side encryption (SSE-KMS).
- Restrict access: The S3 bucket containing state should have strict IAM policies — only CI/CD roles should have access.
How do Terraform modules work and what makes a good module?
A Terraform module is a reusable group of resource configurations. Every directory with .tf files is a module. You call modules from a root module to avoid repeating code.
What makes a good module:
- Single responsibility: One module for VPC, another for EKS, another for RDS.
- Parameterized: Accept variables to customize behavior per environment.
- Versioned: Pin module versions (the version argument for registry modules, or a ?ref= tag in a Git source).
- Outputs: Expose useful outputs (VPC ID, subnet IDs) for other modules to consume.
What is Terraform state and why must it be stored remotely in a team environment?
Terraform state is a JSON file (terraform.tfstate) that maps your configuration to real-world resources. Terraform uses it to know what already exists before planning changes.
Storing it locally breaks team collaboration:
- Team members would each have different state files causing conflicts
- State file gets lost if the local machine breaks
- No locking mechanism: two engineers could run apply simultaneously and corrupt state
Remote backends (S3 + DynamoDB for locking, GCS, Terraform Cloud) solve all three problems.
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-lock"
  }
}
Explain the concept of a distroless image and its security benefits.
A distroless image contains only your application and its runtime dependencies — no shell, no package manager, no OS utilities. This comes from Google’s distroless project.
Security benefits: There is no shell to exec into, so an attacker who compromises the application cannot simply run arbitrary commands. The attack surface is dramatically reduced because there are no standard Unix tools an attacker could use to move laterally.
# Distroless multi-stage example
FROM golang:1.22 AS builder
WORKDIR /app
COPY . .
RUN CGO_ENABLED=0 go build -o server .
FROM gcr.io/distroless/static-debian12
COPY --from=builder /app/server /server
CMD ["/server"]
How do you reduce Docker image size? Walk through your optimization strategy.
Image size directly affects pull times and attack surface. Key strategies:
- Use minimal base images: alpine or distroless instead of ubuntu.
- Multi-stage builds: Build in a full image, copy only the binary/artifact to a slim final image.
- Combine RUN commands: Each RUN creates a layer. Chain commands with && and clean up in the same layer.
- Use .dockerignore: Exclude node_modules, .git, test files from the build context.
# Multi-stage example
FROM node:20 AS builder
WORKDIR /app
COPY . .
RUN npm ci && npm run build
FROM node:20-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
CMD ["node", "dist/index.js"]
Explain the difference between a Liveness probe, Readiness probe, and Startup probe.
Liveness Probe: Checks if the container is alive. If it fails, Kubernetes restarts the container. Use this to recover from deadlocks.
Readiness Probe: Checks if the container is ready to serve traffic. If it fails, the pod is removed from Service endpoints (no traffic sent). Use this during slow startup or when temporarily overloaded.
Startup Probe: Only runs at startup. Allows slow-starting containers enough time to initialize before liveness checks begin. Prevents liveness probes from killing a pod that is simply starting up slowly.
What is a Kubernetes Ingress and how does it differ from a Service?
A Service exposes a set of pods internally or as a simple LoadBalancer. An Ingress is a Layer-7 (HTTP/HTTPS) routing rule that sits in front of multiple services and routes traffic based on hostname or path.
Example: Route api.example.com to the api-service and example.com to the frontend-service using a single load balancer IP. This is far more cost-effective than having a separate LoadBalancer service for each microservice.
Advanced Questions
Enterprise orchestration, deep architectural concepts, and scaling issues.
Explain the OWASP Top 10 and which items are most relevant to DevOps engineers.
The OWASP Top 10 lists the most critical web application security risks (the numbering below follows the 2021 edition). Most relevant to DevOps:
- A01: Broken Access Control — Enforce least privilege in IAM, K8s RBAC. Verify RBAC policies in code review.
- A05: Security Misconfiguration — Public S3 buckets, default credentials, exposed management ports. Caught by infrastructure scanning tools like Checkov, tfsec.
- A06: Vulnerable Components — Use Dependabot and Trivy to catch outdated dependencies with known CVEs.
- A09: Security Logging Failures — Ensure CloudTrail, K8s audit logs, and application audit logs are enabled and shipped to a SIEM.
What is a Load Average in Linux and how do you interpret it?
Load average in top or uptime shows three numbers: 1-minute, 5-minute, and 15-minute averages of the number of processes in a runnable or uninterruptible state.
Interpretation depends on the number of CPU cores. On a 4-core server:
- Load average of 4.0 = 100% utilization — every CPU busy but nothing waiting
- Load average of 8.0 = 200% utilization — 4 CPUs busy, 4 processes waiting in queue
- Load average of 0.5 = 12.5% utilization — plenty of headroom
Key insight: High load average is NOT always CPU. Uninterruptible sleep (disk I/O wait) also counts. Check iostat to distinguish CPU saturation from I/O saturation.
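The per-core interpretation can be sketched as a small helper. The function names are illustrative, and "saturated" here just means more runnable or blocked tasks than cores, which by itself does not say whether the cause is CPU or I/O.

```python
import os


def load_per_core(load_average: float, cores: int) -> float:
    """Normalize a load average by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be >= 1")
    return load_average / cores


def is_saturated(load_average: float, cores: int) -> bool:
    """True when more tasks are runnable or blocked than there are cores."""
    return load_per_core(load_average, cores) > 1.0


if __name__ == "__main__":
    try:
        one, five, fifteen = os.getloadavg()  # same numbers as uptime
    except OSError:
        one = five = fifteen = 0.0            # load average unavailable
    cores = os.cpu_count() or 1
    print(f"1m load per core: {load_per_core(one, cores):.2f}")
```

Comparing the 1-minute and 15-minute values per core also shows whether load is rising or falling.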
What are Linux namespaces and cgroups, and how do they enable container isolation?
Namespaces provide isolation for system resources so each container sees its own view of the system:
- pid: isolated process tree (container sees its own PIDs starting at 1)
- net: isolated network stack (own IP, routing table)
- mnt: isolated filesystem mounts
- uts: isolated hostname
- user: isolated user/group IDs
cgroups (Control Groups) limit and account for resource usage (CPU, memory, I/O) per group of processes. This is how Docker enforces your CPU/memory limits.
Together: namespaces provide isolation (what can be seen), cgroups provide resource limits (how much can be used).
How do you implement least-privilege IAM policies and why is it critical?
Least-privilege means granting only the exact permissions needed to perform a task — no more. This limits blast radius if credentials are compromised.
Implementation steps:
- Start with deny-all, add allows: Begin with minimal permissions and add only what’s needed.
- IAM Access Analyzer: Use to identify unused permissions and generate least-privilege policies based on CloudTrail logs.
- Policy conditions: Add StringEquals conditions to restrict resources by tag, region, or account.
- Permission boundaries: Cap the maximum permissions a principal can have, even if attached policies are more permissive.
"Condition": {
"StringEquals": {
"aws:RequestedRegion": "us-east-1"
}
}
What are Terraform providers and how do you handle provider version pinning?
Providers are plugins that translate Terraform configuration into API calls to AWS, GCP, Azure, etc. Always pin provider versions to prevent unexpected changes from provider upgrades:
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # Allows 5.x but not 6.x
    }
  }
  required_version = ">= 1.7.0"
}
provider "aws" {
  region = "us-east-1"
}
terraform init records the selected provider versions and checksums in .terraform.lock.hcl. Run terraform providers lock to add checksums for other platforms (for example, CI runners on a different OS). Commit this file to Git.
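To see what the ~> operator accepts, here is a sketch of the pessimistic constraint rule: only the rightmost component given may increase. It handles plain numeric versions with two or three components and ignores pre-release suffixes; it is an illustration, not Terraform's own version parser.

```python
def satisfies_pessimistic(version: str, constraint: str) -> bool:
    """Check a version against a '~> X.Y' or '~> X.Y.Z' constraint."""
    base = [int(p) for p in constraint.replace("~>", "").strip().split(".")]
    if len(base) < 2:
        raise ValueError("this sketch expects at least two components")
    ver = [int(p) for p in version.split(".")]
    # Upper bound: drop the last component and bump the one before it,
    # so '~> 5.0' becomes < 6 and '~> 5.0.1' becomes < 5.1.
    upper = base[:-1]
    upper[-1] += 1
    return base <= ver < upper  # lists compare component by component
```

So `~> 5.0` floats across minor releases, while `~> 5.0.1` only accepts patch releases within 5.0.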
How do you manage multiple environments (dev/staging/prod) in Terraform? Workspaces vs. directory structure.
Two main approaches:
Terraform Workspaces: Use the same code but switch workspace to change state. Simple, but the same code runs for all environments — hard to have different variable values per environment. Suitable for simple differences.
Separate Directories (recommended): Each environment has its own directory with its own terraform.tfvars and remote state. This is explicit, auditable, and allows environments to diverge safely.
environments/
  dev/
    main.tf          → calls shared module
    terraform.tfvars
  staging/
    main.tf
    terraform.tfvars
  prod/
    main.tf
    terraform.tfvars
modules/
  vpc/
  eks/
How do you implement health checks in Docker and why are they important for orchestration?
The HEALTHCHECK instruction tells Docker how to test if a container is working correctly. Without it, Docker only knows whether the main process is running, so a container whose app is hung or failing every request still looks fine.
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
CMD curl -f http://localhost:8080/health || exit 1
In Kubernetes, this is replaced by Liveness and Readiness probes. In Docker Compose or standalone Docker, HEALTHCHECK is critical for orchestration tools to know whether to send traffic to a container.
How would you run containers as a non-root user for security hardening?
Running containers as root is a significant security risk. If an attacker escapes the container, they have root on the host. Harden your images:
FROM node:20-alpine
# Create a non-root user
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
# Set working directory and permissions
WORKDIR /app
COPY --chown=appuser:appgroup . .
# Switch to non-root user
USER appuser
CMD ["node", "index.js"]
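Also enforce this at the Kubernetes level with a SecurityContext: runAsNonRoot: true. Kubernetes can only verify this when the image sets a numeric UID (for example USER 1000) or the pod sets runAsUser.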
Explain Kubernetes RBAC and how you would give a service account read-only access to pods.
RBAC (Role-Based Access Control) is the authorization mechanism in Kubernetes. It uses three objects:
- Role/ClusterRole: Defines what actions are allowed on which resources.
- ServiceAccount: An identity for pods or external tools.
- RoleBinding/ClusterRoleBinding: Links a ServiceAccount to a Role.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: default  # Example namespace; a Role is always namespaced
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: default
subjects:
  - kind: ServiceAccount
    name: my-service-account
    namespace: default  # Required for ServiceAccount subjects
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
Real Production Scenarios
Real-world architecture, system migration, and design challenges.
What is Zero Trust Architecture and how does it apply to DevOps?
Zero Trust is a security model based on “never trust, always verify.” Traditional networks trusted everything inside the perimeter. Zero trust assumes the network is already compromised.
Zero Trust principles in DevOps:
- Identity-based access: Every service authenticates. No implicit trust based on network location.
- Least privilege: Minimal permissions for every identity, re-evaluated regularly.
- Micro-segmentation: Kubernetes NetworkPolicies and service meshes with mTLS between every service.
- Device trust: Verify developer machines with fleet management (Jamf, Intune) before allowing access to internal systems.
- Continuous verification: Short-lived credentials. Re-authenticate frequently.
How do you implement security scanning in a GitHub Actions CI/CD pipeline?
A comprehensive security scanning pipeline:
jobs:
  security:
    runs-on: ubuntu-latest
    steps:
      # SAST: static code analysis
      - uses: actions/checkout@v4
      - name: Run Semgrep
        uses: returntocorp/semgrep-action@v1
      # Dependency scanning
      - name: Run Snyk
        uses: snyk/actions/node@master
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
      # Container image scanning
      - name: Build image
        run: docker build -t myapp:${{ github.sha }} .
      - name: Run Trivy
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: myapp:${{ github.sha }}
          severity: CRITICAL,HIGH
          exit-code: 1
      # IaC scanning
      - name: Run tfsec
        uses: aquasecurity/tfsec-action@v1.0.0
What is a bastion host (jump server) and what are the modern alternatives?
A bastion host is a dedicated, hardened server in a public subnet used as the only entry point for SSH/RDP into private subnet resources. All access is logged and audited.
Modern, better alternatives:
- AWS Systems Manager Session Manager: Shell access to EC2 over HTTPS through the AWS API. No open port 22 required. Sessions can be logged to CloudWatch/S3. IAM-controlled access.
- Teleport: Open-source access platform with MFA, session recording, and role-based access for SSH, Kubernetes, databases, and web applications.
- Tailscale / WireGuard: Zero-config VPN mesh that avoids exposing any servers publicly.
How do you implement secrets rotation without downtime?
Secret rotation is a critical security practice. Zero-downtime rotation process:
- Generate new secret without invalidating the old one (e.g., create a new DB user, or generate a new API key that coexists with the old one).
- Update secret store (AWS Secrets Manager, Vault) with the new value.
- Rotate applications: Applications use External Secrets Operator or Vault Agent to pick up new values. Configure TTL on cached secrets so they refresh within minutes.
- Verify: Confirm all services are using the new secret.
- Revoke old secret.
AWS Secrets Manager has native rotation with Lambda functions for RDS passwords. This can be fully automated.
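The key to zero downtime is the overlap window in which both the old and the new secret are accepted. The sketch below models that window on the verifying side; `RotatingSecret` is a hypothetical class for illustration, not an API of Secrets Manager or Vault.

```python
import hmac
from typing import Optional


class RotatingSecret:
    """Accepts the current secret and, during rotation, the previous one."""

    def __init__(self, current: bytes) -> None:
        self.current = current
        self.previous: Optional[bytes] = None

    def rotate(self, new_secret: bytes) -> None:
        # Steps 1-2: introduce the new secret without invalidating the old.
        self.previous, self.current = self.current, new_secret

    def revoke_previous(self) -> None:
        # Step 5: run only after all clients are confirmed on the new secret.
        self.previous = None

    def verify(self, candidate: bytes) -> bool:
        return any(
            s is not None and hmac.compare_digest(s, candidate)
            for s in (self.current, self.previous)
        )
```

Clients that have not refreshed yet keep working between `rotate` and `revoke_previous`, which is what removes the downtime.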
How do you implement network segmentation for a microservices application?
Network segmentation limits the blast radius of a compromise. In a microservices context:
- AWS: Security Groups + VPC design: Place services in private subnets. Use security groups to only allow traffic between services that need to communicate (e.g., allow port 5432 only from the API service to the database SG).
- Kubernetes: NetworkPolicies: Default-deny all inter-pod traffic. Explicitly allow only required paths.
- Service Mesh (Istio/Linkerd): Mutual TLS (mTLS) between all services — all communication is encrypted and authenticated at the network level. Zero-trust networking.
What is SAST vs DAST and where do they fit in a DevSecOps pipeline?
SAST (Static Application Security Testing): Analyzes source code without executing it. Runs early in CI (on every commit/PR). Tools: Semgrep, SonarQube, Bandit (Python), gosec (Go). Fast, no running application needed.
DAST (Dynamic Application Security Testing): Tests the running application by sending malicious inputs and analyzing responses. Runs against a deployed staging environment. Tools: OWASP ZAP, Burp Suite. Finds runtime vulnerabilities that SAST can miss (SQL injection, auth bypass).
DevSecOps pipeline: SAST on PR → build image → Trivy scan → deploy to staging → DAST → promote to prod.
What is the principle of least privilege and why is it critical in DevOps?
The principle of least privilege (PoLP) states that any user, process, or service should only have the minimum permissions necessary to perform its function — nothing more.
In DevOps this applies to:
- IAM roles: A Lambda function that reads from S3 should only have s3:GetObject on that specific bucket, not full S3 access.
- Kubernetes RBAC: A deployment automation service account only needs update permissions on Deployments, not cluster-admin.
- CI/CD tokens: A build token should be able to push to a registry but not manage IAM users.
Blast radius reduction: if credentials are compromised, least privilege limits what an attacker can do.
What is the difference between SSH key authentication and password authentication?
Password authentication: User provides a password. Vulnerable to brute-force attacks, password spraying, and phishing. Should be disabled for SSH in production.
SSH Key authentication: The client proves ownership of a private key without ever transmitting it. The server holds the public key in ~/.ssh/authorized_keys. Private key never leaves the client.
# Generate key pair
ssh-keygen -t ed25519 -C "anmol@devopsinterview.com"
# Copy public key to server
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@server
# Disable password auth in /etc/ssh/sshd_config
PasswordAuthentication no
Use ed25519 keys — they are faster and more secure than RSA 2048.
How do you use awk, sed, and grep together to parse log files?
These three tools form the backbone of Linux log analysis:
# grep: Filter lines containing "ERROR"
grep "ERROR" /var/log/app.log
# awk: Extract specific fields (e.g., column 3 of an NGINX access log)
awk '{print $3}' /var/log/nginx/access.log
# sed: Replace or transform text
sed 's/ERROR/CRITICAL/g' app.log
# Combined pipeline: Find ERROR lines, extract IP (field 1), count by IP
grep "ERROR" /var/log/nginx/access.log \
| awk '{print $1}' \
| sort \
| uniq -c \
| sort -rn \
| head -10
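When the parsing logic outgrows a one-liner, the same pipeline can be expressed in a script. This is a minimal Python sketch of the grep, awk, sort, uniq -c steps above; the function name and the sample log lines are made up for illustration.

```python
from collections import Counter

def top_error_ips(lines, limit=10):
    """Count the first field (client IP) of every line containing 'ERROR'."""
    counts = Counter(line.split()[0] for line in lines if "ERROR" in line)
    # most_common sorts by count, highest first, like `sort -rn | head`.
    return counts.most_common(limit)

sample = [
    "10.0.0.1 - - ERROR upstream timed out",
    "10.0.0.2 - - OK",
    "10.0.0.1 - - ERROR connection refused",
    "10.0.0.3 - - ERROR upstream timed out",
]
print(top_error_ips(sample))
```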
Explain how the Linux kernel handles I/O with the page cache.
The Linux kernel uses the page cache to cache file data in RAM to speed up I/O. When you read a file, the kernel copies it into page cache. Subsequent reads are served from RAM (microseconds) instead of disk (milliseconds).
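Writes are also cached: data is written to the page cache first and then persisted to disk asynchronously (write-back). This is why free -h shows very little “free” memory on a healthy server: much of the RAM sits in the buff/cache column, and the kernel reclaims it as soon as applications need it. This is not a memory leak; look at the available column to judge real memory pressure.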
Relevant commands: vmstat, iostat, /proc/meminfo (Cached, Buffers), echo 3 > /proc/sys/vm/drop_caches to flush cache (dangerous in production).
Write a Bash script to find and delete log files older than 30 days.
#!/bin/bash
# Delete log files older than 30 days in /var/log/myapp
LOG_DIR="/var/log/myapp"
DAYS=30
DRY_RUN=true # Set to false to actually delete
if [ ! -d "$LOG_DIR" ]; then
  echo "Directory $LOG_DIR does not exist" >&2
  exit 1
fi
if [ "$DRY_RUN" = true ]; then
  echo "Dry run: files that would be deleted:"
  find "$LOG_DIR" -type f -name "*.log" -mtime +"$DAYS" -print
else
  echo "Deleting log files older than $DAYS days..."
  find "$LOG_DIR" -type f -name "*.log" -mtime +"$DAYS" -delete
  echo "Done. Current disk usage:"
  df -h "$LOG_DIR"
fi
Always implement a dry run mode. Schedule this with cron or use logrotate for production systems.
What is log aggregation and how do you implement it with the ELK stack?
Log aggregation centralizes logs from all services into one searchable system. The ELK Stack:
- Elasticsearch: Distributed search and analytics engine that indexes and stores logs.
- Logstash: Data processing pipeline that ingests, transforms, and forwards logs.
- Kibana: Web UI for searching, visualizing, and creating dashboards from Elasticsearch data.
Modern replacement: The EFK Stack uses Fluent Bit (lightweight, lower memory than Logstash) as a DaemonSet in Kubernetes to collect container logs and forward to Elasticsearch. Or use Loki (from Grafana Labs) for a simpler, cost-effective log aggregation layer.
What is distributed tracing and how do you implement it with OpenTelemetry?
In a microservices architecture, a single user request touches dozens of services. Distributed tracing follows that request across all services, recording timing and metadata at each step.
OpenTelemetry (OTel) is the vendor-neutral standard for instrumentation. Implementation:
- Add the OTel SDK to each service.
- Services automatically propagate a traceparent header in HTTP calls, linking all spans.
- A collector (OTel Collector) receives spans and routes them to your backend (Jaeger, Zipkin, Tempo, Datadog).
- You can now visualize the full request path, identify slow spans, and pinpoint errors.
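To make the propagation step concrete, here is a small sketch of the W3C traceparent header format (version, trace id, parent span id, flags). In practice the OTel SDK generates and forwards this header for you; the helper names below are hypothetical and exist only to show how spans end up linked.

```python
import secrets

def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: version 00, 16-byte trace id, 8-byte span id, flags."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def child_traceparent(incoming: str) -> str:
    """Keep the caller's trace id, but mint a new span id for the outgoing call."""
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

root = new_traceparent()
downstream = child_traceparent(root)
print(root)
print(downstream)
```

Both headers share the same trace id, which is what lets the backend stitch spans from different services into one request path.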
How do you structure a Grafana dashboard for a production service?
A well-structured production dashboard follows the USE or RED methodology:
RED (for services):
- Rate: Requests per second
- Errors: Error rate (%)
- Duration: Latency (p50, p90, p99)
Top-level layout: Start with an SLO summary panel so on-call knows immediately if SLO is being violated. Then drill-down panels: per-endpoint breakdown, error log links, infrastructure metrics (CPU, memory). Use variables for environment and service selection.
How do you avoid alert fatigue in a large-scale microservices environment?
Alert fatigue happens when teams receive too many alerts, many of which are noise. Engineers start ignoring them — including real critical ones.
Strategies to combat it:
- Symptom-based alerting: Alert on user-facing symptoms (error rate, latency) not causes (CPU high). CPU high does not always mean users are impacted.
- Actionable alerts only: Every alert must have a clear runbook. If there’s no action to take, it shouldn’t be an alert.
- SLO-based alerting: Alert when you’re burning through your error budget too fast.
- Regular alert audits: Review and delete alerts that consistently fire without requiring action.
- Severity tiers: P1 wakes someone up. P3 creates a ticket. Many alerts should be P3.
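The error-budget and severity-tier ideas above can be combined into a burn-rate check. The sketch below is one possible shape, not a standard implementation: the function names are hypothetical and the thresholds (14.4 to page, 3.0 to ticket) are illustrative values you would tune for your own alert windows.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the service is failing."""
    budget = 1.0 - slo  # e.g. 0.001 for a 99.9% SLO
    return error_ratio / budget

def severity(error_ratio: float, slo: float,
             page_at: float = 14.4, ticket_at: float = 3.0) -> str:
    """Map the current burn rate to an alert tier (thresholds are assumptions)."""
    rate = burn_rate(error_ratio, slo)
    if rate >= page_at:
        return "page"
    if rate >= ticket_at:
        return "ticket"
    return "none"

for ratio in (0.02, 0.005, 0.0005):
    print(ratio, severity(ratio, slo=0.999))
```

A burn rate of 1 means the budget would last exactly the SLO window; higher values mean it runs out proportionally sooner.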
What is the difference between horizontal and vertical scaling in AWS?
Vertical Scaling (Scale Up): Increase the size of an existing instance (e.g., t3.medium → c5.4xlarge). Simple but has a ceiling (there’s a maximum instance size). Requires downtime to resize EC2.
Horizontal Scaling (Scale Out): Add more instances behind a load balancer. No theoretical ceiling. Enables high availability and fault tolerance because traffic is spread across multiple instances in multiple AZs.
AWS Auto Scaling Groups with Application Load Balancers enable fully automated horizontal scaling based on metrics like CPU or custom CloudWatch metrics.
What is AWS CloudWatch and what are its main components?
CloudWatch is AWS’s native observability service with four main areas:
- Metrics: Time-series data from AWS services (CPU, NetworkIn, etc.) and custom metrics you publish.
- Logs: CloudWatch Logs for storing, searching, and analyzing log data from EC2, Lambda, ECS, etc.
- Alarms: Alerts triggered when metrics exceed thresholds. Can trigger SNS, Auto Scaling, Lambda.
- Dashboards: Visual widgets to display metrics across services in real-time.
For advanced analytics, ship logs to OpenSearch (ELK) or use CloudWatch Logs Insights for SQL-like queries.
Explain AWS Lambda cold starts and how to mitigate them in production.
A cold start occurs when Lambda needs to initialize a new execution environment — download the code, start the runtime, run your initialization code. This adds 100ms-1s+ of latency on the first request.
Mitigation strategies:
- Provisioned Concurrency: Pre-warm a set number of Lambda execution environments. Eliminates cold starts for warmed instances (at extra cost).
- Minimize package size: Smaller deployment packages initialize faster.
- Use faster runtimes: Node.js and Python cold start faster than Java/C#.
- Move init code outside the handler: DB connections and SDK clients initialized at module level persist across invocations.
- Lambda SnapStart (Java): AWS-managed snapshot of initialized execution environment.
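The "move init code outside the handler" advice can be sketched as follows. The create_db_client function is a stand-in for any expensive setup (a DB connection, an SDK client); it and the counter are hypothetical names used only to show that initialization runs once per execution environment rather than once per request.

```python
import time

INIT_COUNT = 0

def create_db_client():
    """Stand-in for an expensive setup step such as opening a DB connection."""
    global INIT_COUNT
    INIT_COUNT += 1
    time.sleep(0.05)  # simulate slow initialization
    return {"connected": True}

# Runs once when the module is loaded (the cold start), not on every request.
DB_CLIENT = create_db_client()

def handler(event, context):
    """Lambda-style entry point: reuses the client created at import time."""
    return {"statusCode": 200, "db_connected": DB_CLIENT["connected"]}

# Two warm invocations share the same client.
handler({}, None)
handler({}, None)
print(INIT_COUNT)
```

If the client were created inside handler, every invocation would pay the setup cost, not just the cold one.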
How do you reduce AWS costs in a cloud environment? What are your go-to strategies?
Cloud cost optimization is an ongoing practice. High-impact strategies:
- Right-sizing: Use AWS Cost Explorer and Compute Optimizer to identify oversized EC2 instances.
- Reserved Instances/Savings Plans: Commit to 1-3 years for stable workloads — saves up to 72%.
- Spot Instances: Use for stateless, fault-tolerant, or batch workloads. Up to 90% savings.
- S3 Lifecycle policies: Auto-transition to cheaper storage tiers.
- Delete idle resources: Audit unused EIPs, old snapshots, unattached EBS volumes.
- Auto Scaling: Scale down to zero or minimum outside business hours.
How does IAM assume-role work and how do you implement cross-account access securely?
Cross-account access uses the sts:AssumeRole API. A role in Account B has a trust policy that allows Account A to assume it:
# Trust policy on role in Account B
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::ACCOUNT_A_ID:root"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
Account A’s entity calls aws sts assume-role to get temporary credentials (up to 12 hours) for Account B. Security controls:
- Add ExternalId condition for third-party access (prevents confused deputy attacks)
- Add MFA condition for sensitive roles
- Use SCPs at the AWS Organization level to restrict what can be assumed
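The ExternalId and MFA controls are expressed as conditions on the trust policy. The sketch below generates such a policy; the function name, the account ID 111122223333, and the external ID value are placeholders, while sts:ExternalId and aws:MultiFactorAuthPresent are the IAM condition keys used for these two checks.

```python
import json

def trust_policy(trusted_account_id, external_id=None, require_mfa=False):
    """Build a role trust policy allowing another account to call sts:AssumeRole."""
    condition = {}
    if external_id is not None:
        # Third party must present this value, mitigating the confused deputy problem.
        condition["StringEquals"] = {"sts:ExternalId": external_id}
    if require_mfa:
        condition["Bool"] = {"aws:MultiFactorAuthPresent": "true"}
    statement = {
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
        "Action": "sts:AssumeRole",
    }
    if condition:
        statement["Condition"] = condition
    return {"Version": "2012-10-17", "Statement": [statement]}

print(json.dumps(trust_policy("111122223333", external_id="example-external-id"), indent=2))
```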
How would you architect a highly available, multi-region AWS deployment?
Multi-region HA involves several layers:
- DNS: Route53 with health checks and latency/failover routing policies to direct users to the nearest healthy region.
- Data replication: RDS cross-Region read replicas with promotion capability. DynamoDB Global Tables for active-active.
- Edge: CloudFront CDN with origins in multiple regions.
- Infrastructure: Identical infrastructure in each region managed by Terraform.
- DR strategy: Define RTO (Recovery Time Objective) and RPO (Recovery Point Objective) to determine your architecture (Pilot Light, Warm Standby, or Active-Active).
Explain the Terraform resource lifecycle and meta-arguments like create_before_destroy.
The lifecycle block gives you fine-grained control over how Terraform manages resource replacement:
resource "aws_instance" "web" {
  ami           = "ami-12345"
  instance_type = "t3.medium"
  lifecycle {
    create_before_destroy = true  # New instance created before old one is destroyed
    ignore_changes        = [ami] # Ignore external AMI changes
    prevent_destroy       = true  # Block accidental deletion
  }
}
create_before_destroy is critical for zero-downtime replacements. Without it, Terraform destroys the old resource first, creating a gap in availability.
What is the purpose of terraform.tfvars files?
terraform.tfvars files provide values for your declared variables, keeping configuration separate from the variable definitions. This allows you to have different values per environment without modifying the core modules.
# variables.tf: defines the variable
variable "instance_type" {
  description = "EC2 instance type"
  type        = string
}
# production.tfvars: provides the value
instance_type = "c5.2xlarge"
# development.tfvars
instance_type = "t3.micro"
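Never commit .tfvars files containing sensitive values to Git. Use .gitignore and pass sensitive values via environment variables (TF_VAR_*) in CI/CD. Note that only terraform.tfvars and *.auto.tfvars are loaded automatically; per-environment files such as production.tfvars must be selected explicitly with -var-file=production.tfvars.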
How do you implement Terraform in a CI/CD pipeline safely?
Running Terraform in CI/CD requires careful guardrails:
- PR triggers plan: On every pull request, run terraform plan and post the output as a PR comment (using tools like Atlantis or terraform-pr-commenter).
- Merge triggers apply: Only apply after PR is merged to main. Require manual approval for production.
- State locking: Ensure DynamoDB locking is configured to prevent concurrent applies.
- OIDC credentials: Use OIDC to get short-lived tokens from AWS instead of storing long-lived access keys.
- Plan artifacts: Save the plan file and apply that exact file — never re-plan at apply time.
What are Terraform data sources and how do they differ from resources?
A resource creates, updates, or destroys infrastructure. A data source reads existing infrastructure that is managed outside of your current Terraform code — it is read-only.
# Data source: reads an existing VPC by tag, does not create it
data "aws_vpc" "main" {
  tags = {
    Environment = "production"
  }
}
# Use the data source output
resource "aws_subnet" "app" {
  vpc_id = data.aws_vpc.main.id
  ...
}
Data sources are essential for referencing shared infrastructure managed by a different team or Terraform root module.
What is Terraform state drift and how do you handle it?
State drift occurs when the real infrastructure differs from what Terraform state believes it to be — typically due to manual changes made in the AWS console or another tool.
Detection: terraform plan will show changes that seem unexpected.
Resolution options:
- Import: terraform import to import manually created resources into state.
- Refresh: terraform refresh to update state to match reality (deprecated in favor of plan -refresh-only).
- Accept drift: Use lifecycle { ignore_changes = [...] } for intentionally externally-managed attributes.
Prevention: Restrict manual changes to production by limiting console users to read-only access, enforced with IAM policies or AWS Organizations SCPs.
What does terraform plan do and why should you always review it before applying?
terraform plan creates an execution plan — a preview of what Terraform will do before it actually makes changes. It shows additions, modifications, and destructions.
Always review the plan because:
- It may show unexpected destructions (e.g., a stateful database being replaced instead of modified)
- It catches misconfiguration before real infrastructure is affected
- In a CI/CD pipeline, save the plan output and apply that exact plan in the next step to ensure consistency
terraform plan -out=tfplan
terraform apply tfplan
How do you handle database migrations in a CI/CD pipeline without downtime?
Database migrations are one of the riskiest parts of deployment. The golden rule: migrations must be backward-compatible because during a rolling deploy, old code and new code run simultaneously.
Safe migration checklist:
- Never: Rename or drop a column in the same deploy that uses the new name.
- Step 1: Add new column (nullable, backward-compatible).
- Step 2: Deploy code that writes to both old and new columns.
- Step 3: Migrate existing data.
- Step 4: Deploy code using only the new column.
- Step 5: Drop the old column.
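The first three steps of the checklist can be walked through with an in-memory SQLite database. This is only a sketch of the expand-and-backfill pattern: the users table and the full_name and display_name columns are invented for the example, and a real migration would run through your migration tool across separate deploys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT)")
conn.execute("INSERT INTO users (full_name) VALUES ('Ada Lovelace')")

# Step 1: add the new column as nullable, so old code that ignores it keeps working.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Step 2: new code writes to both the old and the new column.
def create_user(name):
    conn.execute(
        "INSERT INTO users (full_name, display_name) VALUES (?, ?)", (name, name)
    )

create_user("Grace Hopper")

# Step 3: backfill rows written before the dual-write deploy.
conn.execute("UPDATE users SET display_name = full_name WHERE display_name IS NULL")

# Steps 4 and 5 (read only display_name, then drop full_name) happen in later deploys.
rows = conn.execute("SELECT full_name, display_name FROM users ORDER BY id").fetchall()
print(rows)
```

At every point in this sequence, both the old and the new application code can read and write the table, which is what makes a rolling deploy safe.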
What is the purpose of a staging environment and what tests should run there?
Staging is a production-mirror environment used to catch bugs that only appear with real data, full infrastructure, and realistic load — things unit tests can’t surface. Tests to run in staging:
- Integration tests: Real database connections, real API calls to third parties.
- E2E tests: Cypress, Playwright, or Selenium to simulate real user journeys.
- Smoke tests: Quick sanity checks that critical paths work after deployment.
- Performance tests: Load tests with k6 or Locust to catch regressions.
How do you implement a multi-environment deployment pipeline (dev → staging → prod)?
A professional multi-environment pipeline uses gates between stages:
- Build once: A single immutable artifact (Docker image with SHA tag) is promoted — never rebuilt.
- Deploy to Dev: Automatic on every merge to main.
- Deploy to Staging: Automatic after dev health checks pass. Run integration and smoke tests.
- Deploy to Prod: Manual approval gate + scheduled deployment window.
The key is that the same image moves through all environments. This ensures what you tested in staging is exactly what runs in production.
What is a pipeline artifact and what are common examples?
A pipeline artifact is any file produced by a CI/CD job that needs to be passed to downstream jobs or stored for later use.
Common examples:
- Compiled binary or JAR file (Java/Go)
- Built Docker image pushed to a registry
- Frontend build output (dist/ or build/ folder)
- Test reports and coverage reports
- SBOM (Software Bill of Materials) files
- Terraform plan output
How do you speed up slow CI pipelines?
Slow pipelines kill developer productivity. Key optimizations:
- Caching: Cache dependencies (node_modules, pip packages, Go modules) between runs.
- Parallelism: Split test suites and run jobs in parallel.
- Test selection: Only run tests affected by the changed code.
- Optimized Docker builds: Use layer caching and BuildKit.
- Self-hosted runners: Eliminate queue time and use faster hardware.
- Fail fast: Run linting and unit tests first; integration tests only if those pass.
What is GitOps and how does it differ from traditional CI/CD?
Traditional CI/CD: The pipeline has credentials and directly pushes deployments to environments (push-based).
GitOps: Git is the single source of truth for the desired state of your infrastructure and applications. An agent running in the cluster (like ArgoCD or Flux) continuously reconciles the actual state with the desired state in Git (pull-based).
Benefits of GitOps: Drift detection, audit trail in Git history, easy rollback (git revert), no cluster credentials needed in CI.
How do you structure a mono-repo CI/CD pipeline to avoid unnecessary builds?
In a monorepo with 20+ services, you must only trigger builds for services that actually changed. Strategies:
- Path filters: GitHub Actions paths: filter to trigger workflows only when specific directories change.
- Nx / Turborepo: Task runners with build graph awareness that skip unchanged services.
- git diff: Compare changed files against the base branch and only build affected services.
# GitHub Actions path filter
on:
  push:
    paths:
      - "services/api/**"
      - "shared/lib/**"
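The git diff strategy needs a small piece of logic to turn changed paths into a build list. The sketch below assumes a services/<name>/ layout and a shared/ directory whose changes affect everything; the function name and the layout are assumptions for illustration, and the input would typically come from git diff --name-only against the base branch.

```python
def affected_services(changed_files, all_services, shared_prefixes=("shared/",)):
    """Map changed paths to the set of services that need a rebuild."""
    # A change to shared code can affect every service, so rebuild them all.
    if any(path.startswith(shared_prefixes) for path in changed_files):
        return set(all_services)
    affected = set()
    for path in changed_files:
        parts = path.split("/")
        # Expect paths shaped like services/<name>/...
        if len(parts) >= 3 and parts[0] == "services" and parts[1] in all_services:
            affected.add(parts[1])
    return affected

services = ["api", "billing", "web"]
print(affected_services(["services/api/main.py", "README.md"], services))
print(affected_services(["shared/lib/util.py"], services))
```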
How do you implement automated rollback in a deployment pipeline?
Automated rollback is triggered when post-deployment health checks fail. A robust implementation:
- Health check gate: After deployment, poll the health endpoint for 2-3 minutes.
- Metric thresholds: Monitor error rate and p99 latency for 5 minutes post-deploy.
- Rollback trigger: If error rate exceeds a threshold, automatically re-deploy the previous image tag.
# Generic shell rollback logic
# deploy, health_check_passes, and alert_pagerduty are placeholders for your own tooling
NEW_VERSION="v2.0"
PREV_VERSION="v1.9"
deploy "$NEW_VERSION"
if ! health_check_passes; then
  echo "Rollback triggered"
  deploy "$PREV_VERSION"
  alert_pagerduty "Automatic rollback executed"
fi
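The same gate, with the polling loop made explicit, might look like this in Python. The deploy and is_healthy callables are injected stand-ins (a real pipeline would call its deploy tool and poll a health endpoint or metrics backend), and the function name is hypothetical.

```python
import time

def deploy_with_rollback(deploy, is_healthy, new_version, prev_version,
                         checks=5, interval_seconds=0.0):
    """Deploy new_version; roll back to prev_version if any health check fails."""
    deploy(new_version)
    for _ in range(checks):
        if not is_healthy():
            deploy(prev_version)
            return prev_version
        time.sleep(interval_seconds)
    return new_version

# Usage with stand-in callables: the third health check fails.
history = []
results = iter([True, True, False])
live = deploy_with_rollback(history.append, lambda: next(results), "v2.0", "v1.9")
print(live, history)
```

Returning the version that ended up live makes it easy for the calling job to fail the pipeline or send an alert when a rollback happened.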
Why do you use branch protection rules in a CI/CD workflow?
Branch protection rules on the main or production branch enforce quality gates before any code is merged:
- Require pull request reviews (at least 1-2 approvals)
- Require status checks to pass (CI build, tests, linting)
- Require branches to be up to date before merging
- Prevent force pushes and branch deletion
This ensures no untested or unreviewed code ever reaches production, which is the foundation of a trustworthy deployment pipeline.
What is the difference between a Blue/Green deployment and a Canary deployment?
Blue/Green: You maintain two identical environments. “Blue” is live, “Green” has the new version. You switch all traffic from Blue to Green at once. Rollback is instant — just switch back. Downside: doubles infrastructure cost.
Canary: You gradually shift traffic from the old version to the new one — e.g., 5% → 25% → 50% → 100%. You analyze metrics and errors at each stage. Slower but safer for catching issues that only appear under real production load.
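A canary rollout is essentially a loop over traffic percentages with a metric check at each stage. The sketch below shows that control flow only; set_traffic_percent and error_rate are injected stand-ins for your load balancer or mesh API and your metrics query, and the 1% error threshold is an assumed value.

```python
def run_canary(set_traffic_percent, error_rate,
               steps=(5, 25, 50, 100), max_error_rate=0.01):
    """Shift traffic step by step; abort and send everything back on bad metrics."""
    for percent in steps:
        set_traffic_percent(percent)
        if error_rate() > max_error_rate:
            set_traffic_percent(0)  # all traffic back to the old version
            return False
    return True

# Usage with made-up error rates: the 50% stage breaches the threshold.
shifts = []
rates = iter([0.001, 0.002, 0.05])
ok = run_canary(shifts.append, lambda: next(rates))
print(ok, shifts)
```

Calling the same function with steps=(100,) approximates a Blue/Green switch: one cutover, one check, instant switch back on failure.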
How do you secure a CI/CD pipeline from supply chain attacks?
Supply chain attacks (like SolarWinds, XZ Utils) target the build pipeline itself. Defense layers:
- Pin action versions: Use commit SHA, not floating tags like @v2. Example: uses: actions/checkout@abc123
- SBOM generation: Generate a Software Bill of Materials at build time using Syft.
- Image signing: Sign images with Cosign (Sigstore). Verify signatures before deployment.
- Least privilege: GitHub Actions tokens should have minimal permissions. Set permissions: read-all by default.
- Dependency review: Use Dependabot or Renovate for automated dependency updates.
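The pinning rule is easy to check automatically. The sketch below scans workflow text for uses: references that do not end in a full 40-character commit SHA; it is a simple regex-based lint under the assumption of conventionally formatted workflow files, not a complete YAML parser, and the function name is hypothetical.

```python
import re

# A `uses:` reference counts as pinned only if it ends in a full 40-char commit SHA.
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*(\S+)", re.MULTILINE)
SHA_RE = re.compile(r"@[0-9a-f]{40}$")

def unpinned_actions(workflow_text: str) -> list:
    """Return action references that use a tag or branch instead of a commit SHA."""
    refs = USES_RE.findall(workflow_text)
    # Local actions (./path) live in the same repo and have no ref to pin.
    return [r for r in refs if not r.startswith("./") and not SHA_RE.search(r)]

workflow = """
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-node@0123456789abcdef0123456789abcdef01234567
  - name: local
    uses: ./.github/actions/build
"""
print(unpinned_actions(workflow))
```

Running a check like this on pull requests catches floating tags before they reach the main branch.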
How do you implement secret management in a GitHub Actions pipeline?
Never hardcode secrets in your pipeline files. GitHub Actions provides an encrypted Secrets store:
- Go to Repository Settings → Secrets and Variables → Actions → New Repository Secret.
- Reference in your workflow: ${{ secrets.MY_SECRET }}
- name: Deploy to AWS
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  run: aws s3 sync ./dist s3://my-bucket
For more advanced use cases, use OIDC to get short-lived tokens from AWS/GCP instead of storing static credentials.
What is the difference between Continuous Integration, Continuous Delivery, and Continuous Deployment?
Continuous Integration (CI): Developers merge code frequently (multiple times a day). Every merge triggers an automated build and test run to catch integration issues early.
Continuous Delivery (CD): Every passing build is automatically prepared for release to production. A human approves the final deployment step.
Continuous Deployment: Extends Delivery — every passing build is automatically deployed to production with no human intervention.
What is Docker Compose and when would you use it?
Docker Compose is a tool for defining and running multi-container applications using a YAML file. It is ideal for local development and testing where you need to spin up interdependent services (app + database + cache) with a single command.
docker compose up -d
It handles networking (all services in the same file can reach each other by service name), volume management, and environment variables. For production orchestration, use Kubernetes instead.
How do Docker volumes differ from bind mounts?
Docker Volumes are managed by Docker, stored in /var/lib/docker/volumes/, and are the recommended way to persist data. They are portable, easy to back up, and work well with Docker Compose.
Bind Mounts map a specific host path directly into the container. They are useful in development to sync source code in real-time but are host-dependent and harder to manage in production.
# Volume (recommended for production)
docker run -v mydata:/app/data myapp
# Bind mount (recommended for development)
docker run -v $(pwd)/src:/app/src myapp
How do you scan Docker images for vulnerabilities in a CI/CD pipeline?
Image scanning should be a mandatory gate before pushing to production. Tools and integration steps:
- Trivy (Aqua): Fast, comprehensive, easy CI integration. Example: trivy image myapp:latest
- Snyk: Deep dependency scanning with developer-friendly output.
- Docker Scout: Built into Docker Hub.
- Grype: From Anchore, works well with SBOM workflows.
# GitHub Actions example
- name: Scan image with Trivy
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: myapp:${{ github.sha }}
    severity: CRITICAL,HIGH
    exit-code: 1 # Fail the pipeline on CRITICAL or HIGH vulnerabilities
Explain Docker layer caching and how it impacts build speed.
Docker builds images layer by layer. If a layer hasn’t changed since the last build, Docker reuses the cached version. The trick is layer ordering:
Bad: COPY all files first, then run npm install. Any code change invalidates the npm install cache.
Good: COPY package.json first, run npm install, then COPY the rest of the source. Dependency installation only re-runs when package.json changes.
# Optimized layer order
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build
What is the purpose of a PodDisruptionBudget (PDB) in Kubernetes?
A PodDisruptionBudget limits how many pods of a deployment can be unavailable simultaneously during voluntary disruptions like node drains, cluster upgrades, or scaling down.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api
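Without a PDB, a cluster upgrade could drain multiple nodes simultaneously and take down your entire service. With minAvailable: 2, Kubernetes refuses any voluntary eviction that would leave fewer than 2 pods available. A PDB does not protect against involuntary disruptions such as a node crash.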
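The arithmetic behind a minAvailable budget is simple enough to sketch. This is a simplified model of the check applied to voluntary evictions, with hypothetical function names; the real controller tracks pod health and also supports percentages and maxUnavailable.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Voluntary evictions permitted right now under a minAvailable budget."""
    return max(0, healthy_pods - min_available)

def can_evict(healthy_pods: int, min_available: int) -> bool:
    """True if evicting one more pod keeps the budget satisfied."""
    return allowed_disruptions(healthy_pods, min_available) > 0

for healthy in (4, 3, 2):
    print(healthy, allowed_disruptions(healthy, min_available=2))
```

With 2 healthy pods and minAvailable: 2, a node drain blocks until a replacement pod becomes ready elsewhere.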
Explain Kubernetes network policies and how you would isolate a production namespace.
By default, all pods in a Kubernetes cluster can communicate with each other freely. NetworkPolicies are namespace-scoped firewall rules that control which pods can talk to which.
To enforce full isolation on a namespace, start by denying all ingress and egress, then selectively allow only what’s needed:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
Then add specific allow rules for your database, monitoring agents, and DNS (port 53).
How does the Kubernetes Horizontal Pod Autoscaler (HPA) work?
HPA automatically scales the number of pod replicas based on observed metrics. The default metric is CPU utilization, but it also supports memory and custom metrics via the Metrics API.
kubectl autoscale deployment my-app --min=2 --max=10 --cpu-percent=60
The HPA controller checks metrics every 15 seconds (default) and adjusts replicas to maintain the target. For custom metrics, you can integrate tools like KEDA (Kubernetes Event-Driven Autoscaling) which can scale based on Kafka lag, SQS queue depth, and more.
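The scaling decision follows the formula given in the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), clamped to the configured minimum and maximum. The sketch below implements only that core formula; the real controller also applies a tolerance band and stabilization behavior that are omitted here, and the function name is hypothetical.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas):
    """ceil(current * currentMetric / targetMetric), clamped to [min, max]."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))

# 4 pods at 90% average CPU with a 60% target (matching --cpu-percent=60 above)
print(desired_replicas(4, 90, 60, min_replicas=2, max_replicas=10))  # 6
```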
How do you manage secrets securely in Kubernetes? What are the alternatives to plain Kubernetes Secrets?
Kubernetes Secrets are base64-encoded, not encrypted by default. For production, consider these approaches:
- Encryption at Rest: Enable EncryptionConfiguration to encrypt secrets in etcd.
- External Secrets Operator: Syncs secrets from AWS Secrets Manager, GCP Secret Manager, or HashiCorp Vault into Kubernetes Secrets automatically.
- HashiCorp Vault Agent Injector: Injects secrets directly into Pod filesystems without storing them in Kubernetes at all.
- Sealed Secrets: Encrypts secrets client-side so they are safe to commit to Git.
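To see why base64 encoding offers no protection, note that decoding requires no key at all. The password below is a made-up value standing in for what a Secret's data field would hold.

```python
import base64

# What a Secret's data field would hold for a made-up password value.
encoded = base64.b64encode(b"s3cr3t-password").decode()

# Anyone who can read the Secret object can reverse it; no key is involved.
decoded = base64.b64decode(encoded).decode()
print(encoded, decoded)
```

This is why read access to Secrets must be restricted through RBAC even when encryption at rest is enabled.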
How do services in different namespaces communicate in Kubernetes?
All services in a Kubernetes cluster are reachable via DNS using the Fully Qualified Domain Name (FQDN):
<service-name>.<namespace>.svc.cluster.local
For example, a service named postgres in the production namespace is reachable at postgres.production.svc.cluster.local from any pod in any namespace. If NetworkPolicies are in place, you must explicitly allow cross-namespace traffic.
What is the difference between a StatefulSet and a Deployment?
Use a Deployment for stateless workloads (web servers, APIs) where any Pod is interchangeable. Use a StatefulSet for stateful workloads like databases that need:
- Stable, predictable network identities (pod-0, pod-1, etc.)
- Ordered, graceful deployment and scaling
- Stable persistent storage linked to each pod individually
Common examples: Kafka, ZooKeeper, Cassandra, PostgreSQL replicas.
How do you perform a zero-downtime rolling update in Kubernetes?
Kubernetes Deployments support RollingUpdate strategy by default. The key is configuring maxSurge and maxUnavailable correctly alongside working readiness probes.
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
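With maxUnavailable: 0, Kubernetes will not take down an old Pod until a new one is Ready (as determined by its readiness probe). Zero downtime then depends on the readiness probe being accurate and on the application shutting down gracefully when it receives SIGTERM.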
What is the difference between a Pod and a Deployment in Kubernetes?
A Pod is the smallest deployable unit in Kubernetes — it wraps one or more containers that share the same network and storage. However, Pods on their own are ephemeral.
A Deployment is a higher-level abstraction that manages Pods. It ensures a specified number of Pod replicas are running at all times, handles rolling updates, and allows rollbacks. You almost never create bare Pods in production; you use Deployments instead.
kubectl create deployment nginx --image=nginx:1.25 --replicas=3
Explain the role of ‘Sidecar’ containers in Kubernetes pod architecture.
A sidecar container is a secondary container that runs along with the main application container within the same pod. It is used to extend and enhance the functionality of the main container, such as by providing logging, monitoring, or proxy services.
What is a ‘StatefulSet’ and when should you use it over a ‘Deployment’ in Kubernetes?
A StatefulSet is used for stateful applications that require unique, persistent identities and stable network identifiers. Unlike Deployments, which are for stateless pods, StatefulSets manage pods that are not interchangeable and have sticky identities.
How do you implement Zero-Downtime deployments with Kubernetes Service objects?
Discuss RollingUpdate strategies, readiness probes, and the role of Service selectors in traffic routing during a rollout.
Troubleshooting Scenarios
Live system debugging, incident diagnostics, and latency resolution.
How do you troubleshoot disk space issues on a Linux server?
Systematic disk investigation:
# Step 1: Check overall disk usage
df -h
# Step 2: Find which directory is consuming space
du -sh /* 2>/dev/null | sort -rh | head -20
# Step 3: Drill down into the problem directory
du -sh /var/* | sort -rh | head -10
# Step 4: Find specific large files
find / -type f -size +500M 2>/dev/null
# Step 5: Check for deleted-but-open files still consuming disk space
lsof | grep deleted
Common causes: application logs not rotating, large core dumps, MySQL/Postgres WAL overflow, old Docker images/volumes.
How do you troubleshoot high CPU usage on a Linux server?
Systematic CPU investigation:
- top / htop: Identify the process consuming CPU. Note: is it user space or kernel (%us vs %sy)?
- ps aux --sort=-%cpu: Snapshot of top CPU consumers.
- perf top: See which kernel functions are hot.
- strace -p <PID>: Trace system calls to understand what a process is doing.
- vmstat 1: Observe context switches (cs) and interrupts (in).
Common causes: runaway application bug, CPU-intensive query (full table scan), kernel work from high I/O (softirqs), insufficient CPU for the workload.
How do you implement on-call rotation and incident response in an SRE team?
A mature on-call process has these elements:
- Schedules: PagerDuty or OpsGenie for rotating on-call assignments with escalations.
- Runbooks: Every alert links to a runbook with investigation steps and common resolutions.
- Severity levels: P1 (major outage, wake anyone up) → P4 (low impact, business hours only).
- Incident channels: Dedicated Slack channel per incident. Assign Incident Commander, Communications Lead roles.
- Postmortems: Blameless postmortem for every P1/P2. Focus on system improvements, not blaming individuals.
- On-call health: Track toil. If engineers are getting paged more than 2-3 times per shift, the alert quality needs improvement.
What is an error budget and how do SRE teams use it?
An error budget is the allowable amount of unreliability in a service, derived from the SLO. If your SLO is 99.9% uptime, your error budget is 0.1% — about 43 minutes of downtime per month.
How teams use it:
- When error budget is healthy → deploy freely, take risks, ship features.
- When error budget is low → slow down deployments, prioritize reliability work.
- When budget is exhausted → freeze all non-critical deployments until reliability improves.
Error budgets create a shared language between product (wants to ship) and SRE (wants reliability). It’s objective, not political.
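The 43-minute figure comes straight from the SLO: 0.1% of a 30-day month (43,200 minutes) is 43.2 minutes. A small sketch of that arithmetic, with hypothetical function names and a 30-day window assumed:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes

def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the budget still unspent (negative once it is exhausted)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget

print(round(error_budget_minutes(0.999), 1))    # 43.2
print(round(budget_remaining(0.999, 21.6), 2))  # 0.5
```

A team could wire budget_remaining into its deploy policy, for example slowing releases once less than a chosen fraction of the budget is left.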
What are dangling Docker images and how do you clean them up?
Dangling images are layers that have no associated tag — they appear as <none>:<none> in docker images. They accumulate over time from rebuilds and waste disk space.
# List dangling images
docker images -f dangling=true
# Remove all dangling images
docker image prune
# Nuclear option — remove all unused images, containers, networks, volumes
docker system prune -a --volumes
In CI/CD pipelines, always run docker system prune -f as a post-step to keep agents clean.
How do you troubleshoot high memory usage causing OOMKilled events in production?
When a container exceeds its memory limit, the kernel OOM killer terminates it and Kubernetes logs OOMKilled. Steps to resolve:
- Identify: kubectl describe pod <pod> and look for Reason: OOMKilled in Last State.
- Profile: Use kubectl top pod or Prometheus/Grafana to understand actual memory usage patterns.
- Fix: Either increase limits if the app genuinely needs more memory, or find and fix the memory leak in the application code.
- Prevent: Set up PrometheusRule or Datadog alerts to notify before a pod hits its limit.
What are resource requests and limits in Kubernetes, and why are they important?
Requests tell the Kubernetes scheduler how much CPU/memory to reserve for a pod when scheduling it onto a node. Limits are the hard caps — the container is throttled (CPU) or killed (memory) if it exceeds them.
resources:
  requests:
    memory: "128Mi"
    cpu: "250m"
  limits:
    memory: "256Mi"
    cpu: "500m"
Always set both. Without requests, the scheduler cannot make good placement decisions. Without limits, a runaway container can starve other workloads on the same node (the “noisy neighbor” problem).
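The quantity strings in the manifest above are easier to reason about once converted to plain numbers: 250m is 250 millicores (a quarter of a core) and 128Mi is 128 × 1024² bytes. The helpers below are a sketch that handles only the suffixes used in this example (m for CPU; Ki, Mi, Gi for memory), not the full Kubernetes quantity grammar, and their names are hypothetical.

```python
def cpu_to_millicores(value: str) -> int:
    """'250m' -> 250, '0.5' -> 500, '2' -> 2000."""
    if value.endswith("m"):
        return int(value[:-1])
    return int(round(float(value) * 1000))

_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}

def memory_to_bytes(value: str) -> int:
    """'128Mi' -> 134217728; plain integers are already bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)

requests = {"memory": "128Mi", "cpu": "250m"}
limits = {"memory": "256Mi", "cpu": "500m"}
ratio = memory_to_bytes(limits["memory"]) / memory_to_bytes(requests["memory"])
print(cpu_to_millicores(requests["cpu"]), memory_to_bytes(requests["memory"]), ratio)
```

Comparing limits to requests this way is a quick check on how far a container may burst beyond what the scheduler reserved for it.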
How do you debug a pod stuck in CrashLoopBackOff?
CrashLoopBackOff means the container starts but repeatedly crashes. Use this systematic approach:
- Check logs: kubectl logs <pod> --previous to see the crash output.
- Describe the pod: kubectl describe pod <pod> to inspect Events, resource limits, and probe failures.
- Check OOM: If you see OOMKilled, the container exceeded its memory limit.
- Shell override: Override the entrypoint to keep the container alive for inspection:
command: ["sleep", "3600"]