All DevOps Interview Questions

Browse our comprehensive question bank. Updated regularly with real interview scenarios.

Beginner Questions

Core concepts, syntax, and foundational command-line knowledge.

Easy Associate Level System Design

What is the difference between authentication and authorization?

Authentication (AuthN): Verifying the identity of a user or service. “Who are you?” Authentication happens first — you prove your identity with a password, token, certificate, or biometric.

Authorization (AuthZ): Determining what an authenticated identity is allowed to do. “What can you do?” Authorization happens after authentication — once we know who you are, we check your permissions.

Example in AWS: You authenticate to AWS with your access key (AuthN). Then AWS checks your IAM policies to see if you’re authorized to call s3:PutObject (AuthZ). Both can fail independently.
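The two checks can be sketched as separate steps. This is a minimal illustration only: the in-memory key and permission tables, the function names, and the HTTP-style status codes are assumptions, not how AWS implements IAM.

```python
from typing import Optional

# Hypothetical stores; a real system would use an identity provider and a policy engine.
API_KEYS = {"key-alice": "alice", "key-bob": "bob"}
PERMISSIONS = {
    "alice": {"s3:PutObject", "s3:GetObject"},
    "bob": {"s3:GetObject"},
}


def authenticate(api_key: str) -> Optional[str]:
    """AuthN: map a credential to an identity, or None if it is unknown."""
    return API_KEYS.get(api_key)


def authorize(identity: str, action: str) -> bool:
    """AuthZ: check whether an authenticated identity may perform the action."""
    return action in PERMISSIONS.get(identity, set())


def handle_request(api_key: str, action: str) -> int:
    """Return 401 when AuthN fails, 403 when AuthZ fails, 200 otherwise."""
    identity = authenticate(api_key)
    if identity is None:
        return 401
    if not authorize(identity, action):
        return 403
    return 200
```

The two failure modes stay distinct: an unknown key never reaches the permission check, and a valid key can still be refused for a specific action.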

Easy Associate Level System Design

What is multi-factor authentication (MFA) and why should it be enforced for cloud accounts?

MFA requires two or more verification factors drawn from different categories: something you know (password), something you have (TOTP app, hardware key), or something you are (biometric). Even if a password is compromised, an attacker still needs the second factor to get in.

For AWS/cloud accounts:

Enforce MFA on the root account immediately and don’t use it routinely
Require MFA for IAM users with an IAM policy condition on aws:MultiFactorAuthPresent
Use hardware MFA keys (YubiKey) for privileged accounts
Enable AWS Organizations SCPs to deny API calls unless MFA is present
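To make the "something you have" factor concrete, here is a minimal sketch of how a TOTP app derives its codes (RFC 6238 with HMAC-SHA1). The function name and defaults are illustrative, and the secret used below is the RFC test secret, not a real credential.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def totp(secret: bytes, at: Optional[float] = None, step: int = 30, digits: int = 6) -> str:
    """Derive a time-based one-time password from a shared secret."""
    now = time.time() if at is None else at
    counter = int(now // step)  # number of 30-second steps since the epoch
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)
```

The server runs the same calculation with its copy of the secret and compares the result, usually allowing one step of clock drift in either direction.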

Easy Associate Level System Design

What is TLS/SSL and why is it important for DevOps engineers to understand it?

TLS (Transport Layer Security) encrypts communication between clients and servers, preventing eavesdropping and man-in-the-middle attacks. It replaced the deprecated SSL protocol.

DevOps engineers encounter TLS in:

Configuring HTTPS for web services (Let’s Encrypt, ACM in AWS)
Kubernetes Ingress TLS termination
mTLS between microservices (Istio, Linkerd)
Certificate rotation — expired certs cause outages
Internal PKI for service-to-service auth

Automate certificate renewal with cert-manager in Kubernetes or AWS Certificate Manager. Never rely on manual renewal, and alert on certificates that are close to expiry.

Easy Associate Level Linux

What is the purpose of /etc/hosts and how does DNS resolution work in Linux?

DNS resolution order in Linux (set by the hosts line in /etc/nsswitch.conf, typically hosts: files dns):

/etc/hosts: Local overrides. Checked first under the default order. Maps hostnames to IPs without a DNS lookup.
DNS servers (/etc/resolv.conf): The configured nameservers are queried on port 53, over UDP by default with TCP as a fallback for large responses.

Common use cases for /etc/hosts: local development overrides, blocking domains by pointing to 127.0.0.1, testing service connectivity using a service name before DNS is configured. In Kubernetes, the kubelet manages each pod’s /etc/hosts, while cluster DNS (CoreDNS) is reached through the nameserver in the pod’s /etc/resolv.conf.
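The lookup order can be sketched as follows, assuming the common hosts: files dns setting. The function names are hypothetical, and dns_lookup stands in for a real resolver call.

```python
from typing import Callable, Dict


def parse_hosts(text: str) -> Dict[str, str]:
    """Parse /etc/hosts-style text into a hostname -> IP mapping."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping.setdefault(name, ip)  # keep the first entry for a name
    return mapping


def resolve(name: str, hosts_text: str, dns_lookup: Callable[[str], str]) -> str:
    """Check the hosts file first, then fall back to DNS."""
    hosts = parse_hosts(hosts_text)
    if name in hosts:
        return hosts[name]
    return dns_lookup(name)
```

With this order, an entry in the hosts file shadows whatever DNS would return for the same name, which is why stale overrides are a common source of confusing connectivity bugs.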

Easy Associate Level Linux

What is the difference between processes and threads in Linux?

A process is an independent program in execution with its own memory space, file descriptors, and system resources. Creating a new process (fork()) is more expensive than creating a thread, although copy-on-write keeps the cost of fork() itself modest.

A thread is a unit of execution within a process. Threads within the same process share the same memory space and open file descriptors, making communication between them fast. Thread creation is lighter than process creation.

In Linux, threads are implemented as “lightweight processes” and managed with the clone() system call. Tools like htop can show threads per process.

Easy Associate Level Linux

What is the difference between a hard link and a symbolic (soft) link in Linux?

Hard Link: A directory entry that points directly to the same inode as the original file. The original name and the hard link are indistinguishable; deleting one doesn’t affect the other, and the data is freed only when the last link is removed. Hard links cannot span filesystems and generally cannot point to directories.

Symbolic (Soft) Link: A pointer to another file’s path. If the original is deleted, the symlink becomes a broken “dangling” link. Symlinks can cross filesystems and point to directories.

# Hard link
ln original.txt hardlink.txt

# Symbolic link
ln -s /etc/nginx/sites-available/mysite /etc/nginx/sites-enabled/mysite
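The behaviour after deleting the original can be checked with a short script. This is a sketch that works in a temporary directory; the file names are arbitrary.

```python
import os
import tempfile


def link_demo() -> dict:
    """Create a hard link and a symlink, delete the original, report what survives."""
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "original.txt")
        hard = os.path.join(d, "hardlink.txt")
        soft = os.path.join(d, "symlink.txt")
        with open(original, "w") as f:
            f.write("hello")
        os.link(original, hard)      # same as: ln original.txt hardlink.txt
        os.symlink(original, soft)   # same as: ln -s original.txt symlink.txt
        same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
        link_count = os.stat(original).st_nlink
        os.remove(original)
        with open(hard) as f:
            hard_content = f.read()
        return {
            "same_inode": same_inode,
            "link_count": link_count,
            "hard_link_still_readable": hard_content == "hello",
            # islink() inspects the link itself; exists() follows it to the target
            "symlink_dangling": os.path.islink(soft) and not os.path.exists(soft),
        }
```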

Easy Associate Level Observability

What is the difference between monitoring and observability?

Monitoring is about tracking known failure modes. You define metrics and alerts for things you know can go wrong. It answers: “Is this thing I’m watching broken?”

Observability is about understanding system behavior from its outputs. It allows you to answer questions you didn’t think to ask beforehand — debugging novel failures you’ve never seen before.

Monitoring tells you something is wrong. Observability tells you why. You need both, but as systems grow more complex, observability becomes more critical for understanding emergent failures.

Easy Associate Level Observability

What are the three pillars of observability?

The three pillars of observability are:

Metrics: Numerical measurements aggregated over time (CPU usage, request rate, error rate). Good for dashboards and alerting on trends.
Logs: Timestamped records of discrete events. Good for debugging specific incidents and understanding what happened.
Traces: Records of a request’s journey through a distributed system. Essential for finding bottlenecks and understanding service dependencies in microservices.

Together they answer: Is something wrong? (metrics), What is wrong? (logs), Where and why is it wrong? (traces).

Easy Associate Level AWS

What is the AWS Shared Responsibility Model?

AWS and customers share security responsibilities — the line depends on the service type:

AWS is responsible for: Security “of” the cloud — physical data centers, hypervisors, networking hardware, managed service infrastructure.

You are responsible for: Security “in” the cloud — your operating systems, your application code, IAM configurations, data encryption, network configuration (VPC, security groups), and patching guest OS on EC2.

For managed services like RDS or Lambda, AWS takes on more responsibility (OS patching), but you still own IAM, data, and network controls.

Easy Associate Level AWS

What is the difference between S3 Standard, S3 Infrequent Access, and S3 Glacier?

AWS S3 offers storage classes with different cost/access tradeoffs:

Standard: High durability, low latency, high throughput. For frequently accessed data.
Standard-IA (Infrequent Access): Same latency as Standard but cheaper storage cost. Higher per-retrieval cost. Use for data accessed less than once a month.
Glacier Instant Retrieval: For archive data accessed a few times per year. Millisecond retrieval.
Glacier Deep Archive: Lowest cost. Standard retrieval completes within 12 hours (bulk retrieval within 48 hours). Use for compliance/regulatory long-term retention.

Use S3 Lifecycle Policies to automatically transition objects between classes based on age.

Easy Associate Level AWS

What is the difference between IAM users, groups, roles, and policies in AWS?

Users: Individual identities for people or applications with long-term credentials (access key + secret).

Groups: Collections of users that share the same permissions. Manage permissions at group level, not individually.

Roles: Identities assumed temporarily by AWS services (EC2, Lambda), federated users, or cross-account access. No long-term credentials — they use short-lived tokens. This is the preferred approach.

Policies: JSON documents that define permissions. Attached to users, groups, or roles.

Best practice: Always use roles over users for AWS service authentication.

Easy Associate Level Terraform

What is Infrastructure as Code (IaC) and what are its main benefits?

Infrastructure as Code means managing and provisioning infrastructure through machine-readable configuration files instead of manual processes.

Key benefits:

Reproducibility: Spin up identical environments on demand.
Version control: Track all infrastructure changes in Git. Know who changed what and when.
Auditability: Compliance teams can review what infrastructure is being provisioned.
Self-documentation: The code is the documentation.
Disaster recovery: Re-create an entire environment from scratch in minutes.

Easy Associate Level Docker

What is the difference between Docker COPY and ADD instructions?

Both copy files into the image, but ADD has extra functionality that makes it unpredictable:

ADD can fetch files from a URL
ADD auto-extracts local tar archives into the destination (archives fetched from a URL are not extracted)

Best practice: Always use COPY unless you specifically need the URL or auto-extraction features. COPY is explicit and predictable, which is better for reproducible builds.

Easy Associate Level Docker

What is the purpose of ENTRYPOINT vs CMD in a Dockerfile?

CMD provides the default command or, when combined with ENTRYPOINT, the default arguments for the container. It is overridden by any arguments passed to docker run after the image name.

ENTRYPOINT defines the fixed command that always runs. It can only be replaced with the --entrypoint flag.

Best practice: Use ENTRYPOINT for the executable and CMD for default arguments, making the container behave like a command-line tool:

ENTRYPOINT ["python", "app.py"]
CMD ["--port", "8080"]
# docker run myapp --port 9090  ← overrides CMD only
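How the two instructions combine can be modelled in a few lines. This is a simplified sketch of the exec-form behaviour only; it ignores shell form and the --entrypoint flag, and the function name is made up for illustration.

```python
from typing import List, Optional


def effective_command(
    entrypoint: List[str],
    cmd: List[str],
    run_args: Optional[List[str]] = None,
) -> List[str]:
    """Arguments after the image name in `docker run` replace CMD; ENTRYPOINT stays."""
    args = run_args if run_args else cmd
    return entrypoint + args
```

For the Dockerfile above, effective_command(["python", "app.py"], ["--port", "8080"], ["--port", "9090"]) yields the same argv that docker run myapp --port 9090 would start.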

Easy Associate Level Docker

What is the difference between a Docker image and a Docker container?

A Docker image is a read-only template built from a Dockerfile. Think of it as a class definition. A container is a running instance of that image — a class instantiation. You can run many containers from the same image, each isolated from the others.

# Build an image
docker build -t my-app:1.0 .

# Run a container from that image
docker run -d -p 8080:80 my-app:1.0

Easy Associate Level Kubernetes

What is a ConfigMap and when would you use it over an environment variable?

A ConfigMap stores non-sensitive configuration data as key-value pairs. It decouples your configuration from your container image.

Use ConfigMaps over hardcoded env vars when:

Config needs to differ between environments (dev/staging/prod)
Multiple pods share the same configuration
You need to mount config as a file (e.g., nginx.conf, prometheus.yml)

What is the difference between metrics, logs, and traces in observability?

Observability is the ability to understand the internal state of a system by examining its external outputs. The three pillars of observability are metrics, logs, and traces.

Metrics

Metrics are numerical measurements collected over time. They represent the current state or behavior of a system in an aggregated form. Examples: CPU usage percentage, request count per second, error rate, memory usage, p99 latency.

Metrics are best for: Dashboards and alerting on system health. Detecting anomalies and trends over time. Capacity planning. Tools: Prometheus, Datadog, CloudWatch, Grafana (visualization).

Logs

Logs are timestamped records of discrete events that occurred in a system. They provide detailed context about what happened and when. Examples: Application error messages, HTTP access logs, audit trails, debug output.

Logs are best for: Debugging specific errors or incidents. Audit trails for compliance. Understanding event sequences. Tools: ELK Stack (Elasticsearch, Logstash, Kibana), Loki, Splunk, CloudWatch Logs.

Traces

Traces follow a single request as it flows through distributed services, capturing the path and timing of each operation. A trace consists of spans – individual units of work with start time and duration. Trace IDs link all spans of a single request across services.

Traces are best for: Identifying bottlenecks in distributed systems. Understanding service dependencies. Debugging latency issues across microservices. Tools: Jaeger, Zipkin, AWS X-Ray, OpenTelemetry.

Using All Three Together

When an alert fires on a metric (e.g., high error rate), you look at logs to find the specific error messages, then use traces to see which service call failed and where the latency spike originated. OpenTelemetry is the open standard for collecting all three signal types across different languages and platforms.

Medium Senior Level

How does the Linux file permission system work and what is the chmod command?

Linux file permissions control who can read, write, and execute files. Permissions operate at three levels: owner, group, and others.

Permission Types

Read (r = 4): View file contents or list directory.
Write (w = 2): Modify file or add/delete files in directory.
Execute (x = 1): Run file as program or enter a directory.

Reading Permissions

ls -la displays: -rwxrw-r-- where the first character is file type (- file, d directory, l symlink), then three groups of rwx for owner, group, and others.

chmod Command

Symbolic mode: chmod u+x file adds execute for owner; chmod g-w file removes group write; chmod o=r file sets others to read-only.

Octal mode: chmod 755 file sets owner=rwx(7), group=rx(5), others=rx(5). Common permissions: 644 for regular files (owner rw, others r), 755 for executables and directories, 600 for private keys.

chown and chgrp

chown user:group file: Change owner and group.
chown -R user:group dir/: Recursively change ownership.
chgrp devs file: Change group only.

Special Permissions

SetUID (s on owner execute): File runs with owner’s permissions (e.g., passwd command).
SetGID (s on group execute): File runs with group permissions; on directories, new files inherit group.
Sticky bit (t): Only owner or root can delete files in directory; used on /tmp to prevent users deleting each other’s files.
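The mapping between octal and symbolic notation, including the special bits, can be checked with Python’s standard stat module. The to_octal helper is a hypothetical name for this sketch and only handles the nine basic rwx characters.

```python
import stat


def to_symbolic(mode: int) -> str:
    """Render a mode the way ls -l does, e.g. 0o100754 -> '-rwxr-xr--'."""
    return stat.filemode(mode)


def to_octal(perms: str) -> str:
    """Convert nine rwx characters to octal, e.g. 'rwxrw-r--' -> '764'."""
    values = {"r": 4, "w": 2, "x": 1, "-": 0}
    digits = [sum(values[c] for c in perms[i:i + 3]) for i in (0, 3, 6)]
    return "".join(str(d) for d in digits)


regular_754 = to_symbolic(stat.S_IFREG | 0o754)    # regular file, mode 754
tmp_like_dir = to_symbolic(stat.S_IFDIR | 0o1777)  # directory with sticky bit
setuid_file = to_symbolic(stat.S_IFREG | 0o4755)   # regular file with SetUID
```

The sticky bit shows up as t in the last position and SetUID as s in the owner execute position, matching what ls -l prints for /tmp and /usr/bin/passwd.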

Medium Senior Level Linux

How does Linux process management work and what are the key commands for managing processes?

Linux process management involves controlling the lifecycle of programs running on a system. Every process has a unique PID (Process ID) and runs with specific user permissions.

Process States

Running (R): Process is running on a CPU or is runnable and waiting for one.
Sleeping (S): Waiting for an event (interruptible).
Uninterruptible sleep (D): Waiting for I/O; signals, including SIGKILL, are not acted on until the wait ends.
Stopped (T): Suspended via SIGSTOP or Ctrl+Z.
Zombie (Z): Process has exited but its parent hasn’t read its exit status.

Key Process Management Commands

ps aux: List all running processes with CPU/memory usage.
top or htop: Real-time process monitoring.
pgrep nginx: Find process IDs by name.
kill PID: Send SIGTERM (graceful shutdown).
kill -9 PID: Force kill with SIGKILL.
killall nginx: Kill all processes by name.
nice -n 10 command: Start process with lower priority.
renice -n 5 -p PID: Change priority of running process.
nohup command &: Run command immune to hangups in background.
jobs / fg / bg: Manage background jobs.

Process Signals

SIGTERM (15): Request graceful termination.
SIGKILL (9): Force kill; cannot be caught or ignored.
SIGHUP (1): Hangup; by convention many daemons reload their configuration when they receive it.
SIGINT (2): Interrupt from keyboard (Ctrl+C).
SIGSTOP (19): Pause process; cannot be caught or ignored.
SIGCONT (18): Resume paused process.
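A parent process can see which signal ended a child. The sketch below starts a throwaway child that just sleeps, sends it a signal, and reads the result; the helper name is hypothetical and it assumes a Linux or other POSIX host.

```python
import signal
import subprocess
import sys


def exit_signal(sig: int) -> int:
    """Start a sleeping child, send it a signal, and return the signal that ended it."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    child.send_signal(sig)
    child.wait(timeout=10)
    # On POSIX, a negative returncode -N means the child was terminated by signal N.
    return -child.returncode
```

The child here installs no handlers, so SIGTERM falls through to its default action. A real service would normally catch SIGTERM to finish in-flight work, which is exactly what SIGKILL does not allow.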

systemd Service Management

systemctl start/stop/restart/status service: Manage services.
systemctl enable/disable service: Control boot behavior.
journalctl -u service -f: Stream service logs in real time.
systemctl list-units --type=service: List all active services.

Medium Senior Level Terraform

What is the difference between count and for_each in Terraform?

Both count and for_each are Terraform meta-arguments that allow you to create multiple instances of a resource from a single resource block, but they work differently and have important trade-offs.

count

count creates resources by index (0, 1, 2, …). Resources are addressed as resource_type.name[index].

resource "aws_instance" "web" {
  count         = 3
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t2.micro"

  tags = {
    Name = "web-${count.index}"
  }
}

# Reference: aws_instance.web[0], aws_instance.web[1], aws_instance.web[2]

count Problem: Index-based destruction

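If resources are driven by a list (for example count = length(var.names)) and you remove the element at index 1 from a list of 3, every later element shifts down. Terraform then plans to change or replace index 1 so it matches the former index 2, and destroys index 2. This causes unnecessary resource churn.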

for_each

for_each creates resources from a map or set of strings. Resources are addressed by their key.

# Using a set
resource "aws_iam_user" "users" {
  for_each = toset(["alice", "bob", "carol"])
  name     = each.key
}

# Using a map
resource "aws_instance" "web" {
  for_each = {
    prod = "t3.medium"
    dev  = "t3.small"
  }
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = each.value

  tags = {
    Name = "web-${each.key}"
  }
}

# Reference: aws_instance.web["prod"], aws_instance.web["dev"]

for_each Advantage: Key-based stability

Deleting “bob” from the set only destroys bob’s resource. Alice and Carol are unaffected.
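The difference in churn can be illustrated with a toy model of resource addresses. This is not Terraform’s planner, only a sketch that compares which addresses would point at a different user after "bob" is removed from the alice/bob/carol example; the function names are made up.

```python
from typing import Dict, List


def addresses_by_index(names: List[str]) -> Dict[str, str]:
    """count-style addressing: position in the list is the identity."""
    return {f"aws_iam_user.users[{i}]": name for i, name in enumerate(names)}


def addresses_by_key(names: List[str]) -> Dict[str, str]:
    """for_each-style addressing: the key itself is the identity."""
    return {f'aws_iam_user.users["{name}"]': name for name in names}


def changed_addresses(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """Addresses whose content differs between two plans (changed or removed)."""
    every = before.keys() | after.keys()
    return sorted(addr for addr in every if before.get(addr) != after.get(addr))


old, new = ["alice", "bob", "carol"], ["alice", "carol"]
count_churn = changed_addresses(addresses_by_index(old), addresses_by_index(new))
for_each_churn = changed_addresses(addresses_by_key(old), addresses_by_key(new))
```

With index addressing two addresses are touched (index 1 now holds carol, index 2 disappears); with key addressing only bob’s address is.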

Comparison

Feature	count	for_each
Addressing	By index [0], [1], [2]	By key ["name"]
Input type	Number	Map or set of strings
Deletion behavior	Index shift causes recreation	Key-based, stable
Unknown values	Count must be known at plan time	Keys must be known at plan time; map values may be unknown
Best for	Identical resources	Resources with different configs

When to Use Each

Use count when:

Creating identical resources (e.g., N identical EC2 instances)
The total count is determined by a simple number
Resources don’t need stable identity

Use for_each when:

Creating resources with different configurations
Working with a set or map of named resources (convert lists with toset())
Stability of individual resources is important (prevents unnecessary destruction)

Medium Senior Level Kubernetes

What is the difference between a PersistentVolume and a PersistentVolumeClaim in Kubernetes?

PersistentVolumes (PV) and PersistentVolumeClaims (PVC) are Kubernetes abstractions for managing storage.

PersistentVolume (PV)

A PersistentVolume is a storage resource in the cluster provisioned by an administrator or dynamically created via StorageClass. It exists independently of any pod lifecycle.

Key properties:

Capacity: Size of the storage
Access Modes: ReadWriteOnce, ReadOnlyMany, ReadWriteMany, ReadWriteOncePod
Reclaim Policy: Retain or Delete (Recycle is deprecated)
StorageClass: Defines the provisioner

PersistentVolumeClaim (PVC)

A PVC is a request for storage by a user/application. It consumes PV resources similar to how pods consume node resources.

Example PVC:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard

Binding Process

Admin creates PV (or StorageClass enables dynamic provisioning)
Developer creates PVC with storage requirements
Kubernetes binds PVC to a matching PV
Pod references the PVC as a volume

Dynamic Provisioning

With StorageClass, PVs are created automatically when a PVC is submitted, eliminating manual PV creation. This is the preferred approach in cloud environments (EBS, GCP PD, Azure Disk).

Medium Senior Level Kubernetes

What are Taints and Tolerations in Kubernetes and how do they control pod scheduling?

Taints and Tolerations are Kubernetes mechanisms that control which pods can be scheduled on which nodes.

What are Taints?

A taint is applied to a node and repels pods that do not have a matching toleration. Taints have three effects:

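NoSchedule: New pods without a matching toleration are not scheduled on the node; pods already running there stay
PreferNoSchedule: Kubernetes tries to avoid scheduling non-tolerating pods on the node, but does not guarantee it
NoExecute: New pods are not scheduled, and running pods that do not tolerate the taint are evicted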

Example taint command:

kubectl taint nodes node1 key=value:NoSchedule

What are Tolerations?

Tolerations are applied to pods and allow the scheduler to place pods on nodes with matching taints.

Example toleration in a pod spec:

tolerations:
- key: "key"
  operator: "Equal"
  value: "value"
  effect: "NoSchedule"

Common Use Cases

Dedicated nodes: Taint GPU nodes so only GPU workloads run on them
Node maintenance: Taint nodes before draining to prevent new pod scheduling
Special hardware: Reserve nodes with SSDs or high memory for specific workloads
Multi-tenancy: Isolate team workloads on specific nodes

Key Difference from Node Affinity

Node Affinity attracts pods to nodes, while Taints repel pods from nodes. They complement each other for fine-grained scheduling control.
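The matching rule between a taint and a toleration can be sketched as below. It is a simplified model, not the scheduler’s code: it covers the Equal and Exists operators and treats an empty toleration effect as matching every effect; the class and function names are illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str  # "NoSchedule", "PreferNoSchedule", or "NoExecute"


@dataclass(frozen=True)
class Toleration:
    key: str = ""             # empty key with Exists matches every key
    operator: str = "Equal"   # "Equal" or "Exists"
    value: str = ""
    effect: str = ""          # empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.operator == "Exists":
        return tol.key == "" or tol.key == taint.key
    return tol.key == taint.key and tol.value == taint.value


def schedulable(taints: List[Taint], tolerations: List[Toleration]) -> bool:
    """A pod fits if every hard taint (NoSchedule/NoExecute) is tolerated."""
    hard = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)
```

PreferNoSchedule is left out of the hard check on purpose, since it only lowers a node’s preference rather than excluding it.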

Medium Senior Level System Design

What is a WAF and when should you use AWS WAF vs Cloudflare?

A Web Application Firewall (WAF) filters and monitors HTTP traffic to protect against common attacks: SQL injection, XSS, application-layer (Layer 7) DDoS, bad bots.

AWS WAF: Tight integration with CloudFront, ALB, API Gateway. Managed rule groups for OWASP, AWS managed rules. Good if you’re AWS-native. Can use IP reputation lists and rate-limiting rules.

Cloudflare: Operates at the DNS/edge level before traffic reaches AWS. Better DDoS mitigation due to Cloudflare’s massive global network. Simpler setup. Bot management is more mature.

In practice: Use Cloudflare as the outer layer for DDoS and global edge, then AWS WAF at the ALB for application-layer filtering. Defense in depth.

Medium Senior Level System Design

What is a CVE, and how do you track and remediate vulnerabilities in your infrastructure?

A CVE (Common Vulnerabilities and Exposures) is a public identifier for a known security vulnerability. Each CVE has a severity score (CVSS 0-10).

Tracking and remediation workflow:

Discovery: Continuous scanning with Trivy/Snyk in CI for container images, Dependabot for code dependencies, and Amazon Inspector for EC2.
Triage: Not all CVEs require immediate action. Prioritize by CVSS score, exploitability, and whether the vulnerable code path is actually used.
Remediation: Update base image, update dependency, or apply vendor patch.
Tracking: Log CVEs in your ticketing system with SLA (e.g., Critical = 24h, High = 7 days).
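The triage step can be partly automated. The sketch below maps a CVSS v3 base score to its severity rating and then to a remediation SLA. The Critical and High deadlines come from the example above; the Medium and Low values are placeholders a team would set for itself.

```python
from datetime import timedelta
from typing import Optional


def severity(cvss: float) -> str:
    """Map a CVSS v3 base score to its qualitative severity rating."""
    if not 0.0 <= cvss <= 10.0:
        raise ValueError("CVSS scores range from 0.0 to 10.0")
    if cvss >= 9.0:
        return "Critical"
    if cvss >= 7.0:
        return "High"
    if cvss >= 4.0:
        return "Medium"
    if cvss > 0.0:
        return "Low"
    return "None"


# Critical and High follow the SLA example in the text; Medium and Low are assumed values.
REMEDIATION_SLA = {
    "Critical": timedelta(hours=24),
    "High": timedelta(days=7),
    "Medium": timedelta(days=30),
    "Low": timedelta(days=90),
}


def sla_for(cvss: float) -> Optional[timedelta]:
    """Return the remediation deadline for a score, or None if no action is needed."""
    return REMEDIATION_SLA.get(severity(cvss))
```

The score alone is only a starting point; exploitability and whether the vulnerable code path is reachable should still adjust the final priority.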

Medium Senior Level Linux

Explain file permissions in Linux (rwx, octal notation) and when to use sticky bit/setuid.

Linux file permissions have three sets: owner, group, others. Each can have: read (4), write (2), execute (1).

-rwxr-xr-- = 754
# Owner: rwx (7), Group: r-x (5), Others: r-- (4)

chmod 755 script.sh   # Standard executable
chmod 644 config.yml  # Standard config file

Special bits:

Sticky bit (1xxx): On directories (e.g., /tmp), only the file owner can delete their own files: chmod +t /shared
Setuid (4xxx): File executes with the owner’s permissions (used by /usr/bin/passwd to write /etc/shadow as root). Use with extreme caution.

Medium Senior Level Observability

How do you write effective Prometheus alerting rules?

Effective Prometheus alerts are actionable, tolerate brief spikes, and tell the responder what to do next. An example rule:

groups:
- name: api-alerts
  rules:
  - alert: HighErrorRate
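    # Aggregate both sides by service; dividing the raw series would match each
    # 5xx series against itself and always produce a ratio of 1.
    expr: |
      sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
      / sum by (service) (rate(http_requests_total[5m])) > 0.05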
    for: 5m  # Must be true for 5 minutes before firing
    labels:
      severity: critical
    annotations:
      summary: "High error rate on {{ $labels.service }}"
      description: "Error rate is {{ $value | humanizePercentage }}"
      runbook: "https://wiki.internal/runbooks/high-error-rate"

Key practices: Use for to avoid alerting on momentary spikes. Always include a runbook link. Use human-readable messages with $labels and $value.

Medium Senior Level Observability

What is Prometheus and how does its pull-based model differ from push-based monitoring?

Prometheus is an open-source metrics monitoring system with a time-series database.

Pull-based (Prometheus): Prometheus actively scrapes metrics from targets at regular intervals. Targets expose a /metrics HTTP endpoint. Benefits: Prometheus controls the scraping schedule, easy to detect if a target is down, no credentials needed on the target side.

Push-based (StatsD, Graphite): Applications push metrics to a central collector. Better for short-lived jobs (like batch scripts) that may end before Prometheus scrapes them. Use Prometheus Pushgateway for these use cases.

Medium Senior Level Observability

What is an SLO, SLA, and SLI, and how do they relate to each other?

SLI (Service Level Indicator): An actual measurement of service behavior. Example: the percentage of successful HTTP requests.

SLO (Service Level Objective): The target for your SLI. Example: 99.9% of requests succeed over a rolling 30-day window.

SLA (Service Level Agreement): A contractual commitment, usually set looser than the internal SLO, with defined consequences for missing it. Example: If availability drops below 99.9%, AWS credits customers.

In practice: define SLIs → set SLO targets → the SLA is what you promise externally. Your internal error budget is 100% - SLO.
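The error budget arithmetic is simple enough to sketch. The function names and the request-based budget below are illustrative assumptions, not a standard API.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime the SLO allows over the window."""
    return (100.0 - slo_percent) / 100.0 * window_days * 24 * 60


def budget_remaining(slo_percent: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    allowed_failures = (100.0 - slo_percent) / 100.0 * total_requests
    return 1.0 - failed_requests / allowed_failures
```

For a 99.9% SLO over 30 days the budget is about 43.2 minutes of downtime, or 1,000 failed requests per million. Teams commonly slow down risky releases as the remaining fraction approaches zero.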

Medium Senior Level Observability

What is the difference between a Prometheus Gauge, Counter, and Histogram metric type?

Counter: A cumulative value that only increases (or resets to zero on restart). Use for: total requests, total errors, bytes sent. Never use for values that can go down.

Gauge: A value that can go up or down. Use for: current memory usage, active connections, queue depth, temperature.

Histogram: Samples observations and counts them in configurable buckets. Use for: request latency, response sizes. Allows you to estimate percentiles (p50, p95, p99) with histogram_quantile(), which is critical for SLOs.
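Percentiles from a histogram are estimates, because only bucket counts are stored. The sketch below shows the linear interpolation idea behind that estimate in simplified form; it is not Prometheus’s implementation, and the bucket data is invented.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper) or count == prev_count:
                return prev_bound  # cannot interpolate; use the last known bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = upper, count
    return prev_bound


latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, latency_buckets)
```

Because the result depends on where the bucket boundaries sit, it helps to place boundaries near the latency thresholds your SLOs care about.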

Medium Senior Level AWS

What is AWS ECS and when would you choose it over EKS?

ECS (Elastic Container Service) is AWS’s native container orchestrator. EKS (Elastic Kubernetes Service) is managed Kubernetes.

Choose ECS when:

Your team is AWS-native and doesn’t have Kubernetes expertise
You want lower operational overhead (no Kubernetes control plane concepts to manage)
Tight AWS service integration is a priority (IAM roles per task, ALB integration is simpler)

Choose EKS when:

You need Kubernetes-native features (CRDs, Operators, Helm ecosystem)
You have multi-cloud or hybrid requirements
Your team already has Kubernetes expertise

Medium Senior Level AWS

Explain AWS VPC and its core components (subnets, route tables, IGW, NAT).

A VPC (Virtual Private Cloud) is your isolated network within AWS.

Subnets: Subdivisions of your VPC in a specific AZ. Public subnets have a route to the IGW; private subnets do not.
Route Tables: Rules defining where traffic is directed. A public subnet’s route table has 0.0.0.0/0 → IGW.
Internet Gateway (IGW): Allows public subnets to communicate with the internet.
NAT Gateway: Allows private subnets to make outbound internet requests (e.g., pulling packages) without exposing them to inbound internet traffic.

Medium Senior Level AWS

What is the difference between an AWS Security Group and a Network ACL?

Security Groups (SGs): Stateful firewalls at the instance level. If you allow inbound traffic, the corresponding outbound response is automatically allowed. Rules are allow-only (no deny rules).

Network ACLs (NACLs): Stateless firewalls at the subnet level. You must explicitly allow both inbound and outbound traffic. Rules are evaluated in order (by rule number) and support both allow and deny.

In practice: Use Security Groups for most use cases. Use NACLs as an additional layer for blocking specific IP ranges (e.g., blocking a bad actor’s IP at the subnet boundary).
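The difference in rule evaluation can be sketched with a toy model. It covers only inbound matching on source IP and port, does not model statefulness, and the rule tuples are an assumed shape chosen for this example.

```python
import ipaddress
from typing import List, Optional, Tuple

# (rule_number, source_cidr, port or None for all ports, "allow" or "deny")
NaclRule = Tuple[int, str, Optional[int], str]
# (source_cidr, port): security group rules can only allow
SgRule = Tuple[str, int]


def nacl_allows(rules: List[NaclRule], source_ip: str, port: int) -> bool:
    """Evaluate rules in ascending rule number; the first match decides."""
    ip = ipaddress.ip_address(source_ip)
    for _number, cidr, rule_port, action in sorted(rules, key=lambda r: r[0]):
        if ip in ipaddress.ip_network(cidr) and rule_port in (None, port):
            return action == "allow"
    return False  # implicit deny when nothing matches


def sg_allows(rules: List[SgRule], source_ip: str, port: int) -> bool:
    """Traffic is allowed if any rule matches; there is no way to express deny."""
    ip = ipaddress.ip_address(source_ip)
    return any(ip in ipaddress.ip_network(cidr) and rule_port == port for cidr, rule_port in rules)
```

A low-numbered deny rule in the NACL blocks a single address even though a later rule allows everyone, which is something the security group model cannot express.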

Medium Senior Level Terraform

How do you handle sensitive values like passwords in Terraform without exposing them in state?

Terraform state files contain sensitive values in plaintext — this is a known limitation. Mitigations:

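Mark as sensitive: sensitive = true on variables and outputs prevents them from appearing in CLI output. The values are still written to state.
Avoid storing in state: Use AWS Secrets Manager or Vault to generate and store secrets externally, and have the application read them at runtime. A value read through a regular data source is still stored in state; Terraform 1.10 and later add ephemeral resources whose values are not persisted.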
Encrypt state: S3 backend with server-side encryption (SSE-KMS).
Restrict access: The S3 bucket containing state should have strict IAM policies — only CI/CD roles should have access.

Medium Senior Level Terraform

How do Terraform modules work and what makes a good module?

A Terraform module is a reusable group of resource configurations. Every directory with .tf files is a module. You call modules from a root module to avoid repeating code.

What makes a good module:

Single responsibility: One module for VPC, another for EKS, another for RDS.
Parameterized: Accept variables to customize behavior per environment.
Versioned: Pin module versions with the version argument for registry modules, or with a ?ref= tag in the source for Git modules.
Outputs: Expose useful outputs (VPC ID, subnet IDs) for other modules to consume.

Medium Senior Level Terraform

What is Terraform state and why must it be stored remotely in a team environment?

Terraform state is a JSON file (terraform.tfstate) that maps your configuration to real-world resources. Terraform uses it to know what already exists before planning changes.

Storing it locally breaks team collaboration:

Team members would each have different state files causing conflicts
State file gets lost if the local machine breaks
No locking mechanism — two engineers could run apply simultaneously and corrupt state

Remote backends (S3 + DynamoDB for locking, GCS, Terraform Cloud) solve all three problems.

terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true              # server-side encryption of the state object
    dynamodb_table = "terraform-lock"  # table used for state locking
  }
}

Medium Senior Level Docker

Explain the concept of a distroless image and its security benefits.

A distroless image contains only your application and its runtime dependencies — no shell, no package manager, no OS utilities. This comes from Google’s distroless project.

Security benefits: There is no shell to exec into, so an attacker who gains code execution cannot simply run arbitrary commands. The attack surface is dramatically reduced because there are no standard Unix tools an attacker could use to move laterally.

# Distroless multi-stage example
FROM golang:1.22 AS builder
WORKDIR /app
COPY . .
RUN CGO_ENABLED=0 go build -o server .

FROM gcr.io/distroless/static-debian12
COPY --from=builder /app/server /server
CMD ["/server"]

Medium Senior Level Docker

How do you reduce Docker image size? Walk through your optimization strategy.

Image size directly affects pull times and attack surface. Key strategies:

Use minimal base images: alpine or distroless instead of ubuntu.
Multi-stage builds: Build in a full image, copy only the binary/artifact to a slim final image.
Combine RUN commands: Each RUN creates a layer. Chain commands with && and clean up in the same layer.
Use .dockerignore: Exclude node_modules, .git, test files from the build context.

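# Multi-stage example
FROM node:20 AS builder
WORKDIR /app
# Copy manifests first so the dependency layer is cached between builds
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
# Install production dependencies only; the build output needs them at runtime
RUN npm ci --omit=dev
COPY --from=builder /app/dist ./dist
CMD ["node", "dist/index.js"]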

Medium Senior Level Kubernetes

Explain the difference between a Liveness probe, Readiness probe, and Startup probe.

Liveness Probe: Checks if the container is alive. If it fails, Kubernetes restarts the container. Use this to recover from deadlocks.

Readiness Probe: Checks if the container is ready to serve traffic. If it fails, the pod is removed from Service endpoints (no traffic sent). Use this during slow startup or when temporarily overloaded.

Startup Probe: Only runs at startup. Allows slow-starting containers enough time to initialize before liveness checks begin. Prevents liveness probes from killing a pod that is simply starting up slowly.

Medium Senior Level Kubernetes

What is a Kubernetes Ingress and how does it differ from a Service?

A Service exposes a set of pods internally or as a simple LoadBalancer. An Ingress is a set of Layer-7 (HTTP/HTTPS) routing rules, implemented by an Ingress controller, that sits in front of multiple services and routes traffic based on hostname or path.

Example: Route api.example.com to the api-service and example.com to the frontend-service using a single load balancer IP. This is far more cost-effective than having a separate LoadBalancer service for each microservice.

Advanced Questions

Enterprise orchestration, deep architectural concepts, and scaling issues.

Hard Lead / Architect Level System Design

Explain the OWASP Top 10 and which items are most relevant to DevOps engineers.

The OWASP Top 10 lists the most critical web application security risks. Most relevant to DevOps (numbering from the 2021 edition):

A01: Broken Access Control. Enforce least privilege in IAM and K8s RBAC. Verify RBAC policies in code review.
A05: Security Misconfiguration. Public S3 buckets, default credentials, exposed management ports. Caught by infrastructure scanning tools like Checkov and tfsec.
A06: Vulnerable and Outdated Components. Use Dependabot and Trivy to catch outdated dependencies with known CVEs.
A09: Security Logging and Monitoring Failures. Ensure CloudTrail, K8s audit logs, and application audit logs are enabled and shipped to a SIEM.

Hard Lead / Architect Level Linux

What is a Load Average in Linux and how do you interpret it?

Load average in top or uptime shows three numbers: 1-minute, 5-minute, and 15-minute averages of the number of processes in a runnable or uninterruptible state.

Interpretation depends on the number of CPU cores. On a 4-core server:

Load average of 4.0 = 100% utilization — every CPU busy but nothing waiting
Load average of 8.0 = 200% utilization — 4 CPUs busy, 4 processes waiting in queue
Load average of 0.5 = 12.5% utilization — plenty of headroom

Key insight: High load average is NOT always CPU. Uninterruptible sleep (disk I/O wait) also counts. Check iostat to distinguish CPU saturation from I/O saturation.
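The per-core arithmetic above can be sketched in Python as follows. The helper names and the 0.7 "busy" threshold are assumptions chosen for illustration; os.getloadavg() is only available on Unix-like systems.

```python
import os


def load_per_core(load_average: float, cores: int) -> float:
    """Normalize a load average by the number of CPU cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_average / cores


def describe(load_average: float, cores: int) -> str:
    """Classify load; the 0.7 cut-off is an assumed rule of thumb, not a standard."""
    ratio = load_per_core(load_average, cores)
    if ratio > 1.0:
        return "queued"    # more runnable/uninterruptible tasks than cores
    if ratio >= 0.7:
        return "busy"
    return "headroom"


def current_load_report() -> dict:
    """Read the live 1/5/15-minute load averages (Unix only)."""
    one, five, fifteen = os.getloadavg()
    cores = os.cpu_count() or 1
    return {"1m": describe(one, cores), "5m": describe(five, cores), "15m": describe(fifteen, cores)}
```

Note that a "queued" result says nothing about the cause: the tasks may be waiting on disk I/O rather than CPU, which is why the iostat check still matters.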

Hard Lead / Architect Level Linux

What are Linux namespaces and cgroups, and how do they enable container isolation?

Namespaces provide isolation for system resources so each container sees its own view of the system:

pid — isolated process tree (container sees its own PIDs starting at 1)
net — isolated network stack (own IP, routing table)
mnt — isolated filesystem mounts
uts — isolated hostname
user — isolated user/group IDs

cgroups (Control Groups) limit and account for resource usage (CPU, memory, I/O) per group of processes. This is how Docker enforces your CPU/memory limits.

Together: namespaces provide isolation (what can be seen), cgroups provide resource limits (how much can be used).

Hard Lead / Architect Level AWS

How do you implement least-privilege IAM policies and why is it critical?

Least-privilege means granting only the exact permissions needed to perform a task — no more. This limits blast radius if credentials are compromised.

Implementation steps:

Start with deny-all, add allows: Begin with minimal permissions and add only what’s needed.
IAM Access Analyzer: Use to identify unused permissions and generate least-privilege policies based on CloudTrail logs.
Policy conditions: Add StringEquals conditions to restrict resources by tag, region, or account.
Permission boundaries: Cap the maximum permissions a principal can have, even if attached policies are more permissive.

"Condition": {
  "StringEquals": {
    "aws:RequestedRegion": "us-east-1"
  }
}
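To show where such a Condition block sits, here is a minimal Python sketch that assembles a complete policy document around it. The function name, the s3:PutObject action, and the example-bucket ARN are placeholders chosen for illustration; only the Condition itself comes from the snippet above.

```python
import json


def region_restricted_policy(actions, resource_arn: str, region: str) -> dict:
    """Build an IAM policy document that allows the actions only in one region."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(actions),
                "Resource": resource_arn,
                # Requests targeting any other region do not match this statement
                # and fall through to the implicit deny.
                "Condition": {"StringEquals": {"aws:RequestedRegion": region}},
            }
        ],
    }


# Hypothetical bucket name, used only for the demo.
policy = region_restricted_policy(["s3:PutObject"], "arn:aws:s3:::example-bucket/*", "us-east-1")

if __name__ == "__main__":
    print(json.dumps(policy, indent=2))
```

Generating policies from code like this keeps the Action and Resource lists explicit, which makes overly broad wildcards easier to spot in review.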

Hard Lead / Architect Level Terraform

What are Terraform providers and how do you handle provider version pinning?

Providers are plugins that translate Terraform configuration into API calls to AWS, GCP, Azure, etc. Always pin provider versions to prevent unexpected changes from provider upgrades:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"  # Allows 5.x but not 6.x
    }
  }
  required_version = ">= 1.7.0"
}

provider "aws" {
  region = "us-east-1"
}

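Running terraform init records the selected provider versions and their checksums in .terraform.lock.hcl. Use terraform providers lock to add checksums for additional platforms (for example, when developers use macOS and CI runs on Linux). Commit this file to Git.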
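To make the meaning of the ~> operator in the configuration above concrete, the following Python sketch re-implements the pessimistic constraint for plain numeric versions. It is an approximation written for illustration, not Terraform's own resolver: the function names are hypothetical, and pre-release suffixes such as -beta1 are not handled.

```python
def _parse(version: str) -> tuple:
    """Turn '5.31.0' into (5, 31, 0)."""
    return tuple(int(part) for part in version.split("."))


def _pad(parts: tuple, length: int = 3) -> tuple:
    """Pad with zeros so versions of different lengths compare cleanly."""
    return parts + (0,) * (length - len(parts))


def satisfies_pessimistic(version: str, constraint: str) -> bool:
    """Check a version against '~> X.Y' or '~> X.Y.Z'.

    Only the rightmost component of the constraint may increase.
    """
    base = _parse(constraint.replace("~>", "").strip())
    if len(base) < 2:
        raise ValueError("this sketch expects at least MAJOR.MINOR")
    # Upper bound: drop the last component and bump the one before it.
    upper = base[:-2] + (base[-2] + 1,)
    return _pad(base) <= _pad(_parse(version)) < _pad(upper)
```

With this model, "~> 5.0" accepts 5.31.0 but rejects 6.0.0, while the stricter "~> 5.0.1" accepts 5.0.9 but rejects 5.1.0.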

Hard Lead / Architect Level Terraform

How do you manage multiple environments (dev/staging/prod) in Terraform? Workspaces vs. directory structure.

Two main approaches:

Terraform Workspaces: Use the same code but switch workspace to change state. Simple, but the same code runs for all environments — hard to have different variable values per environment. Suitable for simple differences.

Separate Directories (recommended): Each environment has its own directory with its own terraform.tfvars and remote state. This is explicit, auditable, and allows environments to diverge safely.

environments/
  dev/
    main.tf → calls shared module
    terraform.tfvars
  staging/
    main.tf
    terraform.tfvars
  prod/
    main.tf
    terraform.tfvars
modules/
  vpc/
  eks/

Hard Lead / Architect Level Docker

How do you implement health checks in Docker and why are they important for orchestration?

The HEALTHCHECK instruction tells Docker how to test whether a container is working correctly. Without it, Docker only knows whether the container's main process is running, so an app that is hung or deadlocked still looks fine.

HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1

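Kubernetes ignores the Dockerfile HEALTHCHECK and relies on its own liveness and readiness probes instead. In standalone Docker, Docker Compose (for example, depends_on with condition: service_healthy), and Docker Swarm, HEALTHCHECK is what tells the tooling whether a container is actually able to do work.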
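Minimal images often do not ship curl, so the check command is sometimes written in the application's own runtime instead. The sketch below shows one way to do that in Python using only the standard library; the file name healthcheck.py, the function names, and the default URL are assumptions, and it only makes sense for images that already contain a Python interpreter.

```python
import http.client
import urllib.request

DEFAULT_URL = "http://localhost:8080/health"  # assumed endpoint, matching the example above


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True only if the endpoint answers with a 2xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError, http.client.HTTPException):
        # URLError/HTTPError and timeouts derive from OSError;
        # ValueError covers malformed URLs.
        return False


def main(argv: list) -> int:
    """Exit code for HEALTHCHECK: 0 means healthy, 1 means unhealthy."""
    url = argv[1] if len(argv) > 1 else DEFAULT_URL
    return 0 if is_healthy(url) else 1
```

To use it as a script, end the file with raise SystemExit(main(sys.argv)) after importing sys, and point the Dockerfile at it with something like HEALTHCHECK CMD ["python", "healthcheck.py"].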

Hard Lead / Architect Level Docker

How would you run containers as a non-root user for security hardening?

Running containers as root is a significant security risk. If an attacker escapes the container, they have root on the host. Harden your images:

FROM node:20-alpine

# Create a non-root user
RUN addgroup -S appgroup && adduser -S appuser -G appgroup

# Set working directory and permissions
WORKDIR /app
COPY --chown=appuser:appgroup . .

# Switch to non-root user
USER appuser

CMD ["node", "index.js"]

Also enforce this at the Kubernetes level with a SecurityContext: runAsNonRoot: true.

Hard Lead / Architect Level Kubernetes

Explain Kubernetes RBAC and how you would give a service account read-only access to pods.

RBAC (Role-Based Access Control) is the authorization mechanism in Kubernetes. It uses three objects:

Role/ClusterRole: Defines what actions are allowed on which resources.
ServiceAccount: An identity for pods or external tools.
RoleBinding/ClusterRoleBinding: Links a ServiceAccount to a Role.

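apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: default   # a Role is namespaced; adjust to your namespace
rules:
- apiGroups: [""]      # "" is the core API group, where pods live
  resources: ["pods"]
  verbs: ["get", "list", "watch"]   # read-only verbs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: default
subjects:
- kind: ServiceAccount
  name: my-service-account
  namespace: default   # required for ServiceAccount subjects
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io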

How do you design a highly available and scalable microservices architecture?

Designing a highly available and scalable microservices architecture requires addressing availability, scalability, resilience, and observability at each layer.

Availability Patterns

Multiple instances: Run at least 3 replicas of each service across multiple availability zones.
Health checks: Liveness probes restart unhealthy containers; readiness probes stop traffic to unready instances.
Circuit breakers: Prevent cascading failures by stopping calls to failing services (e.g., Hystrix, Resilience4j).
Load balancing: Distribute traffic evenly across instances at layer 7 with health-aware routing.
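The circuit breaker idea can be sketched in plain Python as shown below. This is a deliberately small illustration, not the Hystrix or Resilience4j implementation: the class, its parameters, and the use of RuntimeError for rejected calls are all assumptions.

```python
import time


class CircuitBreaker:
    """Reject calls for a cool-down period after repeated failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: call rejected")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail fast instead of piling up threads and connections behind a dependency that is already struggling.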

Scalability Patterns

Horizontal scaling: Add more instances rather than bigger machines.
Auto-scaling: Scale on CPU, memory, or custom metrics (e.g., queue depth).
Stateless services: Store state in external data stores (Redis, databases) so any instance can handle any request.
Data layer: Use database sharding and read replicas to scale the data layer.
Caching: Use CDN and caching layers to reduce backend load.

Resilience Patterns

Bulkhead pattern: Isolate failures to prevent one service from exhausting shared resources.
Retry with exponential backoff: Handle transient failures gracefully.
Timeout configuration: Prevent slow services from blocking the entire request chain.
Graceful degradation: Return partial results or cached data when a dependency fails.
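Retry with exponential backoff is easy to get subtly wrong, so a minimal Python sketch may help. The function names, default values, and the choice of "full jitter" (sleeping a random time up to the computed delay) are assumptions for illustration; production code would normally retry only on errors known to be transient.

```python
import random
import time


def backoff_delays(retries: int, base: float = 0.1, cap: float = 5.0) -> list:
    """Delay ceiling before each retry: base, 2*base, 4*base, ... limited by cap."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def call_with_retry(func, retries: int = 3, base: float = 0.1, cap: float = 5.0, sleep=time.sleep):
    """Call func, retrying up to `retries` times with jittered exponential backoff."""
    delays = backoff_delays(retries, base, cap)
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            # Full jitter spreads retries out so clients do not retry in lockstep.
            sleep(random.uniform(0, delays[attempt]))
```

The cap matters as much as the doubling: without it, a long outage would push individual waits to minutes, and without jitter many clients would hammer the recovering service at the same instants.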

Service Communication

Synchronous: REST or gRPC for real-time request/response.
API Gateway: Handles external traffic routing, auth, and rate limiting.
Asynchronous: Message queues (Kafka, RabbitMQ, SQS) for event-driven communication, decoupling producers from consumers.

Observability

Tracing: Distributed tracing with correlation IDs.
Logging: Centralized logging with a structured log format.
Metrics: Per-service metrics with SLO-based alerting.
Service mesh: Istio or Linkerd for traffic management, mTLS, and observability.

Medium Senior Level Azure

What are the different Azure storage services and when should you use each?

Azure provides multiple storage services optimized for different data types and access patterns.

Azure Blob Storage: Object storage for unstructured data like images, videos, backups, and logs. Access tiers: Hot (frequent access), Cool (infrequent, 30-day minimum), Cold (rarely accessed, 90-day minimum), and Archive (offline, 180-day minimum). Supports lifecycle management to auto-tier data. Equivalent to Amazon S3 or Google Cloud Storage.

Azure Files: Fully managed file shares accessed over the SMB and NFS protocols. SMB shares can be mounted on Windows, Linux, and macOS. Use it for lift-and-shift of on-premises file servers, for shared config files, or with Azure File Sync in hybrid scenarios.

Azure Table Storage: NoSQL key-value store for semi-structured data. Serverless, auto-scaling. Good for large amounts of structured non-relational data at lower cost than Cosmos DB.

Azure Queue Storage: Message queue service for decoupling components. Messages up to 64KB. Use for async task processing and reliable messaging between services.

Azure Disk Storage: Persistent block storage for Azure VMs. Options include Premium SSD, Standard SSD, Standard HDD, and Ultra Disk. Managed disks are recommended as Azure handles availability and replication.

Azure Data Lake Storage Gen2: Hierarchical filesystem built on Blob Storage, optimized for analytics with Hadoop-compatible access. Used with Azure Databricks, HDInsight, and Synapse Analytics.

Medium Senior Level Azure

What is Azure DevOps and how does it support CI/CD pipelines?

Azure DevOps is a suite of developer services for planning, developing, testing, and delivering software. It provides integrated DevOps toolchain capabilities for Azure and third-party platforms.

Azure DevOps Services

Azure Boards: Agile project management with Kanban boards, backlogs, sprints, and work item tracking supporting Scrum and CMMI methodologies.

Azure Repos: Git repositories with branch policies, pull request workflows, and code review. Supports both Git and TFVC.

Azure Pipelines: CI/CD platform that builds, tests, and deploys code written in any language to any platform or cloud. Runs on Microsoft-hosted or self-hosted agents.

Azure Test Plans: Manual and automated test management.

Azure Artifacts: Package management supporting NuGet, npm, Maven, Python, and Universal Packages.

CI/CD with Azure Pipelines

Pipelines are defined in YAML (azure-pipelines.yml) and consist of triggers, stages, jobs, and steps.

A typical pipeline structure:

trigger: branches to watch (e.g., main)
pool: agent image (ubuntu-latest, windows-latest)
stages: Build stage builds and pushes Docker image; Deploy stage deploys to AKS using kubectl or Helm

Key Features

Environments with deployment gates and approval workflows protect production. Variable groups and Azure Key Vault integration manage secrets securely. Service connections connect to Azure, AWS, GCP, and other external services. Matrix builds enable parallel testing across multiple OS/runtime combinations. Integration with GitHub, Jira, Slack, and Microsoft Teams for notifications and collaboration.

Medium Senior Level Azure

What is Azure Active Directory (Azure AD) and how does it differ from on-premises Active Directory?

Azure Active Directory (Azure AD, now rebranded as Microsoft Entra ID) is Microsoft’s cloud-based identity and access management service. It handles authentication and authorization for Azure resources, Microsoft 365, and thousands of third-party SaaS applications.

Azure AD vs On-Premises Active Directory

On-premises AD DS uses Kerberos and NTLM protocols, is structured around OUs, domains, and forests, and uses LDAP for querying and Group Policy for management. It is designed for traditional Windows environments.

Azure AD uses OAuth2, OpenID Connect, and SAML. There are no OUs, forests, or Kerberos by default. It provides SSO across cloud apps and supports modern identity scenarios like MFA, Conditional Access, and Identity Protection.

Key Azure AD Concepts

Tenants: An isolated instance of Azure AD representing an organization. Each Azure subscription is associated with one tenant.

App Registrations: Applications register with Azure AD to get credentials for OAuth2 authentication flows.

Service Principals: Identities for applications and automation to authenticate with Azure resources.

Managed Identities: Azure-managed identities for Azure resources like VMs, App Service, and AKS that eliminate the need for storing credentials in code. System-assigned identities follow resource lifecycle; user-assigned identities have independent lifecycle.

Conditional Access: Policy-based access controls evaluating sign-in risk, device compliance, location, and other signals to grant or block access.

Azure AD Connect (now Microsoft Entra Connect): Synchronizes on-premises AD DS identities to Azure AD for hybrid identity scenarios.

Medium Senior Level Azure

What is Azure Kubernetes Service (AKS) and how do you deploy and manage a cluster?

Azure Kubernetes Service (AKS) is a managed Kubernetes service that simplifies deploying, managing, and scaling containerized applications on Azure. Microsoft manages the control plane (API server, etcd, scheduler) while you manage the worker nodes.

Key AKS Concepts

Node Pools: Groups of VMs with the same configuration. AKS supports system node pools for system pods like CoreDNS and user node pools for application workloads. Multiple node pools can have different VM sizes and scaling settings.

Virtual Nodes: AKS integrates with Azure Container Instances via virtual nodes, enabling burst scaling to serverless containers without managing additional VMs.

Networking: AKS supports Kubenet (simple, assigns IPs from separate address space) and Azure CNI (assigns IPs directly from VNet for better performance and Azure service integration).

Deploying an AKS Cluster

az aks create --resource-group myRG --name myAKS --node-count 3 --enable-addons monitoring --generate-ssh-keys
az aks get-credentials --resource-group myRG --name myAKS
kubectl get nodes

Key AKS Features

Managed control plane with automatic upgrades and patching. Azure AD integration for RBAC authentication. Azure Monitor and Container Insights for observability. Azure Policy for compliance. Cluster autoscaler for automatic node scaling. Azure Disk and Azure Files for persistent volumes. Private cluster option restricts API server access to VNet. Integration with Azure DevOps and GitHub Actions for CI/CD pipelines.

Medium Senior Level Azure

What is Azure Resource Manager (ARM) and how does it differ from the classic deployment model?

Azure Resource Manager (ARM) is the deployment and management service for Azure. It provides a consistent management layer that enables you to create, update, and delete resources in your Azure subscription using infrastructure as code.

ARM vs Classic Deployment Model

The Classic deployment model (also called Azure Service Manager or ASM) was the original Azure deployment system. It treated resources individually and lacked the ability to manage them as a group. ARM replaced it with a resource-group-based approach.

Key differences:

Resource groups: In ARM, resources are organized into resource groups, which are logical containers for resources that share the same lifecycle. Classic had no resource grouping.
Templates: ARM enables declarative templates (ARM templates or Bicep) to define infrastructure. Classic required scripting each resource individually.
Access control: ARM supports role-based access control (RBAC) at the resource, resource group, or subscription level. Classic had limited access controls.
Dependencies: ARM tracks dependencies between resources and deploys them in the correct order.
Tags: ARM supports tags on resources for cost tracking and organization.

ARM Templates

ARM templates are JSON files that define the infrastructure and configuration for your project. They follow an idempotent deployment model where you define the desired state and ARM ensures the environment matches. Bicep is a domain-specific language (DSL) that compiles to ARM templates and provides cleaner syntax.

Resource Groups

A resource group is a logical container where Azure resources are deployed and managed. All resources in a group share the same lifecycle – you can deploy, update, or delete them together. Resources in the same group can be in different regions. Resource groups enable cost management and access control at a group level.

Medium Senior Level GCP

How does GCP VPC networking work and what are Shared VPC and VPC peering?

Google Cloud VPC (Virtual Private Cloud) is a global, private network that provides connectivity for GCP resources. Unlike AWS VPCs which are regional, GCP VPCs are global by default with subnets in specific regions.

GCP VPC Key Characteristics

Global VPC: A single VPC spans all GCP regions. Resources in the same VPC can communicate across regions using internal IPs without extra configuration.

Subnets: Regional resources with a defined CIDR range. Two modes exist: auto mode auto-creates subnets in each region, custom mode gives full control over all subnets.

Firewall Rules: Applied at the VPC level using tags or service accounts to target instances. Rules are stateful. Unlike AWS, there are no network ACLs – all filtering is done through firewall rules.

Shared VPC

Shared VPC allows a host project to share its VPC network with service projects. Multiple projects share the same networking while keeping workloads isolated per project. The host project owns and manages the VPC, subnets, and firewall rules while service projects deploy resources into the shared subnets.

Use Shared VPC for centralized network administration, consistent firewall policy enforcement, and simplifying inter-project connectivity within an organization.

VPC Peering

VPC Peering connects two VPCs so resources can communicate using internal IPs without routing through the public internet. Peering works across projects and organizations. Peering is non-transitive: if VPC A peers with B and B peers with C, A cannot reach C through B.
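Non-transitivity is easiest to see with a tiny model. In the Python sketch below, a peering is an unordered pair of VPC names and reachability is checked against direct peerings only; the names vpc-a, vpc-b, vpc-c and the can_reach helper are hypothetical, and the model ignores routes and firewall rules.

```python
def can_reach(peerings: set, src: str, dst: str) -> bool:
    """Peering is non-transitive: only directly peered VPCs can exchange traffic."""
    return src == dst or frozenset((src, dst)) in peerings


# A peers with B, and B peers with C. There is no A-C peering.
PEERINGS = {
    frozenset(("vpc-a", "vpc-b")),
    frozenset(("vpc-b", "vpc-c")),
}

if __name__ == "__main__":
    print(can_reach(PEERINGS, "vpc-a", "vpc-b"))
    print(can_reach(PEERINGS, "vpc-a", "vpc-c"))
```

Making vpc-a and vpc-c reachable requires adding a third, direct peering between them, which is why large topologies often move to Shared VPC or a hub design instead.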

Use VPC Peering for connecting VPCs in different projects or organizations, sharing services privately, and achieving lower latency compared to external routing.

Medium Senior Level GCP

What is BigQuery and how does it differ from traditional relational databases?

BigQuery is Google Cloud's fully managed, serverless data warehouse designed for large-scale analytics. It runs SQL over terabytes of data in seconds and scales to petabytes without requiring infrastructure management.

Key Architecture Differences

BigQuery uses columnar storage (Capacitor format) which makes analytical queries fast by reading only relevant columns rather than entire rows. Traditional RDBMS use row-based storage optimized for transactional workloads with frequent single-record reads and writes.

BigQuery separates compute and storage, allowing each to scale independently. Traditional databases tightly couple compute and storage on the same server.

BigQuery uses a distributed query engine (Dremel) that automatically parallelizes queries across thousands of nodes. Traditional databases are typically single-node or manually sharded.

BigQuery vs Traditional Databases

BigQuery excels at OLAP workloads: aggregations, joins across billions of rows, analytics dashboards. Traditional RDBMS (PostgreSQL, MySQL) excel at OLTP: fast single-row inserts/updates with ACID transactional guarantees.

BigQuery requires no indexes, vacuuming, or manual storage tuning. Pricing is either on-demand (per bytes scanned) or capacity-based (reserved slots, which replaced the older flat-rate plans). Traditional databases require DBA management, index tuning, and ongoing optimization.

Key BigQuery Features

Partitioning by date, integer range, or ingestion time reduces scan costs. Clustering on frequently filtered columns improves query performance. BigQuery ML trains and runs machine learning models using SQL. The Storage Write API supports streaming ingestion. Native integration with Dataflow, Pub/Sub, Looker, and Looker Studio (formerly Data Studio).

Medium Senior Level GCP

What are the different storage classes in Google Cloud Storage and when should you use each?

Google Cloud Storage (GCS) is a unified object storage service for unstructured data. It offers four storage classes optimized for different access patterns and cost requirements.

Standard Storage: Best for frequently accessed or hot data. No minimum storage duration. Highest storage cost, lowest access cost. Use for active website content, mobile apps, gaming assets, and data analytics requiring low latency.

Nearline Storage: Best for data accessed less than once per month. 30-day minimum storage duration. Lower storage cost than Standard with a small retrieval fee. Use for backups, long-tail multimedia content, and monthly-accessed archives.

Coldline Storage: Best for data accessed less than once per quarter. 90-day minimum storage duration. Very low storage cost with higher retrieval fee. Use for disaster recovery, compliance archives, and infrequently accessed backups.

Archive Storage: Best for data accessed less than once per year. 365-day minimum storage duration. Lowest storage cost with highest retrieval fee. Use for long-term preservation, regulatory compliance data, and cold archival storage.
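The access-frequency guidance above can be condensed into a small helper. The Python sketch below is a rule of thumb built only from the thresholds and minimum durations listed here; the function name is hypothetical, and a real decision would also weigh retrieval fees and early-deletion charges.

```python
# Minimum storage duration in days, as listed above.
MIN_STORAGE_DAYS = {"STANDARD": 0, "NEARLINE": 30, "COLDLINE": 90, "ARCHIVE": 365}


def pick_storage_class(days_between_accesses: float) -> str:
    """Suggest a storage class from the expected gap between accesses."""
    if days_between_accesses < 0:
        raise ValueError("days_between_accesses must not be negative")
    if days_between_accesses > 365:   # less than once per year
        return "ARCHIVE"
    if days_between_accesses > 90:    # less than once per quarter
        return "COLDLINE"
    if days_between_accesses > 30:    # less than once per month
        return "NEARLINE"
    return "STANDARD"
```

For example, backups restored roughly every 45 days map to NEARLINE, while records kept for multi-year compliance map to ARCHIVE.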

Key GCS Features

All storage classes share the same API. Object versioning supports recovery from accidental deletions. Lifecycle policies automate transitions between storage classes. Strong consistency for all read/write operations. Bucket Lock and retention policies for compliance. Signed URLs provide time-limited access to requesters who do not have Google credentials.

Medium Senior Level GCP

What is Google Cloud Pub/Sub and how does it differ from traditional message queues?

Google Cloud Pub/Sub is a fully managed, real-time messaging service that enables asynchronous communication between independent applications at scale. It follows the publish-subscribe pattern where publishers send messages to topics and subscribers receive messages from subscriptions.

Core Concepts

Topic: A named resource to which publishers send messages.
Subscription: A named resource representing the stream of messages from a single, specific topic.
Publisher: Application that creates and sends messages to a topic.
Subscriber: Application that receives messages from a subscription.

Delivery Models

Pull delivery: Subscriber explicitly calls an API to retrieve messages. Suitable for batch processing and when the subscriber controls the rate.
Push delivery: Pub/Sub sends messages to a webhook endpoint. Suitable for real-time processing and serverless architectures.

How Pub/Sub Differs from Traditional Message Queues

Traditional queues like RabbitMQ or SQS use a point-to-point model where each message is consumed by one consumer. Pub/Sub supports fan-out natively – one message can be delivered to multiple subscriptions simultaneously.
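The difference between fan-out and point-to-point delivery can be shown with an in-memory model. The Python sketch below is not the google-cloud-pubsub client API; the Topic class and its methods are hypothetical and exist only to illustrate that every subscription receives its own copy of each message.

```python
from collections import deque


class Topic:
    """Toy model: each subscription keeps its own backlog of messages."""

    def __init__(self):
        self._subscriptions = {}

    def create_subscription(self, name: str) -> None:
        self._subscriptions[name] = deque()

    def publish(self, message: str) -> None:
        # Fan-out: the message is copied into every subscription's backlog.
        for backlog in self._subscriptions.values():
            backlog.append(message)

    def pull(self, subscription: str):
        """Return the next message for one subscription, or None if it is empty."""
        backlog = self._subscriptions[subscription]
        return backlog.popleft() if backlog else None


if __name__ == "__main__":
    topic = Topic()
    topic.create_subscription("billing")
    topic.create_subscription("email")
    topic.publish("order-created")
    print(topic.pull("billing"), topic.pull("email"))
```

Consumers that share one subscription compete for its messages, as with a classic queue, while separate subscriptions such as billing and email each see the full stream.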

Pub/Sub is fully serverless and scales automatically to millions of messages per second. It integrates natively with Dataflow, BigQuery, Cloud Storage, and Cloud Functions for streaming pipelines.

Key Use Cases

Event-driven microservices decoupling. Stream analytics with Dataflow. Log aggregation and metric collection. IoT data ingestion from millions of devices. Triggering Cloud Functions or Cloud Run services on events.

Medium Senior Level GCP

How does GCP IAM work and what is the difference between service accounts and user accounts?

GCP IAM (Identity and Access Management) controls who can do what on which GCP resources. It uses a policy-based model with three main components: principals (who), roles (what permissions), and resources (which resources).

Key IAM Concepts

Principals: Google accounts, service accounts, Google groups, Google Workspace domains, or Cloud Identity domains.

Roles: Collections of permissions. Three types exist — Basic roles (Owner, Editor, Viewer), Predefined roles (fine-grained, service-specific), and Custom roles (user-defined).

Policies: Bindings that attach roles to principals on a resource.

Service Accounts vs User Accounts

User Accounts represent a human user (developer, admin). They authenticate with passwords and OAuth2, are managed in Google/Cloud Identity, and are used for interactive access like gcloud CLI or Console.

Service Accounts represent an application or workload (non-human). They authenticate using cryptographic keys or workload identity, are managed in GCP per project, and are used by VMs, Cloud Functions, GKE pods, etc.

Key Differences

Service accounts have no password — they use RSA key pairs or metadata server tokens. Service accounts can be impersonated by other principals (act as). GKE Workload Identity links Kubernetes service accounts to GCP service accounts, eliminating key files. Service account keys should be rotated regularly and avoided when possible — prefer Workload Identity or ADC (Application Default Credentials).

Best Practices

Apply principle of least privilege — grant minimum required permissions. Use predefined roles over basic roles. Avoid using Editor/Owner roles in production. Use Workload Identity for GKE workloads instead of key files. Audit IAM policies with Cloud Asset Inventory and IAM Recommender.

Medium Senior Level GCP

What is Google Cloud Run and when should you use it instead of GKE?

Google Cloud Run is a fully managed serverless container platform that automatically scales containerized workloads, including to zero when not in use.

What is Cloud Run?

Cloud Run runs any stateless container that listens for HTTP requests. You bring a container image, and Cloud Run handles the infrastructure: load balancing, autoscaling, and TLS termination, with usage-based billing.

Key Characteristics

Serverless: No infrastructure management, scales to zero
Pay-per-use: Billed per request + CPU/memory during request processing
Knative-based: Built on open Knative standards
Any language/framework: Works with any Docker container

Cloud Run vs GKE

Aspect	Cloud Run	GKE
Infrastructure	Fully managed	Partially managed
Scaling	Automatic (0 to N)	Manual/HPA
Cost model	Per request	Per node hour
Startup time	Cold starts ~1-2s	N/A (pods warm)
Stateful workloads	No	Yes
Custom networking	Limited	Full control
Persistent storage	No (use GCS/CloudSQL)	Yes (PV/PVC)
Long-running jobs	Limited (timeout)	Yes

When to Use Cloud Run

API backends and microservices: HTTP APIs that can be stateless
Event-driven workloads: Triggered by Pub/Sub, Cloud Scheduler, Eventarc
Batch processing: Short-lived tasks from message queues
Variable or spiky traffic: Scales to zero saves costs for low-traffic services
Prototyping and MVPs: Fast deployment without cluster setup

When to Use GKE

Stateful applications: Databases, message brokers with persistent storage
Long-running background jobs: No timeout constraints
Complex networking: Service mesh, custom ingress controllers
GPU/specialized hardware: Machine learning training workloads
Multiple containers per pod: Sidecar patterns (Envoy, log agents)
Fine-grained scaling control: Custom HPA metrics

Cloud Run Example

# Deploy a container to Cloud Run
gcloud run deploy my-service \
  --image gcr.io/PROJECT/my-app:latest \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated \
  --min-instances 1 \
  --max-instances 100 \
  --memory 512Mi

Medium Senior Level GCP

What is Google Kubernetes Engine (GKE) and how does it differ from self-managed Kubernetes?

Google Kubernetes Engine (GKE) is a fully managed Kubernetes service on Google Cloud that handles the complexity of managing Kubernetes clusters, letting teams focus on running applications.

GKE vs Self-Managed Kubernetes

Control Plane Management

GKE: Google manages the control plane (API server, etcd, scheduler, controller manager) in both Standard and Autopilot modes. You do not provision or pay for control plane VMs directly; GKE bills a per-cluster management fee instead.
Self-managed: You provision, configure, secure, upgrade, and monitor all control plane components.

Node Management

GKE Standard: You manage node pools; Google handles OS patching, automatic repairs, and upgrades with your configured policies.
GKE Autopilot: Google manages nodes entirely – you only pay per Pod, not per node.
Self-managed: Full responsibility for node provisioning, OS updates, and scaling.

GKE Key Features

Release Channels

Rapid: Latest Kubernetes versions for early testing
Regular: Balanced stability (default)
Stable: Maximum stability for production

Auto Upgrade and Auto Repair

GKE automatically upgrades node pools to match the cluster version and repairs unhealthy nodes.

Workload Identity

Secure way for pods to access GCP services without service account keys:

gcloud container clusters create my-cluster \
  --workload-pool=PROJECT_ID.svc.id.goog

Node Pools

Groups of nodes with the same configuration (machine type, labels, taints). You can have multiple node pools for different workload types (CPU-optimized, GPU, spot).

GKE Autopilot

Fully managed Kubernetes:

Per-Pod billing (no unused node capacity costs)
Automatically optimizes resource requests
Built-in security baselines enforced
Google manages all node infrastructure

GKE Modes Comparison

Feature	GKE Standard	GKE Autopilot	Self-managed
Node management	You + Google	Google	You
Control plane	Managed	Managed	Self-managed
Cost model	Per node	Per pod	Infrastructure cost
Flexibility	High	Medium	Full
Operational overhead	Low	Minimal	High

Cloud-Native Integrations

Cloud Load Balancing: Automatic L7/L4 load balancer provisioning
Storage: Persistent Disk and Filestore integration for persistent volumes
Cloud Monitoring/Logging: Built-in observability with Cloud Operations
Binary Authorization: Policy enforcement for container images
Anthos: Multi-cloud and on-premises cluster management

Medium Senior Level Docker

What are the different Docker networking modes and when would you use each?

Docker provides several networking modes (drivers) that control how containers communicate with each other and the outside world.

Network Modes Overview

1. Bridge (Default)

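On a user-defined bridge network, containers can reach each other by IP address or by container name. On the default bridge (docker0), only IP addresses work unless the legacy --link option is used. Containers are isolated from the host network.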

# Default bridge (docker0)
docker run -d --name web nginx

# Custom bridge network (recommended)
docker network create mynet
docker run -d --name web --network mynet nginx
docker run -d --name app --network mynet myapp
# 'app' can reach 'web' by hostname 'web'

Use when: Most container-to-container communication within a single host.

2. Host

Container shares the host’s network namespace. No network isolation, maximum performance.

docker run -d --network host nginx
# Now nginx listens on host's port 80 directly

Use when: High-performance networking, network monitoring tools, when you need host-level network access. Host mode is native to Linux; on Docker Desktop for Mac/Windows it is limited and, in recent versions, must be explicitly enabled.

3. None

Container has no network interface (only loopback). Complete network isolation.

docker run -d --network none myapp

Use when: Batch processing jobs that don’t need network access, maximum security isolation.

4. Overlay

Enables communication between containers on different Docker hosts. Used with Docker Swarm.

docker network create --driver overlay myoverlay

Use when: Multi-host deployments, Docker Swarm services that span multiple nodes.

5. Macvlan

Assigns a MAC address to a container, making it appear as a physical device on the network.

docker network create -d macvlan \
  --subnet=192.168.1.0/24 \
  --gateway=192.168.1.1 \
  -o parent=eth0 mymacvlan

Use when: Legacy applications that expect to be directly connected to the physical network, network monitoring.

6. IPvlan

Similar to Macvlan but containers share the host’s MAC address.

Use when: When MAC address proliferation is a concern on the network switch.

Comparison

Mode	Isolation	Performance	Use Case
Bridge	Medium	Good	Default, single host
Host	None	Best	High performance
None	Complete	N/A	Batch jobs
Overlay	Medium	Medium	Multi-host/Swarm
Macvlan	High	High	Legacy/physical apps

Best Practice

Always use custom bridge networks over the default bridge. Custom networks provide:

Automatic DNS resolution by container name
Better isolation
Dynamic connect/disconnect of containers

Medium Senior Level Docker

What are Docker multi-stage builds and how do they reduce image size?

Multi-stage builds allow you to use multiple FROM statements in a Dockerfile, enabling you to use a large build image for compilation while producing a minimal final image.

The Problem Without Multi-Stage Builds

Traditionally, developers either maintained separate Dockerfiles for development and production or shipped a single image containing everything:

Development: Includes SDK, build tools, test dependencies
Production: Should only contain the runtime artifact

Result: Either two Dockerfiles that drift apart, or large production images with unnecessary build tools, a larger attack surface, and slower deployments.

How Multi-Stage Builds Work

Each FROM instruction starts a new build stage. You can selectively copy artifacts from one stage to another, leaving behind everything you don’t need.

Example: Go Application

# Stage 1: Build
FROM golang:1.21 AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o main .

# Stage 2: Production
FROM scratch
COPY --from=builder /app/main /main
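# scratch has no CA bundle; copy it from the builder so outbound TLS works
COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/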
EXPOSE 8080
CMD ["/main"]

Result: Builder image ~800MB → Production image ~10MB

Example: Node.js Application

# Stage 1: Dependencies
FROM node:20-alpine AS deps
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev

# Stage 2: Build
FROM node:20-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 3: Production
FROM node:20-alpine
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY --from=build /app/dist ./dist
CMD ["node", "dist/server.js"]

Targeting Specific Stages

# Build only up to a specific stage (useful for testing)
docker build --target builder -t myapp:builder .

# Build the final production image
docker build -t myapp:latest .

Benefits

Smaller images: Only production artifacts in final image
Better security: No compilers, debuggers, or dev tools in production
Single Dockerfile: No need to maintain separate dev/prod Dockerfiles
Improved caching: Each stage caches independently
Faster CI: Parallel stages with BuildKit

Medium Senior Level CI/CD

How do you implement container image security scanning in a CI/CD pipeline?

Container image security scanning is a critical component of modern DevSecOps pipelines that detects vulnerabilities in container images before they reach production.

Why Scan Container Images?

Detect CVEs (Common Vulnerabilities and Exposures) in OS packages and application dependencies
Ensure base images are up-to-date and patched
Enforce compliance requirements
Prevent vulnerable images from reaching production

Common Scanning Tools

Trivy (Recommended)

Open-source, comprehensive vulnerability scanner by Aqua Security.

# GitHub Actions example
- name: Run Trivy vulnerability scanner
  uses: aquasecurity/trivy-action@master  # pin to a release tag or commit SHA in real pipelines
  with:
    image-ref: 'my-registry/my-app:${{ github.sha }}'
    format: 'sarif'
    exit-code: '1'
    severity: 'CRITICAL,HIGH'
    output: 'trivy-results.sarif'

Grype

Open-source vulnerability scanner by Anchore.

grype my-registry/my-app:latest --fail-on critical

Snyk

Commercial tool with broad language and container support.

snyk container test my-registry/my-app:latest \
  --file=Dockerfile --severity-threshold=high

ECR Image Scanning (AWS)

For images pushed to ECR, AWS offers basic scanning (on push or on demand) and enhanced scanning powered by Amazon Inspector, which also re-scans images as new CVEs are published.

Integration Strategies

Shift Left Approach

Build stage: Scan immediately after docker build
Registry push gate: Block push if critical CVEs found
Continuous monitoring: Re-scan images in registry periodically

Sample Pipeline Stage (GitHub Actions)

jobs:
  build-and-scan:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4

    - name: Build image
      run: docker build -t myapp:${{ github.sha }} .

    # Assumes the trivy CLI is already installed on the runner
    - name: Scan image
      run: |
        trivy image --exit-code 1 \
          --severity CRITICAL,HIGH \
          myapp:${{ github.sha }}

    - name: Push image (only if scan passes)
      run: docker push myapp:${{ github.sha }}

Best Practices

Use minimal base images (distroless, alpine) to reduce attack surface
Set severity thresholds – block on CRITICAL, warn on HIGH
Scan at multiple stages: Dockerfile, built image, registry, runtime
Update base images regularly in Dockerfile
Ignore false positives using .trivyignore with tracked justifications
Integrate SBOM (Software Bill of Materials) generation alongside scanning
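
The "block on CRITICAL, warn on HIGH" threshold can be sketched as a small gate over Trivy's JSON report (trivy image --format json). This is a minimal illustration, not part of Trivy itself: the gate function and the BLOCKING/WARNING sets are hypothetical names, and the CVE IDs in the sample are made up.

```python
import json

BLOCKING = {"CRITICAL"}  # severities that fail the build
WARNING = {"HIGH"}       # severities that are only reported


def gate(report_json: str) -> tuple[bool, dict[str, int]]:
    """Return (passed, counts by severity) for a Trivy JSON report."""
    report = json.loads(report_json)
    counts: dict[str, int] = {}
    for result in report.get("Results", []):
        # "Vulnerabilities" is absent or null when a target is clean.
        for vuln in result.get("Vulnerabilities") or []:
            severity = vuln.get("Severity", "UNKNOWN")
            counts[severity] = counts.get(severity, 0) + 1
    passed = not any(counts.get(s, 0) for s in BLOCKING)
    return passed, counts


if __name__ == "__main__":
    # Made-up sample report with one HIGH and one CRITICAL finding.
    sample = json.dumps({"Results": [{"Target": "myapp", "Vulnerabilities": [
        {"VulnerabilityID": "CVE-0000-0001", "Severity": "HIGH"},
        {"VulnerabilityID": "CVE-0000-0002", "Severity": "CRITICAL"},
    ]}]})
    passed, counts = gate(sample)
    for severity in sorted(WARNING & counts.keys()):
        print(f"warning: {counts[severity]} {severity} finding(s)")
    print("pass" if passed else "fail", counts)
```

In a pipeline, a non-zero exit on a failed gate would block the push step.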

Medium Senior Level CI/CD

What is ArgoCD and how does it implement GitOps for Kubernetes deployments?

ArgoCD is a declarative, GitOps continuous delivery tool for Kubernetes that synchronizes application state from Git repositories to Kubernetes clusters.

How ArgoCD Works

ArgoCD follows the GitOps principle: Git is the single source of truth for application definitions. It continuously monitors Git repositories and Kubernetes clusters, reconciling any differences.

Core Workflow

Developer commits Kubernetes manifests (or Helm charts) to Git
ArgoCD detects the change in the Git repository
ArgoCD compares desired state (Git) vs actual state (cluster)
ArgoCD syncs the cluster to match Git (automatically or with approval)

Key Concepts

Application

An Application represents a deployed instance of your Kubernetes workload.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd  # Applications live in the Argo CD namespace by default
spec:
  project: default   # required field
  source:
    repoURL: https://github.com/org/app-gitops
    targetRevision: HEAD
    path: k8s/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

Sync Policies

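Manual: Requires a user to trigger each sync (UI, CLI, or API)
Automated: Syncs automatically when Git and the cluster differ
prune (option of automated): Deletes resources that were removed from Git
selfHeal (option of automated): Also re-syncs when the live cluster is changed manually; without it, only new Git changes trigger a sync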
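
The desired-versus-live comparison and the effect of prune and selfHeal can be sketched as below. This is a simplified model for illustration, not Argo CD's implementation: plan_sync and should_auto_sync are hypothetical names, and resources are reduced to name → spec dictionaries.

```python
def plan_sync(desired: dict, live: dict, *, prune: bool = False) -> list[tuple[str, str]]:
    """Compute the actions needed to make `live` match `desired` (Git)."""
    actions = []
    for name, spec in desired.items():
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            actions.append(("update", name))
    if prune:
        # Resources that exist in the cluster but no longer in Git.
        actions += [("delete", name) for name in live if name not in desired]
    return actions


def should_auto_sync(git_changed: bool, cluster_drifted: bool, self_heal: bool) -> bool:
    """Automated sync reacts to Git changes; cluster-only drift needs selfHeal."""
    return git_changed or (cluster_drifted and self_heal)


if __name__ == "__main__":
    desired = {"deploy/web": {"replicas": 3}, "svc/web": {}}
    live = {"deploy/web": {"replicas": 5}, "cm/old": {}}
    print(plan_sync(desired, live, prune=True))
```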

App of Apps Pattern

A parent Application manages child Applications, enabling management of multiple applications across environments.

ArgoCD vs Traditional CI/CD

Aspect	Traditional CI/CD	ArgoCD GitOps
Trigger	Push-based (CI pushes to cluster)	Pull-based (ArgoCD pulls from Git)
Credentials	CI has cluster access	ArgoCD has cluster access (no external credentials)
Drift detection	None	Continuous monitoring
Rollback	Re-run pipeline	Git revert
Audit trail	CI logs	Git history

Multi-cluster Management

ArgoCD can manage multiple Kubernetes clusters from a single control plane:

Register external clusters as ArgoCD destinations
Deploy the same application across dev/staging/prod clusters
Use ApplicationSets for templated multi-cluster deployments

Integration with CI

Typical pattern:

CI (GitHub Actions, Jenkins) builds, tests, and pushes Docker image
CI updates image tag in the GitOps repository
ArgoCD detects the change and deploys to Kubernetes

Medium Senior Level AWS

What is the difference between Amazon RDS and Aurora, and when should you use each?

Amazon RDS and Aurora are both managed relational database services from AWS, but they differ significantly in architecture, performance, and capabilities.

Amazon RDS

RDS is a managed service that handles common database administration tasks for traditional database engines.

Supported Engines

MySQL, PostgreSQL, MariaDB
Oracle, Microsoft SQL Server
Db2

Architecture

Traditional single-server or Multi-AZ setup
Synchronous replication for Multi-AZ standby
EBS-backed storage (gp2/gp3, io1/io2)
Up to 15 read replicas for MySQL, MariaDB, and PostgreSQL; up to 5 for Oracle and SQL Server

Key Features

Automated backups, patching, monitoring
Multi-AZ for high availability (the standby in a Multi-AZ instance deployment is not readable; Multi-AZ DB clusters add two readable standbys)
Point-in-time recovery
Familiar database engine compatibility

Amazon Aurora

Aurora is a cloud-native relational database engine built from the ground up for cloud performance and availability.

Supported Engines

Aurora MySQL (compatible with MySQL 5.7/8.0)
Aurora PostgreSQL (compatible with PostgreSQL)

Architecture

Distributed, shared storage layer across 6 copies in 3 AZs
Storage automatically scales from 10GB to 128TB
Up to 15 Aurora Replicas (all readable)
Continuous backup to S3

Performance Advantages

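Up to 5x the throughput of standard MySQL (AWS's figure; actual gains depend on workload)
Up to 3x the throughput of standard PostgreSQL (AWS's figure; actual gains depend on workload)
Faster failover: typically under 30 seconds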

Additional Features

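Aurora Serverless: Auto-scales compute up and down with load, including down to zero when idle
Aurora Global Database: Multi-region replication with typically < 1 second lag
Aurora Multi-Master: Multiple read-write instances (offered only on the now-retired Aurora MySQL version 1; not available on current versions)
Backtrack: Roll back the database to a specific point in time without a restore (Aurora MySQL only)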

Comparison

Feature	RDS	Aurora
Engines	MySQL, PostgreSQL, MariaDB, Oracle, SQL Server, Db2	MySQL, PostgreSQL
Storage	Per-instance EBS volume	Distributed cluster volume
Read Replicas	Up to 15 (5 for Oracle/SQL Server)	Up to 15
Failover	1-2 minutes	< 30 seconds
Storage scaling	Manual or opt-in autoscaling	Automatic
Cost	Lower for simple workloads	Higher base cost

When to Use Each

Use RDS when:

Running Oracle or SQL Server (no Aurora equivalent)
Cost is primary concern for small workloads
You need exact MySQL/PostgreSQL feature compatibility

Use Aurora when:

High performance and availability are critical
Multi-region replication required
Serverless or variable workload patterns
Read-heavy workloads that need many low-lag replicas sharing one storage volume

Medium Senior Level AWS

How does AWS Auto Scaling work and what are the different scaling policies?

AWS Auto Scaling monitors your applications and automatically adjusts compute capacity to maintain steady, predictable performance at the lowest possible cost.

Core Components

Auto Scaling Group (ASG)

Defines the group of EC2 instances to scale
Specifies minimum, maximum, and desired capacity
Distributes instances across multiple Availability Zones

Launch Template / Launch Configuration

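Defines the instance configuration (AMI, instance type, key pair, security groups). Launch configurations are legacy; AWS recommends launch templates, which support versioning and newer EC2 features.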

Health Checks

EC2 health checks (default)
ELB health checks (recommended for web apps)

Scaling Policies

1. Target Tracking Scaling

Maintains a specific metric at a target value automatically.

Example: Keep average CPU utilization at 60%
- AWS automatically adds/removes instances to maintain this target

Best for most use cases – simple to configure and responsive.
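
The idea behind target tracking can be approximated with a proportional formula: scale capacity by the ratio of the observed metric to the target, then clamp to the group's limits. AWS does not publish its exact algorithm, so the sketch below is only an illustrative model; desired_capacity and its parameters are hypothetical names.

```python
import math


def desired_capacity(current: int, metric: float, target: float,
                     min_size: int, max_size: int) -> int:
    """Proportional estimate of the capacity needed to bring `metric` to `target`."""
    raw = math.ceil(current * metric / target)
    # Never leave the Auto Scaling group's min/max bounds.
    return max(min_size, min(max_size, raw))


if __name__ == "__main__":
    # 4 instances at 90% average CPU with a 60% target
    print(desired_capacity(current=4, metric=90, target=60, min_size=1, max_size=10))
```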

2. Step Scaling

Scales based on CloudWatch alarm breaches with step adjustments.

Example:
- CPU 60-70%: Add 2 instances
- CPU 70-90%: Add 4 instances  
- CPU > 90%: Add 8 instances
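
The step table above maps onto a simple interval lookup. A minimal sketch, where STEPS and step_adjustment are hypothetical names and the boundary handling (lower bound inclusive, upper bound exclusive) is an assumption made for the example:

```python
# (lower bound inclusive, upper bound exclusive, instances to add)
STEPS = [
    (60, 70, 2),
    (70, 90, 4),
    (90, float("inf"), 8),
]


def step_adjustment(cpu_percent: float) -> int:
    """Return how many instances to add for the given average CPU."""
    for lower, upper, add in STEPS:
        if lower <= cpu_percent < upper:
            return add
    return 0  # below the alarm threshold: no scale-out


if __name__ == "__main__":
    for cpu in (50, 65, 75, 95):
        print(cpu, step_adjustment(cpu))
```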

3. Simple Scaling

Legacy policy: adds or removes a fixed number of instances based on a single alarm, then waits for the cooldown to expire before responding to further alarms.
Prefer Target Tracking or Step Scaling instead.

4. Scheduled Scaling

Scales based on predictable load patterns.

Example: Increase to 20 instances every Monday 8 AM,
reduce to 5 instances every Friday 8 PM

5. Predictive Scaling

Uses ML to predict future traffic and proactively scales in advance.

Analyzes historical patterns
Creates scaling schedules automatically
Ideal for cyclical traffic patterns

Lifecycle Hooks

Hooks allow you to run custom actions when instances launch or terminate:

Launch hook: Install software, run tests before instance joins the group
Terminate hook: Drain connections, backup data before termination

Best Practices

Use Target Tracking as the primary policy
Enable multiple AZs for fault tolerance
Use launch templates over launch configurations
Set appropriate cooldown periods to prevent rapid scaling oscillation
Use warm pools for applications with long startup times

Medium Senior Level AWS

What is the difference between ALB, NLB, and CLB in AWS?

AWS provides three types of load balancers under the Elastic Load Balancing (ELB) service, each designed for different use cases.

Application Load Balancer (ALB)

Operates at Layer 7 (HTTP/HTTPS).

Routing: Content-based routing by URL path, host, headers, query strings
Protocols: HTTP, HTTPS, WebSockets, HTTP/2, gRPC
Use cases: Microservices, container-based apps, web applications
Features: Sticky sessions, authentication (Cognito, OIDC), Lambda targets, WAF integration

Example: Route /api/* to API servers, /images/* to image servers
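
Path-based routing amounts to checking the request path against an ordered list of rules and falling back to a default. The sketch below models that idea only; RULES, route, and the target group names are hypothetical, and real ALB listener rules also support host, header, and query-string conditions.

```python
from fnmatch import fnmatchcase

# Evaluated in order, like listener rule priorities.
RULES = [
    ("/api/*", "api-servers"),
    ("/images/*", "image-servers"),
]


def route(path: str, default: str = "default-servers") -> str:
    """Return the target group for a request path."""
    for pattern, target in RULES:
        if fnmatchcase(path, pattern):
            return target
    return default  # the listener's default action


if __name__ == "__main__":
    for path in ("/api/v1/users", "/images/logo.png", "/health"):
        print(path, "->", route(path))
```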

Network Load Balancer (NLB)

Operates at Layer 4 (TCP/UDP).

Performance: Handles millions of requests per second with extremely low latency
Protocols: TCP, UDP, TLS
Use cases: High-performance gaming, financial trading, IoT, real-time streaming
Features: Static IP addresses, Elastic IP support, preserves source IP

Classic Load Balancer (CLB)

Operates at Layer 4 and Layer 7 (legacy).

Status: Legacy – AWS recommends migrating to ALB or NLB
Protocols: HTTP, HTTPS, TCP, SSL
Limitation: Fewer features; no target groups, so no path- or host-based routing and no dynamic port mapping for containers

Comparison

Feature	ALB	NLB	CLB
OSI Layer	7	4	4/7
Protocols	HTTP/HTTPS/gRPC	TCP/UDP/TLS	HTTP/HTTPS/TCP/SSL
Latency	Low	Ultra-low	Medium
Static IP	No	Yes	No
WebSockets	Yes	Yes	Limited
Path routing	Yes	No	No

When to Use Which

ALB: Most web applications, microservices, REST APIs, gRPC
NLB: Ultra-high performance, TCP/UDP apps, Static IP requirement, gaming
CLB: Avoid for new workloads – migrate to ALB or NLB

Medium Senior Level AWS

What is the difference between SQS, SNS, and EventBridge in AWS?

SQS, SNS, and EventBridge are all AWS messaging services but serve different purposes and communication patterns.

Amazon SQS (Simple Queue Service)

SQS is a point-to-point message queue for decoupling distributed systems.

Pattern: Producer → Queue → Consumer (pull-based)
Delivery: At-least-once delivery; a message stays in the queue until a consumer deletes it or the retention period expires
Use cases: Task queues, background job processing, load leveling
Types: Standard (best-effort ordering) and FIFO (exactly-once processing, ordered)

Example: Order service puts messages in SQS; fulfillment service processes them at its own pace.

Amazon SNS (Simple Notification Service)

SNS is a publish-subscribe (pub/sub) messaging service.

Pattern: Publisher → Topic → Multiple Subscribers (push-based)
Delivery: Fan-out to multiple endpoints simultaneously
Subscribers: SQS queues, Lambda functions, HTTP endpoints, email, SMS
Use cases: Fan-out notifications, alert broadcasting, mobile push

Example: Payment event publishes to SNS; billing, analytics, and email services all receive it simultaneously.

Amazon EventBridge

EventBridge is a serverless event bus for event-driven architectures.

Pattern: Event Source → Event Bus → Rules → Targets (content-based routing)
Delivery: Route events based on content/patterns
Sources: AWS services, custom apps, SaaS applications (Salesforce, Zendesk, etc.)
Use cases: Event-driven architectures, microservice decoupling, AWS service integration
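
Content-based routing means a rule's pattern is matched against the fields of each event. The sketch below covers only exact-value matching, where each pattern field lists the allowed values; EventBridge patterns also support operators such as prefix and numeric matching, which are omitted here. The matches function and the sample event are hypothetical.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Return True if every field in `pattern` is satisfied by `event`."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:  # `expected` is a list of allowed values
            return False
    return True


if __name__ == "__main__":
    pattern = {"source": ["app.orders"], "detail": {"status": ["paid", "refunded"]}}
    event = {"source": "app.orders", "detail-type": "OrderUpdated",
             "detail": {"status": "paid", "total": 42}}
    print(matches(pattern, event))
```

Fields that the pattern does not mention (detail-type and total here) are ignored, so rules can stay narrow while events grow.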

Comparison

Feature	SQS	SNS	EventBridge
Pattern	Queue	Pub/Sub	Event Bus
Consumers	One per message (competing consumers)	Multiple	Multiple
Routing	None (single queue)	All subscribers, optional filter policies	Content-based rules
SaaS integration	No	No	Yes
Schema registry	No	No	Yes

When to Use Which

SQS: Decouple services, handle burst traffic, ensure reliable processing
SNS: Broadcast to multiple services simultaneously
EventBridge: Complex routing, AWS service events, third-party SaaS integration
SNS + SQS: Combined fan-out with reliable processing per subscriber

Medium Senior Level Kubernetes

What is a Kubernetes Operator and when should you build one?

A Kubernetes Operator is a method of packaging, deploying, and managing a Kubernetes application using custom resources and controllers that encode operational domain knowledge.

The Operator Pattern

Operators extend Kubernetes to automate the management of complex stateful applications. They use Custom Resource Definitions (CRDs) to define new resource types and a controller to watch those resources and reconcile the actual state with the desired state.

How Operators Work

Define a CRD (e.g., PostgresCluster)
User creates a CR (Custom Resource) instance
The Operator’s controller detects the new CR
Controller takes actions to create/configure the application
Controller continuously monitors and reconciles state
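
The last two steps are the heart of the pattern: reconciliation is level-triggered and idempotent, so running it again with nothing changed does nothing, and running it after drift repairs the drift. A toy model, assuming a hypothetical PostgresCluster whose only spec field is a replica count (Cluster and reconcile are illustrative names, not a real operator API):

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Stand-in for the observed state: the set of running pod names."""
    pods: set[str] = field(default_factory=set)


def reconcile(spec_replicas: int, cluster: Cluster, name: str = "pg") -> bool:
    """Move observed state toward the spec; return True if anything changed."""
    wanted = {f"{name}-{i}" for i in range(spec_replicas)}
    if wanted == cluster.pods:
        return False  # already converged: nothing to do
    cluster.pods = wanted  # a real controller would create/delete pods here
    return True


if __name__ == "__main__":
    cluster = Cluster()
    print(reconcile(3, cluster), sorted(cluster.pods))
    print(reconcile(3, cluster))      # second pass is a no-op
    cluster.pods.discard("pg-1")      # simulate a lost pod
    print(reconcile(3, cluster), sorted(cluster.pods))
```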

Real-World Operator Examples

Prometheus Operator: Manages Prometheus, Alertmanager, and related monitoring components
cert-manager: Automates TLS certificate provisioning and renewal
Strimzi: Manages Apache Kafka clusters on Kubernetes
CloudNativePG: Manages PostgreSQL clusters
ArgoCD: GitOps continuous delivery tool with its own CRDs

When to Build an Operator

Build an Operator when:

Your application has complex operational knowledge (e.g., database failover, backup/restore)
You need to manage stateful workloads with domain-specific logic
You want to automate Day-2 operations (upgrades, scaling, recovery)
Standard Kubernetes primitives are insufficient

When NOT to Build an Operator

Stateless applications that Deployments handle well
Simple configuration management (use ConfigMaps/Helm)
When an existing operator already solves your problem

Operator Development Tools

Operator SDK: Part of the Operator Framework (originally from Red Hat); supports Go, Ansible, and Helm operators
Kubebuilder: Kubernetes SIG framework for building operators in Go
Metacontroller: Simplifies operator development with webhooks

Medium Senior Level Kubernetes

What is etcd and what role does it play in a Kubernetes cluster?

etcd is a distributed, reliable key-value store that serves as Kubernetes’ primary datastore for all cluster state and configuration data.

Role in Kubernetes

etcd is the single source of truth for a Kubernetes cluster. Every object you create (pods, services, configmaps, secrets, etc.) is stored in etcd. The API server reads and writes to etcd for all cluster state.

Key Characteristics

Distributed Consensus

etcd uses the Raft consensus algorithm to ensure data consistency across multiple etcd instances. A cluster typically runs 3 or 5 etcd nodes to achieve fault tolerance.

Watch Mechanism

Only the API server talks to etcd directly. It builds on etcd's watch API to expose its own watch endpoints, which Kubernetes components use to get notified of changes. For example, the scheduler watches for unscheduled pods and the controller manager watches for deployment changes.

Strong Consistency

etcd provides linearizable reads and writes, ensuring all clients see the same data at the same time.

What’s Stored in etcd

All Kubernetes objects (Pods, Deployments, Services, etc.)
Cluster configuration
RBAC policies
Secrets (encrypted at rest if configured)
Node information

etcd in Production

Backup Strategy

# Create etcd snapshot
ETCDCTL_API=3 etcdctl snapshot save snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

High Availability

Run an odd number of members (3, 5, 7); an even count adds no extra fault tolerance
A cluster of 3 tolerates 1 failure
A cluster of 5 tolerates 2 failures
Use dedicated SSDs for low latency
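
The failure-tolerance numbers follow from Raft's majority rule: a cluster of n members needs a quorum of floor(n/2) + 1 to make progress. A short sketch (function names are illustrative):

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a quorum is still available."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"members={n} quorum={quorum(n)} tolerates={fault_tolerance(n)}")
```

Going from 3 to 4 members raises the quorum from 2 to 3 without tolerating any additional failures, which is why odd sizes are preferred.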

Why etcd Performance Matters

etcd latency directly impacts API server response time. Slow etcd = slow kubectl, slow deployments, and cluster instability. Always monitor etcd disk I/O and latency metrics.

Medium Senior Level Kubernetes

What is the difference between a Kubernetes Job and a CronJob?

Kubernetes Jobs and CronJobs are workload resources for running tasks to completion rather than running continuously like Deployments.

Kubernetes Job

A Job creates one or more pods and ensures a specified number of them successfully terminate. Once the required completions are reached, the Job is complete.

Use Cases

Database migrations
Batch data processing
One-time setup tasks
Report generation

Example Job

apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration
spec:
  completions: 1
  parallelism: 1
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: migrate
        image: myapp:latest
        command: ["python", "manage.py", "migrate"]

Job Patterns

Non-parallel: Single pod runs to completion
Parallel with fixed count: Multiple pods, each does a portion
Parallel with work queue: Pods process items from a queue

CronJob

A CronJob creates Jobs on a repeating schedule using standard Unix cron syntax.

Use Cases

Nightly database backups
Hourly report generation
Periodic cleanup tasks
Scheduled ETL pipelines

Example CronJob

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
spec:
  schedule: "0 2 * * *"  # 2 AM every day
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: backup
            image: backup-tool:latest
            command: ["/backup.sh"]

Key Differences

Feature	Job	CronJob
Trigger	Manual/one-time	Scheduled (cron)
Recurrence	Runs once	Repeats on schedule
Use case	Ad-hoc tasks	Recurring tasks
Creates	Pods directly	Jobs (which create pods)

Medium Senior Level Kubernetes

What are Init Containers in Kubernetes and what problems do they solve?

Init Containers are specialized containers that run and complete before the main application containers start in a pod.

How Init Containers Work

Init containers run sequentially: each must complete successfully before the next one starts, and all must succeed before the app containers start. If an init container fails, the kubelet restarts it until it succeeds; if the pod's restartPolicy is Never, the pod is marked as failed instead.

Problems They Solve

1. Dependency Waiting

Wait for a service to be ready before the app starts:

initContainers:
- name: wait-for-db
  image: busybox
  command: ['sh', '-c', 'until nc -z db-service 5432; do sleep 2; done']
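
When the init image ships a language runtime instead of nc, the same wait loop can be written as a TCP connect with retries. A minimal sketch: wait_for_port is a hypothetical helper, and unlike the shell loop above it gives up after a bounded number of attempts so the init container fails visibly instead of hanging forever.

```python
import socket
import time


def wait_for_port(host: str, port: int, interval: float = 2.0,
                  attempts: int = 30) -> bool:
    """Return True once a TCP connection to host:port succeeds."""
    for _ in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused, timed out, or DNS not resolvable yet: retry later.
            time.sleep(interval)
    return False


if __name__ == "__main__":
    # "db-service" and 5432 mirror the manifest above.
    ok = wait_for_port("db-service", 5432)
    raise SystemExit(0 if ok else 1)
```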

2. Pre-initialization Tasks

Clone a Git repository into a shared volume
Download configuration files from a remote source
Run database migrations before the app starts

3. Security Isolation

Run privileged setup tasks in an init container while the main container runs with minimal privileges.

4. Delay App Start

Wait for custom resources or CRDs to be registered before the app that uses them starts.

Init vs Sidecar Containers

Feature	Init Container	Sidecar Container
Lifecycle	Runs once and exits	Runs alongside main
Purpose	Setup/preparation	Supporting services
Parallel	Sequential	Parallel with main

Example

spec:
  initContainers:
  - name: init-myservice
    image: busybox:1.28  # pinned: nslookup is unreliable in some later busybox images
    command: ['sh', '-c', 'until nslookup myservice; do sleep 2; done']
  containers:
  - name: myapp
    image: myapp:latest

Medium Senior Level Kubernetes

What is a DaemonSet in Kubernetes and when would you use it?

A DaemonSet ensures that a copy of a pod runs on all (or specific) nodes in a Kubernetes cluster. When nodes are added to the cluster, the DaemonSet automatically schedules a pod on them.

How DaemonSets Work

Unlike Deployments which control a specific number of replicas, DaemonSets ensure one pod per matching node. When a node is removed, the pod is garbage collected.

Common Use Cases

Log collection agents: Fluentd, Filebeat – collect logs from every node
Monitoring agents: Prometheus Node Exporter, Datadog Agent – collect node metrics
Network plugins: CNI plugins like Calico, Flannel run as DaemonSets
Storage drivers: Ceph, GlusterFS storage daemons
Security agents: Falco, Sysdig for runtime security monitoring

Example DaemonSet

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      tolerations:
      # Lets the pod run on control-plane nodes too
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      containers:
      - name: node-exporter
        image: prom/node-exporter:latest
        ports:
        - containerPort: 9100

DaemonSet vs Deployment

Feature	DaemonSet	Deployment
Replicas	1 per node	Fixed count
Scaling	Auto with nodes	Manual/HPA
Use case	Node-level services	Stateless apps

Node Selection

Use nodeSelector or nodeAffinity to restrict a DaemonSet to specific nodes (e.g., only GPU nodes, only Linux nodes).

Medium Senior Level Kubernetes

What is Helm and how does it simplify Kubernetes application deployment?

Helm is the package manager for Kubernetes, making it easy to define, install, and upgrade complex Kubernetes applications.

Core Concepts

Charts

A Helm chart is a collection of files that describe Kubernetes resources. It contains:

templates/: Kubernetes manifests with Go template syntax
values.yaml: Default configuration values
Chart.yaml: Chart metadata (name, version, description)
charts/: Dependencies

Releases

When a chart is installed, a release is created. Multiple releases of the same chart can run in the same cluster with different configurations.

Repositories

Charts are stored in and shared via Helm repositories (e.g., the Bitnami repository) or OCI registries; Artifact Hub is a common place to discover them.

Common Commands

# Add a repository
helm repo add bitnami https://charts.bitnami.com/bitnami

# Search charts
helm search repo nginx

# Install a chart
helm install my-nginx bitnami/nginx -f custom-values.yaml

# Upgrade a release
helm upgrade my-nginx bitnami/nginx --set replicaCount=3

# Rollback
helm rollback my-nginx 1

# List releases
helm list -A

Benefits

Templating: Reuse manifests with different values per environment
Version management: Track chart versions and rollback easily
Dependency management: Bundle related charts together
Release lifecycle: Install, upgrade, rollback, uninstall with single commands

Helm 3 vs Helm 2

Helm 3 removed Tiller (the server-side component), making it more secure by using Kubernetes RBAC directly and storing release state as Kubernetes Secrets.

Medium Senior Level AWS

What is Amazon CloudFront?

Amazon CloudFront is a content delivery network (CDN) service that securely delivers data, videos, applications, and APIs to customers globally with low latency and high transfer speeds. It integrates with other AWS services and uses edge locations worldwide to cache and serve content closer to end users. CloudFront supports both static and dynamic content delivery, provides DDoS protection through AWS Shield, and offers SSL/TLS encryption. It can serve content from S3 buckets, EC2 instances, Elastic Load Balancers, or custom origins. CloudFront also supports Lambda@Edge for running code at edge locations.

Medium Senior Level System Design

What is Zero Trust Architecture and how does it apply to DevOps?

Zero Trust is a security model based on “never trust, always verify.” Traditional networks trusted everything inside the perimeter. Zero trust assumes the network is already compromised.

Zero Trust principles in DevOps:

Identity-based access: Every service authenticates. No implicit trust based on network location.
Least privilege: Minimal permissions for every identity, re-evaluated regularly.
Micro-segmentation: Kubernetes NetworkPolicies and service meshes with mTLS between every service.
Device trust: Verify developer machines with fleet management (Jamf, Intune) before allowing access to internal systems.
Continuous verification: Short-lived credentials. Re-authenticate frequently.

Hard Lead / Architect Level System Design

How do you implement security scanning in a GitHub Actions CI/CD pipeline?

A comprehensive security scanning pipeline:

jobs:
  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # SAST: static code analysis
      - name: Run Semgrep
        uses: returntocorp/semgrep-action@v1

      # Dependency scanning
      - name: Run Snyk
        uses: snyk/actions/node@master
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}

      # Container image scanning
      - name: Build image
        run: docker build -t myapp:${{ github.sha }} .
      - name: Run Trivy
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: myapp:${{ github.sha }}
          severity: CRITICAL,HIGH
          exit-code: 1

      # IaC scanning
      - name: Run tfsec
        uses: aquasecurity/tfsec-action@v1.0.0

Medium Senior Level System Design

What is a bastion host (jump server) and what are the modern alternatives?

A bastion host is a dedicated, hardened server in a public subnet used as the only entry point for SSH/RDP into private subnet resources. All access is logged and audited.

Modern, better alternatives:

AWS Systems Manager Session Manager: Shell access to EC2 instances through the AWS API over HTTPS, with no inbound port 22 and no SSH keys to manage. Sessions can be logged to CloudWatch/S3, and access is controlled with IAM.
Teleport: Open-source access platform with MFA, session recording, and role-based access for SSH, Kubernetes, databases, and web applications.
Tailscale / WireGuard: Private mesh VPN that avoids exposing any servers publicly. Tailscale is a managed, near zero-config layer built on WireGuard; plain WireGuard requires manual key and peer configuration.

Hard Lead / Architect Level System Design

How do you implement secrets rotation without downtime?

Secret rotation is a critical security practice. Zero-downtime rotation process:

Generate new secret without invalidating the old one (e.g., create a new DB user, or generate a new API key that coexists with the old one).
Update secret store (AWS Secrets Manager, Vault) with the new value.
Rotate applications: Applications use External Secrets Operator or Vault Agent to pick up new values. Configure TTL on cached secrets so they refresh within minutes.
Verify: Confirm all services are using the new secret.
Revoke old secret.

AWS Secrets Manager has native rotation with Lambda functions for RDS passwords. This can be fully automated.
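
The key property of the process is the overlap window in which both the old and the new credential are valid. The toy model below illustrates just that ordering; Backend, rotate, and revoke are hypothetical stand-ins for a database or API provider and a secret store, not the Secrets Manager or Vault API.

```python
import secrets


class Backend:
    """Stand-in for a service that accepts any currently valid credential."""

    def __init__(self, initial: str):
        self.valid = {initial}

    def accepts(self, credential: str) -> bool:
        return credential in self.valid


def rotate(backend: Backend, store: dict, key: str) -> str:
    """Issue a new credential alongside the old one and publish it."""
    old = store[key]
    new = secrets.token_urlsafe(32)
    backend.valid.add(new)  # step 1: new secret coexists with the old one
    store[key] = new        # step 2: secret store now serves the new value
    return old              # caller revokes it after verification


def revoke(backend: Backend, old: str) -> None:
    """Final step: invalidate the old credential."""
    backend.valid.discard(old)


if __name__ == "__main__":
    backend, store = Backend("old-secret"), {"db-password": "old-secret"}
    old = rotate(backend, store, "db-password")
    # During the overlap, clients with either value keep working.
    print(backend.accepts(old), backend.accepts(store["db-password"]))
    revoke(backend, old)
    print(backend.accepts(old), backend.accepts(store["db-password"]))
```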

Hard Lead / Architect Level System Design

How do you implement network segmentation for a microservices application?

Network segmentation limits the blast radius of a compromise. In a microservices context:

AWS: Security Groups + VPC design: Place services in private subnets. Use security groups to only allow traffic between services that need to communicate (e.g., allow port 5432 only from the API service to the database SG).
Kubernetes: NetworkPolicies: Default-deny all inter-pod traffic. Explicitly allow only required paths.
Service Mesh (Istio/Linkerd): Mutual TLS (mTLS) between all services — all communication is encrypted and authenticated at the network level. Zero-trust networking.

Medium Senior Level System Design

What is SAST vs DAST and where do they fit in a DevSecOps pipeline?

SAST (Static Application Security Testing): Analyzes source code without executing it. Runs early in CI (on every commit/PR). Tools: Semgrep, SonarQube, Bandit (Python), gosec (Go). Fast, no running application needed.

DAST (Dynamic Application Security Testing): Tests the running application by sending malicious inputs and analyzing responses. Runs against a deployed staging environment. Tools: OWASP ZAP, Burp Suite. Finds issues that only show up at runtime and that SAST can miss (e.g., authentication bypass, server misconfiguration, injection flaws reachable only through the deployed stack).

DevSecOps pipeline: SAST on PR → build image → Trivy scan → deploy to staging → DAST → promote to prod.

Easy Associate Level System Design

What is the principle of least privilege and why is it critical in DevOps?

The principle of least privilege (PoLP) states that any user, process, or service should only have the minimum permissions necessary to perform its function — nothing more.

In DevOps this applies to:

IAM roles: A Lambda function that reads from S3 should only have s3:GetObject on that specific bucket, not full S3 access.
Kubernetes RBAC: A deployment automation service account only needs update permissions on Deployments, not cluster-admin.
CI/CD tokens: A build token should be able to push to a registry but not manage IAM users.

Blast radius reduction: if credentials are compromised, least privilege limits what an attacker can do.
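
Least privilege is easier to keep when it is checked automatically. The sketch below flags wildcard actions and resources in an IAM policy document; overly_broad is a hypothetical helper, the bucket name is made up, and it deliberately ignores conditions and NotAction/NotResource, so treat it as a starting point rather than a full linter.

```python
def overly_broad(policy: dict) -> list[str]:
    """Return findings for Allow statements that use wildcards."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be an object
        statements = [statements]
    for i, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        for action in actions:
            if action == "*" or action.endswith(":*"):
                findings.append(f"statement {i}: wildcard action {action}")
        if "*" in resources:
            findings.append(f"statement {i}: wildcard resource")
    return findings


if __name__ == "__main__":
    policy = {"Version": "2012-10-17", "Statement": [
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::my-bucket/*"},
        {"Effect": "Allow", "Action": ["s3:*"], "Resource": "*"},
    ]}
    for finding in overly_broad(policy):
        print(finding)
```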

Easy Associate Level Linux

What is the difference between SSH key authentication and password authentication?

Password authentication: User provides a password. Vulnerable to brute-force attacks, password spraying, and phishing. Should be disabled for SSH in production.

SSH Key authentication: The client proves ownership of a private key without ever transmitting it. The server holds the public key in ~/.ssh/authorized_keys. Private key never leaves the client.

# Generate key pair
ssh-keygen -t ed25519 -C "anmol@devopsinterview.com"

# Copy public key to server
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@server

# Disable password auth in /etc/ssh/sshd_config
PasswordAuthentication no

Use ed25519 keys — they are faster and more secure than RSA 2048.

Medium Senior Level Linux

How do you use awk, sed, and grep together to parse log files?

These three tools form the backbone of Linux log analysis:

# grep: Filter lines containing "ERROR"
grep "ERROR" /var/log/app.log

# awk: Extract specific fields (e.g., column 3 of an NGINX access log)
awk '{print $3}' /var/log/nginx/access.log

# sed: Replace or transform text
sed 's/ERROR/CRITICAL/g' app.log

# Combined pipeline: Find ERROR lines, extract IP (field 1), count by IP
grep "ERROR" /var/log/nginx/access.log \
  | awk '{print $1}' \
  | sort \
  | uniq -c \
  | sort -rn \
  | head -10
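
When the pipeline grows beyond a one-liner, the same filter, extract, and count steps can be sketched in Python. This is a rough equivalent, not a replacement for the shell tools: top_client_ips and the sample lines are made up for illustration, and it assumes the client IP is the first whitespace-separated field.

```python
from collections import Counter


def top_client_ips(lines, needle="ERROR", limit=10):
    """Count the first field of every line containing `needle`, most frequent first."""
    counts = Counter(
        line.split()[0]
        for line in lines
        if needle in line and line.strip()
    )
    return counts.most_common(limit)


# Made-up sample lines standing in for a real log file.
sample = [
    "10.0.0.1 - - ERROR upstream timed out",
    "10.0.0.2 - - OK",
    "10.0.0.1 - - ERROR connection refused",
    "10.0.0.3 - - ERROR upstream timed out",
]
print(top_client_ips(sample))
```

For a real file, pass an open file handle as lines so the log is streamed instead of loaded into memory.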

Hard Lead / Architect Level Linux

Explain how the Linux kernel handles I/O with the page cache.

The Linux kernel uses the page cache to cache file data in RAM to speed up I/O. When you read a file, the kernel copies it into page cache. Subsequent reads are served from RAM (microseconds) instead of disk (milliseconds).

Writes are also cached: data is written to the page cache first and then persisted to disk asynchronously (write-back). This is why free -h shows very little “free” memory on a healthy server: most RAM is counted under buff/cache, which the kernel reclaims when applications need it (check the “available” column instead). This is not a memory leak.

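Relevant commands: vmstat, iostat, /proc/meminfo (Cached, Buffers, Dirty), echo 3 > /proc/sys/vm/drop_caches to drop clean page cache, dentries, and inodes (run sync first to write back dirty pages; avoid it in production because subsequent reads go back to disk).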

Medium Senior Level Linux

Write a Bash script to find and delete log files older than 30 days.

#!/bin/bash
# Delete log files older than 30 days in /var/log/myapp

LOG_DIR="/var/log/myapp"
DAYS=30
DRY_RUN=true  # Set to false to actually delete

if [ ! -d "$LOG_DIR" ]; then
    echo "Directory $LOG_DIR does not exist"
    exit 1
fi

if [ "$DRY_RUN" = true ]; then
    echo "Dry run: files that would be deleted:"
    find "$LOG_DIR" -type f -name "*.log" -mtime +"$DAYS" -print
else
    echo "Deleting log files older than $DAYS days..."
    find "$LOG_DIR" -type f -name "*.log" -mtime +"$DAYS" -delete
    echo "Done. Current disk usage:"
    df -h "$LOG_DIR"
fi

Always implement a dry run mode. Schedule this with cron or use logrotate for production systems.

Medium Senior Level Observability

What is log aggregation and how do you implement it with the ELK stack?

Log aggregation centralizes logs from all services into one searchable system. The ELK Stack:

Elasticsearch: Distributed search and analytics engine that indexes and stores logs.
Logstash: Data processing pipeline that ingests, transforms, and forwards logs.
Kibana: Web UI for searching, visualizing, and creating dashboards from Elasticsearch data.

Modern replacement: The EFK Stack uses Fluent Bit (lightweight, lower memory than Logstash) as a DaemonSet in Kubernetes to collect container logs and forward to Elasticsearch. Or use Loki (from Grafana Labs) for a simpler, cost-effective log aggregation layer.

Hard Lead / Architect Level Observability

What is distributed tracing and how do you implement it with OpenTelemetry?

In a microservices architecture, a single user request touches dozens of services. Distributed tracing follows that request across all services, recording timing and metadata at each step.

OpenTelemetry (OTel) is the vendor-neutral standard for instrumentation. Implementation:

Add the OTel SDK to each service.
Services automatically propagate a traceparent header in HTTP calls, linking all spans.
A collector (OTel Collector) receives spans and routes them to your backend (Jaeger, Zipkin, Tempo, Datadog).
You can now visualize the full request path, identify slow spans, and pinpoint errors.

Medium Senior Level Observability

How do you structure a Grafana dashboard for a production service?

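A well-structured production dashboard follows the RED methodology for request-driven services, or the USE methodology (Utilization, Saturation, Errors) for resources such as CPU, disks, and network: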

RED (for services):

Rate: Requests per second
Errors: Error rate (%)
Duration: Latency (p50, p90, p99)
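
As a rough illustration of what the three RED panels compute, the sketch below derives rate, error percentage, and latency percentiles from raw request samples. In a real dashboard these come from your metrics backend's queries; red_summary and its input shape are assumptions made for the example, and the percentile uses the simple nearest-rank method.

```python
import math


def red_summary(requests, window_seconds):
    """requests: list of (duration_ms, is_error) tuples observed in the window."""
    total = len(requests)
    if total == 0:
        return {"rate_rps": 0.0, "error_pct": 0.0,
                "p50_ms": None, "p90_ms": None, "p99_ms": None}
    durations = sorted(duration for duration, _ in requests)
    errors = sum(1 for _, is_error in requests if is_error)

    def percentile(p):
        # Nearest rank: smallest sample with at least p% of samples at or below it.
        rank = max(1, math.ceil(p * total / 100))
        return durations[rank - 1]

    return {
        "rate_rps": total / window_seconds,
        "error_pct": 100 * errors / total,
        "p50_ms": percentile(50),
        "p90_ms": percentile(90),
        "p99_ms": percentile(99),
    }


# 100 synthetic requests over 50 seconds; every tenth one is an error.
samples = [(i, i % 10 == 0) for i in range(1, 101)]
summary = red_summary(samples, window_seconds=50)
print(summary)
```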

Top-level layout: Start with an SLO summary panel so on-call knows immediately if SLO is being violated. Then drill-down panels: per-endpoint breakdown, error log links, infrastructure metrics (CPU, memory). Use variables for environment and service selection.

Hard Lead / Architect Level Observability

How do you avoid alert fatigue in a large-scale microservices environment?

Alert fatigue happens when teams receive too many alerts, many of which are noise. Engineers start ignoring them — including real critical ones.

Strategies to combat it:

Symptom-based alerting: Alert on user-facing symptoms (error rate, latency) not causes (CPU high). CPU high does not always mean users are impacted.
Actionable alerts only: Every alert must have a clear runbook. If there’s no action to take, it shouldn’t be an alert.
SLO-based alerting: Alert when you’re burning through your error budget too fast (burn-rate alerts).
Regular alert audits: Review and delete alerts that consistently fire without requiring action.
Severity tiers: P1 wakes someone up. P3 creates a ticket. Many alerts should be P3.
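
Alerting on error-budget burn can be sketched as a small calculation. Burn rate is the observed error ratio divided by the budget (1 minus the SLO target); a burn rate of 1 spends the budget exactly over the SLO period. The 14.4 threshold below is an assumption taken from the commonly cited case of spending 2% of a 30-day budget in one hour, and the two-window check is one possible design, not the only one.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'sustainable' the error budget is being spent."""
    budget = 1.0 - slo_target
    return error_ratio / budget


def should_page(long_window_ratio, short_window_ratio, slo_target, threshold=14.4):
    """Page only if both a long and a short window exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, which avoids paging for already-recovered blips.
    """
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)


# 99.9% SLO, 2% of requests failing in both windows: burn rate of about 20.
print(should_page(0.02, 0.02, slo_target=0.999))
```

Slower burns that stay below the paging threshold are good candidates for the ticket-level (P3) tier.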

Easy Associate Level AWS

What is the difference between horizontal and vertical scaling in AWS?

Vertical Scaling (Scale Up): Increase the size of an existing instance (e.g., t3.medium → c5.4xlarge). Simple but has a ceiling (there’s a maximum instance size). Requires downtime to resize EC2.

Horizontal Scaling (Scale Out): Add more instances behind a load balancer. No theoretical ceiling. Enables high availability and fault tolerance because traffic is spread across multiple instances in multiple AZs.

AWS Auto Scaling Groups with Application Load Balancers enable fully automated horizontal scaling based on metrics like CPU or custom CloudWatch metrics.

Medium Senior Level AWS

What is AWS CloudWatch and what are its main components?

CloudWatch is AWS’s native observability service with four main areas:

Metrics: Time-series data from AWS services (CPU, NetworkIn, etc.) and custom metrics you publish.
Logs: CloudWatch Logs for storing, searching, and analyzing log data from EC2, Lambda, ECS, etc.
Alarms: Alerts triggered when metrics exceed thresholds. Can trigger SNS, Auto Scaling, Lambda.
Dashboards: Visual widgets to display metrics across services in real-time.

For advanced analytics, ship logs to Amazon OpenSearch Service (or an ELK stack), or query them in place with CloudWatch Logs Insights and its purpose-built query language.

Hard Lead / Architect Level AWS

Explain AWS Lambda cold starts and how to mitigate them in production.

A cold start occurs when Lambda needs to initialize a new execution environment — download the code, start the runtime, run your initialization code. This adds 100ms-1s+ of latency on the first request.

Mitigation strategies:

Provisioned Concurrency: Pre-warm a set number of Lambda execution environments. Eliminates cold starts for warmed instances (at extra cost).
Minimize package size: Smaller deployment packages initialize faster.
Use faster runtimes: Node.js and Python cold start faster than Java/C#.
Move init code outside the handler: DB connections and SDK clients initialized at module level persist across invocations.
Lambda SnapStart (Java): AWS-managed snapshot of initialized execution environment.
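
The "init code outside the handler" point can be illustrated with a minimal Python handler. create_client is a stand-in for an expensive setup step such as opening a database connection or building an SDK client; it is a hypothetical placeholder, not a real AWS call.

```python
import time

INIT_CALLS = 0


def create_client():
    """Stand-in for expensive setup (DB connection, SDK client, config load)."""
    global INIT_CALLS
    INIT_CALLS += 1
    return {"created_at": time.time()}


# Module scope: runs once per execution environment, during the cold start.
CLIENT = create_client()


def handler(event, context):
    # Warm invocations reuse CLIENT instead of rebuilding it on every request.
    return {"client_created_at": CLIENT["created_at"], "echo": event}


first = handler({"n": 1}, None)
second = handler({"n": 2}, None)
print(INIT_CALLS)
```

Calling create_client() inside handler instead would pay the setup cost on every invocation, warm or cold.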

Medium Senior Level AWS

How do you reduce AWS costs in a cloud environment? What are your go-to strategies?

Cloud cost optimization is an ongoing practice. High-impact strategies:

Right-sizing: Use AWS Cost Explorer and Compute Optimizer to identify oversized EC2 instances.
Reserved Instances/Savings Plans: Commit to a 1-year or 3-year term for stable workloads; saves up to 72%.
Spot Instances: Use for stateless, fault-tolerant, or batch workloads. Up to 90% savings.
S3 Lifecycle policies: Auto-transition to cheaper storage tiers.
Delete idle resources: Audit unused EIPs, old snapshots, unattached EBS volumes.
Auto Scaling: Scale down to zero or minimum outside business hours.

Hard Lead / Architect Level AWS

How does IAM assume-role work and how do you implement cross-account access securely?

Cross-account access uses the sts:AssumeRole API. A role in Account B has a trust policy that allows Account A to assume it:

# Trust policy on role in Account B
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::ACCOUNT_A_ID:root"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Account A’s entity calls aws sts assume-role to get temporary credentials for Account B (1 hour by default, up to 12 hours if the role’s maximum session duration allows it). Security controls:

Add ExternalId condition for third-party access (prevents confused deputy attacks)
Add MFA condition for sensitive roles
Use SCPs at the AWS Organization level to restrict what can be assumed
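
The ExternalId and MFA controls are expressed as conditions on the trust policy. The sketch below builds such a policy as a plain dictionary so the shape is easy to see; build_trust_policy is a hypothetical helper, and the account ID and external ID are placeholder values. The condition keys sts:ExternalId and aws:MultiFactorAuthPresent are the standard AWS ones.

```python
import json


def build_trust_policy(trusted_account_id, external_id=None, require_mfa=False):
    """Return a trust policy letting `trusted_account_id` assume the role."""
    condition = {}
    if external_id is not None:
        # Third party must present the agreed ExternalId (confused deputy defense).
        condition["StringEquals"] = {"sts:ExternalId": external_id}
    if require_mfa:
        # Caller must have authenticated with MFA.
        condition["Bool"] = {"aws:MultiFactorAuthPresent": "true"}

    statement = {
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
        "Action": "sts:AssumeRole",
    }
    if condition:
        statement["Condition"] = condition
    return {"Version": "2012-10-17", "Statement": [statement]}


policy = build_trust_policy("111122223333", external_id="example-external-id")
print(json.dumps(policy, indent=2))
```

Narrowing the Principal from the account root to a specific role ARN in Account A tightens this further.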

Hard Lead / Architect Level AWS

How would you architect a highly available, multi-region AWS deployment?

Multi-region HA involves several layers:

DNS: Route53 with health checks and latency/failover routing policies to direct users to the nearest healthy region.
Data replication: RDS cross-Region read replicas with promotion capability (or Aurora Global Database). DynamoDB Global Tables for active-active.
Edge: CloudFront CDN with origins in multiple regions.
Infrastructure: Identical infrastructure in each region managed by Terraform.
DR strategy: Define RTO (Recovery Time Objective) and RPO (Recovery Point Objective) to determine your architecture (Pilot Light, Warm Standby, or Active-Active).

Medium Senior Level Terraform

Explain the Terraform resource lifecycle and meta-arguments like create_before_destroy.

The lifecycle block gives you fine-grained control over how Terraform manages resource replacement:

resource "aws_instance" "web" {
  ami           = "ami-12345"
  instance_type = "t3.medium"

  lifecycle {
    create_before_destroy = true  # New instance created before old one is destroyed
    ignore_changes = [ami]         # Ignore external AMI changes
    prevent_destroy = true         # Block accidental deletion
  }
}

create_before_destroy is critical for zero-downtime replacements. Without it, Terraform destroys the old resource first, creating a gap in availability.

Easy Associate Level Terraform

What is the purpose of terraform.tfvars files?

terraform.tfvars files provide values for your declared variables, keeping configuration separate from the variable definitions. This allows you to have different values per environment without modifying the core modules.

# variables.tf — defines the variable
variable "instance_type" {
  description = "EC2 instance type"
  type        = string
}

# production.tfvars — provides the value
instance_type = "c5.2xlarge"

# development.tfvars
instance_type = "t3.micro"

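Terraform loads terraform.tfvars and *.auto.tfvars automatically; files with other names, such as production.tfvars, are passed explicitly with -var-file=production.tfvars. Never commit .tfvars files containing sensitive values to Git. Use .gitignore and pass sensitive values via environment variables (TF_VAR_*) in CI/CD.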

Hard Lead / Architect Level Terraform

How do you implement Terraform in a CI/CD pipeline safely?

Running Terraform in CI/CD requires careful guardrails:

PR triggers plan: On every pull request, run terraform plan and post the output as a PR comment (using tools like Atlantis or terraform-pr-commenter).
Merge triggers apply: Only apply after PR is merged to main. Require manual approval for production.
State locking: Ensure the backend has state locking configured (for example, the S3 backend with a DynamoDB lock table) to prevent concurrent applies.
OIDC credentials: Use OIDC to get short-lived tokens from AWS instead of storing long-lived access keys.
Plan artifacts: Save the plan file and apply that exact file — never re-plan at apply time.

Medium Senior Level Terraform

What are Terraform data sources and how do they differ from resources?

A resource creates, updates, or destroys infrastructure. A data source reads existing infrastructure that is managed outside of your current Terraform code — it is read-only.

# Data source — reads an existing VPC by tag, does not create it
data "aws_vpc" "main" {
  tags = {
    Environment = "production"
  }
}

# Use the data source output
resource "aws_subnet" "app" {
  vpc_id = data.aws_vpc.main.id
  ...
}

Data sources are essential for referencing shared infrastructure managed by a different team or Terraform root module.

Hard Lead / Architect Level Terraform

What is Terraform state drift and how do you handle it?

State drift occurs when the real infrastructure differs from what Terraform state believes it to be — typically due to manual changes made in the AWS console or another tool.

Detection: terraform plan (or terraform plan -refresh-only) shows changes that you did not make in code.

Resolution options:

Import: terraform import to import manually created resources into state.
Refresh: terraform apply -refresh-only to update state to match reality (the standalone terraform refresh command is deprecated).
Accept drift: Use lifecycle { ignore_changes = [...] } for intentionally externally-managed attributes.

Prevention: Restrict manual write access to production environments with read-only IAM roles and AWS Organizations SCPs, so that changes go through Terraform.

Easy Associate Level Terraform

What does terraform plan do and why should you always review it before applying?

terraform plan creates an execution plan — a preview of what Terraform will do before it actually makes changes. It shows additions, modifications, and destructions.

Always review the plan because:

It may show unexpected destructions (e.g., a stateful database being replaced instead of modified)
It catches misconfiguration before real infrastructure is affected
In a CI/CD pipeline, save the plan output and apply that exact plan in the next step to ensure consistency

terraform plan -out=tfplan
terraform apply tfplan

Medium Senior Level CI/CD

How do you handle database migrations in a CI/CD pipeline without downtime?

Database migrations are one of the riskiest parts of deployment. The golden rule: migrations must be backward-compatible because during a rolling deploy, old code and new code run simultaneously.

Safe migration checklist:

Never: Rename or drop a column in the same deploy as the code change that depends on it; old code still running during the rollout will break.
Step 1: Add new column (nullable, backward-compatible).
Step 2: Deploy code that writes to both old and new columns.
Step 3: Migrate existing data.
Step 4: Deploy code using only the new column.
Step 5: Drop the old column.
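
The expand-and-contract steps above can be walked through against an in-memory SQLite database. This is a toy sketch: the users table and the fullname to display_name rename are invented for illustration, and in a real system each step ships as a separate deploy rather than one script.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")

# Step 1: add the new column as nullable, so old code that ignores it keeps working.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


# Step 2: new code writes to both the old and the new column.
def create_user(name):
    conn.execute(
        "INSERT INTO users (fullname, display_name) VALUES (?, ?)", (name, name)
    )


create_user("Grace Hopper")

# Step 3: backfill rows that were written before the dual-write deploy.
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")

# Step 4: code now reads only the new column.
names = [row[0] for row in conn.execute("SELECT display_name FROM users ORDER BY id")]
print(names)

# Step 5 (dropping fullname) happens in a later deploy, once no running
# code reads or writes the old column.
```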

Medium Senior Level CI/CD

What is the purpose of a staging environment and what tests should run there?

Staging is a production-mirror environment used to catch bugs that only appear with real data, full infrastructure, and realistic load — things unit tests can’t surface. Tests to run in staging:

Integration tests: Real database connections, real API calls to third parties.
E2E tests: Cypress, Playwright, or Selenium to simulate real user journeys.
Smoke tests: Quick sanity checks that critical paths work after deployment.
Performance tests: Load tests with k6 or Locust to catch regressions.

Hard Lead / Architect Level CI/CD

How do you implement a multi-environment deployment pipeline (dev → staging → prod)?

A professional multi-environment pipeline uses gates between stages:

Build once: A single immutable artifact (Docker image with SHA tag) is promoted — never rebuilt.
Deploy to Dev: Automatic on every merge to main.
Deploy to Staging: Automatic after dev health checks pass. Run integration and smoke tests.
Deploy to Prod: Manual approval gate + scheduled deployment window.

The key is that the same image moves through all environments. This ensures what you tested in staging is exactly what runs in production.

Easy Associate Level CI/CD

What is a pipeline artifact and what are common examples?

A pipeline artifact is any file produced by a CI/CD job that needs to be passed to downstream jobs or stored for later use.

Common examples:

Compiled binary or JAR file (Java/Go)
Built Docker image pushed to a registry
Frontend build output (dist/ or build/ folder)
Test reports and coverage reports
SBOM (Software Bill of Materials) files
Terraform plan output

Medium Senior Level CI/CD

How do you speed up slow CI pipelines?

Slow pipelines kill developer productivity. Key optimizations:

Caching: Cache dependencies (node_modules, pip packages, Go modules) between runs.
Parallelism: Split test suites and run jobs in parallel.
Test selection: Only run tests affected by the changed code.
Optimized Docker builds: Use layer caching and BuildKit.
Self-hosted runners: Eliminate queue time and use faster hardware.
Fail fast: Run linting and unit tests first; integration tests only if those pass.

Medium Senior Level CI/CD

What is GitOps and how does it differ from traditional CI/CD?

Traditional CI/CD: The pipeline has credentials and directly pushes deployments to environments (push-based).

GitOps: Git is the single source of truth for the desired state of your infrastructure and applications. An agent running in the cluster (like ArgoCD or Flux) continuously reconciles the actual state with the desired state in Git (pull-based).

Benefits of GitOps: Drift detection, audit trail in Git history, easy rollback (git revert), no cluster credentials needed in CI.

Hard Lead / Architect Level CI/CD

How do you structure a mono-repo CI/CD pipeline to avoid unnecessary builds?

In a monorepo with 20+ services, you must only trigger builds for services that actually changed. Strategies:

Path filters: GitHub Actions paths: filter to trigger workflows only when specific directories change.
Nx / Turborepo: Task runners with build graph awareness that skip unchanged services.
git diff: Compare changed files against the base branch and only build affected services.

# GitHub Actions path filter
on:
  push:
    paths:
      - "services/api/**"
      - "shared/lib/**"
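
The git diff strategy usually boils down to mapping changed paths onto services. A minimal sketch, assuming the list of changed files comes from something like git diff --name-only against the base branch; affected_services and the services/web and services/worker directories are hypothetical names added for the example.

```python
def affected_services(changed_files, service_deps):
    """service_deps maps a service name to the path prefixes it is built from."""
    affected = set()
    for service, prefixes in service_deps.items():
        if any(path.startswith(prefix)
               for path in changed_files
               for prefix in prefixes):
            affected.add(service)
    return sorted(affected)


# Shared code is listed under every service that depends on it, so a change
# there rebuilds all of its consumers.
deps = {
    "api": ["services/api/", "shared/lib/"],
    "web": ["services/web/", "shared/lib/"],
    "worker": ["services/worker/"],
}

print(affected_services(["shared/lib/util.go"], deps))
```

The output list can then drive a dynamic build matrix, so only the listed services get a build job.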

Medium Senior Level CI/CD

How do you implement automated rollback in a deployment pipeline?

Automated rollback is triggered when post-deployment health checks fail. A robust implementation:

Health check gate: After deployment, poll the health endpoint for 2-3 minutes.
Metric thresholds: Monitor error rate and p99 latency for 5 minutes post-deploy.
Rollback trigger: If error rate exceeds a threshold, automatically re-deploy the previous image tag.

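# Generic shell rollback logic
# deploy, health_check_passes, and alert_pagerduty are placeholders for
# functions or scripts defined elsewhere in your pipeline.
NEW_VERSION="v2.0"
PREV_VERSION="v1.9"

deploy "$NEW_VERSION"
if ! health_check_passes; then
  echo "Rollback triggered" >&2
  deploy "$PREV_VERSION"
  alert_pagerduty "Automatic rollback executed"
  exit 1  # Mark the pipeline run as failed
fi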
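
The polling part of the health check gate can be sketched as a small function. This is one possible shape under stated assumptions: deploy and health_check are callables you supply (hypothetical here), and the new version counts as good if it becomes healthy within a fixed number of polls; otherwise the previous version is redeployed.

```python
import time


def deploy_with_rollback(deploy, health_check, new_version, prev_version,
                         attempts=5, interval_seconds=30.0, sleep=time.sleep):
    """Deploy new_version; roll back if it never becomes healthy.

    Returns the version that is live when the function finishes.
    """
    deploy(new_version)
    for _ in range(attempts):
        if health_check():
            return new_version
        sleep(interval_seconds)
    # Never became healthy within the polling window: restore the old version.
    deploy(prev_version)
    return prev_version


# Simulated run: the new version never passes its health check.
deployed = []
live = deploy_with_rollback(
    deploy=deployed.append,
    health_check=lambda: False,
    new_version="v2.0",
    prev_version="v1.9",
    attempts=3,
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(live, deployed)
```

Injecting sleep as a parameter keeps the function easy to test; a production version would also compare error rate and p99 latency against thresholds before declaring success.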

Easy Associate Level CI/CD

Why do you use branch protection rules in a CI/CD workflow?

Branch protection rules on the main or production branch enforce quality gates before any code is merged:

Require pull request reviews (at least 1-2 approvals)
Require status checks to pass (CI build, tests, linting)
Require branches to be up to date before merging
Prevent force pushes and branch deletion

This ensures no untested or unreviewed code ever reaches production, which is the foundation of a trustworthy deployment pipeline.

Medium Senior Level CI/CD

What is the difference between a Blue/Green deployment and a Canary deployment?

Blue/Green: You maintain two identical environments. “Blue” is live, “Green” has the new version. You switch all traffic from Blue to Green at once. Rollback is instant — just switch back. Downside: doubles infrastructure cost.

Canary: You gradually shift traffic from the old version to the new one — e.g., 5% → 25% → 50% → 100%. You analyze metrics and errors at each stage. Slower but safer for catching issues that only appear under real production load.

Hard Lead / Architect Level CI/CD

How do you secure a CI/CD pipeline from supply chain attacks?

Supply chain attacks (like SolarWinds, XZ Utils) target the build pipeline itself. Defense layers:

Pin action versions: Use the full commit SHA, not floating tags like @v2. uses: actions/checkout@<full-commit-sha>
SBOM generation: Generate a Software Bill of Materials at build time using Syft.
Image signing: Sign images with Cosign (Sigstore). Verify signatures before deployment.
Least privilege: GitHub Actions tokens should have minimal permissions. Set a restrictive default such as permissions: contents: read, and grant write scopes per job only where needed.
Dependency review: Use Dependabot or Renovate for automated dependency updates.

Medium Senior Level CI/CD

How do you implement secret management in a GitHub Actions pipeline?

Never hardcode secrets in your pipeline files. GitHub Actions provides an encrypted Secrets store:

Go to Repository Settings → Secrets and Variables → Actions → New Repository Secret.
Reference in your workflow: ${{ secrets.MY_SECRET }}

- name: Deploy to AWS
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  run: aws s3 sync ./dist s3://my-bucket

For more advanced use cases, use OIDC to get short-lived tokens from AWS/GCP instead of storing static credentials.

Easy Associate Level CI/CD

What is the difference between Continuous Integration, Continuous Delivery, and Continuous Deployment?

Continuous Integration (CI): Developers merge code frequently (multiple times a day). Every merge triggers an automated build and test run to catch integration issues early.

Continuous Delivery (CD): Every passing build is automatically prepared for release to production. A human approves the final deployment step.

Continuous Deployment: Extends Delivery — every passing build is automatically deployed to production with no human intervention.

Easy Associate Level Docker

What is Docker Compose and when would you use it?

Docker Compose is a tool for defining and running multi-container applications using a YAML file. It is ideal for local development and testing where you need to spin up interdependent services (app + database + cache) with a single command.

docker compose up -d

It handles networking (all services in the same file can reach each other by service name), volume management, and environment variables. For production orchestration, use Kubernetes instead.

Medium Senior Level Docker

How do Docker volumes differ from bind mounts?

Docker Volumes are managed by Docker, stored in /var/lib/docker/volumes/, and are the recommended way to persist data. They are portable, easy to back up, and work well with Docker Compose.

Bind Mounts map a specific host path directly into the container. They are useful in development to sync source code in real-time but are host-dependent and harder to manage in production.

# Volume (recommended for production)
docker run -v mydata:/app/data myapp

# Bind mount (recommended for development)
docker run -v $(pwd)/src:/app/src myapp

Hard Lead / Architect Level Docker

How do you scan Docker images for vulnerabilities in a CI/CD pipeline?

Image scanning should be a mandatory gate before pushing to production. Tools and integration steps:

Trivy (Aqua): Fast, comprehensive, easy CI integration. trivy image myapp:latest
Snyk: Deep dependency scanning with developer-friendly output.
Docker Scout: Built into Docker Hub.
Grype: From Anchore, works well with SBOM workflows.

# GitHub Actions example
- name: Scan image with Trivy
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: myapp:${{ github.sha }}
    severity: CRITICAL,HIGH
    exit-code: 1  # Fail the pipeline on CRITICAL or HIGH findings

Medium Senior Level Docker

Explain Docker layer caching and how it impacts build speed.

Docker builds images layer by layer. If a layer hasn’t changed since the last build, Docker reuses the cached version. The trick is layer ordering:

Bad: COPY all files first, then run npm install. Any code change invalidates the npm install cache.

Good: COPY package.json first, run npm install, then COPY the rest of the source. Dependency installation only re-runs when package.json changes.

# Optimized layer order
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build

Medium Senior Level Kubernetes

What is the purpose of a PodDisruptionBudget (PDB) in Kubernetes?

A PodDisruptionBudget limits how many pods of a deployment can be unavailable simultaneously during voluntary disruptions like node drains, cluster upgrades, or scaling down.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api

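Without a PDB, draining several nodes during a cluster upgrade could evict every replica at once and take down your entire service. With minAvailable: 2, Kubernetes refuses voluntary evictions that would leave fewer than 2 healthy pods. A PDB does not protect against involuntary disruptions such as node crashes.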

Hard Lead / Architect Level Kubernetes

Explain Kubernetes network policies and how you would isolate a production namespace.

By default, all pods in a Kubernetes cluster can communicate with each other freely. NetworkPolicies are namespace-scoped firewall rules that control which pods can talk to which. They are only enforced if the cluster’s CNI plugin supports them (for example Calico or Cilium).

To enforce full isolation on a namespace, start by denying all ingress and egress, then selectively allow only what’s needed:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

Then add specific allow rules for your database, monitoring agents, and DNS (port 53).

Medium Senior Level Kubernetes

How does the Kubernetes Horizontal Pod Autoscaler (HPA) work?

HPA automatically scales the number of pod replicas based on observed metrics. The default metric is CPU utilization, but it also supports memory and custom metrics via the Metrics API.

kubectl autoscale deployment my-app --min=2 --max=10 --cpu-percent=60

The HPA controller checks metrics every 15 seconds (default) and adjusts replicas to maintain the target. For custom metrics, you can integrate tools like KEDA (Kubernetes Event-Driven Autoscaling) which can scale based on Kafka lag, SQS queue depth, and more.
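
The scaling decision itself follows the documented formula desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). The sketch below reproduces that arithmetic, including the default 10% tolerance and the min/max clamp; desired_replicas is an illustrative function, not the controller's actual code, and it ignores details such as stabilization windows and unready pods.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas, tolerance=0.1):
    """Approximate the HPA replica calculation for a single metric."""
    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Never go outside the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# 4 pods averaging 90% CPU against a 60% target: scale to 6.
print(desired_replicas(4, 90, 60, min_replicas=2, max_replicas=10))
```

With the kubectl command above (target 60%, min 2, max 10), the same function shows why 8 pods at 120% CPU stop at 10 rather than 16.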

Hard Lead / Architect Level Kubernetes

How do you manage secrets securely in Kubernetes? What are the alternatives to plain Kubernetes Secrets?

Kubernetes Secrets are base64-encoded, not encrypted by default. For production, consider these approaches:

Encryption at Rest: Enable EncryptionConfiguration to encrypt secrets in etcd.
External Secrets Operator: Syncs secrets from AWS Secrets Manager, GCP Secret Manager, or HashiCorp Vault into Kubernetes Secrets automatically.
HashiCorp Vault Agent Injector: Injects secrets directly into Pod filesystems without storing them in Kubernetes at all.
Sealed Secrets: Encrypts secrets client-side so they are safe to commit to Git.
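
The point that base64 is encoding, not encryption, is easy to demonstrate: anyone who can read the Secret object can reverse it without a key. The value below is a made-up example.

```python
import base64

# What ends up in the Secret manifest's data field.
encoded = base64.b64encode(b"s3cr3t-password").decode("ascii")

# Reversing it needs no key, only the standard library.
decoded = base64.b64decode(encoded).decode("utf-8")

print(encoded, "->", decoded)
```

This is why RBAC on Secrets and encryption at rest in etcd matter even when the values look scrambled.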

Medium Senior Level Kubernetes

How do services in different namespaces communicate in Kubernetes?

All services in a Kubernetes cluster are reachable via DNS using the Fully Qualified Domain Name (FQDN):

<service-name>.<namespace>.svc.cluster.local

For example, a service named postgres in the production namespace is reachable at postgres.production.svc.cluster.local from any pod in any namespace. If NetworkPolicies are in place, you must explicitly allow cross-namespace traffic.

Medium Senior Level Kubernetes

What is the difference between a StatefulSet and a Deployment?

Use a Deployment for stateless workloads (web servers, APIs) where any Pod is interchangeable. Use a StatefulSet for stateful workloads like databases that need:

Stable, predictable network identities (pod-0, pod-1, etc.)
Ordered, graceful deployment and scaling
Stable persistent storage linked to each pod individually

Common examples: Kafka, ZooKeeper, Cassandra, PostgreSQL replicas.

Medium Senior Level Kubernetes

How do you perform a zero-downtime rolling update in Kubernetes?

Kubernetes Deployments support RollingUpdate strategy by default. The key is configuring maxSurge and maxUnavailable correctly alongside working readiness probes.

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0

With maxUnavailable: 0, Kubernetes will not take down an old Pod until its replacement is Ready (as determined by its readiness probe). Combined with graceful shutdown handling (SIGTERM and draining in-flight requests), this allows rollouts without dropped traffic.

Easy Associate Level Kubernetes

What is the difference between a Pod and a Deployment in Kubernetes?

A Pod is the smallest deployable unit in Kubernetes — it wraps one or more containers that share the same network and storage. However, Pods on their own are ephemeral.

A Deployment is a higher-level abstraction that manages Pods. It ensures a specified number of Pod replicas are running at all times, handles rolling updates, and allows rollbacks. You almost never create bare Pods in production; you use Deployments instead.

kubectl create deployment nginx --image=nginx:1.25 --replicas=3

Easy Associate Level Kubernetes

Explain the role of ‘Sidecar’ containers in Kubernetes pod architecture.

A sidecar container is a secondary container that runs along with the main application container within the same pod. It is used to extend and enhance the functionality of the main container, such as by providing logging, monitoring, or proxy services.

Medium Senior Level Kubernetes

What is a ‘StatefulSet’ and when should you use it over a ‘Deployment’ in Kubernetes?

A StatefulSet is used for stateful applications that require unique, persistent identities and stable network identifiers. Unlike Deployments, which are for stateless pods, StatefulSets manage pods that are not interchangeable and have sticky identities.

Hard Lead / Architect Level Kubernetes

How do you implement Zero-Downtime deployments with Kubernetes Service objects?

Discuss RollingUpdate strategies, readiness probes, and the role of Service selectors in traffic routing during a rollout.

Troubleshooting Scenarios

Live system debugging, incident diagnostics, and latency resolution.

Medium Senior Level AWS

What is AWS Route 53 and how do you implement DNS failover?

Amazon Route 53 is a scalable and highly available DNS web service that routes end users to internet applications and supports domain registration.

Key Features

DNS Resolution

Route 53 translates domain names (example.com) into IP addresses. It supports all standard DNS record types: A, AAAA, CNAME, MX, TXT, NS, SOA, and Route 53-specific alias records.

Routing Policies

Simple: Route traffic to a single resource
Weighted: Split traffic by percentage between resources (A/B testing, gradual rollouts)
Latency: Route to the region with lowest network latency
Geolocation: Route based on user’s geographic location
Geoproximity: Route based on geographic location with configurable bias
Failover: Active-passive failover routing
Multivalue Answer: Responds with up to 8 healthy records

Implementing DNS Failover

Active-Passive Failover Setup

Create Health Checks

Configure health checks for your primary endpoint (HTTP/HTTPS/TCP)
Set the request interval and failure threshold

Create Primary Record

   Type: A
   Routing Policy: Failover
   Failover Type: Primary
   Value: [primary endpoint IP]
   Health Check: my-primary-health-check
   TTL: 60

Create Secondary Record

   Type: A
   Routing Policy: Failover
   Failover Type: Secondary
   Value: [backup IP or S3 static site]
   TTL: 60

Failover Behavior

If primary health check fails, Route 53 routes to secondary
When primary recovers, traffic automatically returns

Active-Active Failover

Use Weighted routing with health checks:

Both endpoints active with equal weight (50/50)
Route 53 automatically removes unhealthy endpoints
Traffic redistributes to healthy endpoints

Multi-Region Failover Pattern

Route 53 (Failover routing with health checks)
├── us-east-1 ALB (Primary)
│   └── Auto Scaling Group
└── eu-west-1 ALB (Failover)
    └── Auto Scaling Group

Health Check Types

Endpoint health checks: HTTP/HTTPS/TCP checks on IP or domain
Calculated health checks: Combine results of multiple health checks
CloudWatch alarm health checks: Based on CloudWatch alarm state

Medium Senior Level Linux

How do you troubleshoot disk space issues on a Linux server?

Systematic disk investigation:

# Step 1: Check overall disk usage
df -h

# Step 2: Find which directory is consuming space
du -sh /* 2>/dev/null | sort -rh | head -20

# Step 3: Drill down into the problem directory
du -sh /var/* | sort -rh | head -10

# Step 4: Find specific large files
find / -type f -size +500M 2>/dev/null

# Step 5: Check for deleted-but-open files still consuming disk space
lsof | grep deleted

Common causes: application logs not rotating, large core dumps, MySQL binary logs or Postgres WAL piling up, old Docker images/volumes.

Medium Senior Level Linux

How do you troubleshoot high CPU usage on a Linux server?

Systematic CPU investigation:

top / htop: Identify the process consuming CPU. Note: is it user space or kernel (%us vs %sy)?
ps aux --sort=-%cpu: Snapshot of top CPU consumers.
perf top: See which functions (user space and kernel) are hot.
strace -p <PID>: Trace system calls to understand what a process is doing.
vmstat 1: Observe context switches (cs) and interrupts (in).

Common causes: runaway application bug, CPU-intensive query (full table scan), kernel work from high I/O (softirqs), insufficient CPU for the workload.
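
To make the user-versus-kernel split concrete, the sketch below computes the CPU breakdown from two samples of the aggregate "cpu" line in /proc/stat, which is where tools like top get these numbers. The function names and the sample lines are hypothetical; on a real host you would read /proc/stat twice, a second or so apart.

```python
from typing import Dict

# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line: str) -> Dict[str, int]:
    """Parse the aggregate 'cpu' line into named jiffy counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_breakdown(before: str, after: str) -> Dict[str, float]:
    """Percentage of CPU time spent in each state between two samples."""
    first, second = parse_cpu_line(before), parse_cpu_line(after)
    delta = {name: second[name] - first[name] for name in first}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return {name: 100.0 * ticks / total for name, ticks in delta.items()}


if __name__ == "__main__":
    # Made-up samples for illustration only.
    sample_1 = "cpu  1000 0 500 8000 100 0 50 0 0 0"
    sample_2 = "cpu  1600 0 700 8150 110 0 90 0 0 0"
    for state, pct in cpu_breakdown(sample_1, sample_2).items():
        print(f"{state:8s} {pct:5.1f}%")
```

A high user share points at application code, while a high system or softirq share points at kernel work such as heavy I/O or networking.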

Hard Lead / Architect Level Observability

How do you implement on-call rotation and incident response in an SRE team?

A mature on-call process has these elements:

Schedules: PagerDuty or OpsGenie for rotating on-call assignments with escalations.
Runbooks: Every alert links to a runbook with investigation steps and common resolutions.
Severity levels: P1 (major outage, wake anyone up) → P4 (low impact, business hours only).
Incident channels: Dedicated Slack channel per incident. Assign Incident Commander, Communications Lead roles.
Postmortems: Blameless postmortem for every P1/P2. Focus on system improvements, not blaming individuals.
On-call health: Track toil. If engineers are getting paged more than 2-3 times per shift, the alert quality needs improvement.
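
Tracking on-call health can start as simply as counting pages per shift. The sketch below assumes pages are exported as (shift_id, alert_name) pairs; that format, the function names, and the default threshold of 3 (taken from the upper end of the 2-3 guideline above) are illustrative assumptions, not a PagerDuty or OpsGenie API.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Page = Tuple[str, str]  # (shift_id, alert_name)


def noisy_shifts(pages: Iterable[Page], max_pages_per_shift: int = 3) -> Dict[str, int]:
    """Return shifts whose page count exceeds the threshold."""
    counts = Counter(shift_id for shift_id, _alert in pages)
    return {shift: n for shift, n in counts.items() if n > max_pages_per_shift}


def noisiest_alerts(pages: Iterable[Page], top_n: int = 3) -> List[Tuple[str, int]]:
    """Return the alerts that paged most often, as candidates for tuning."""
    return Counter(alert for _shift, alert in pages).most_common(top_n)


if __name__ == "__main__":
    pages = [
        ("2024-W01", "DiskFull"), ("2024-W01", "DiskFull"),
        ("2024-W01", "HighLatency"), ("2024-W01", "DiskFull"),
        ("2024-W02", "HighLatency"),
    ]
    print(noisy_shifts(pages))
    print(noisiest_alerts(pages, top_n=1))
```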

Easy Associate Level Observability

What is an error budget and how do SRE teams use it?

An error budget is the allowable amount of unreliability in a service, derived from the SLO. If your SLO is 99.9% uptime, your error budget is 0.1%, or about 43 minutes of downtime in a 30-day month.
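
The arithmetic behind the 43-minute figure can be checked with a few lines. This is a small sketch; error_budget_minutes is a hypothetical helper, and the 30-day window is an assumption that should match whatever window your SLO is defined over.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.99):
        print(f"SLO {slo}% -> {error_budget_minutes(slo):.1f} minutes per 30 days")
```

For a 99.9% SLO this gives roughly 43.2 minutes per 30 days; each additional nine cuts the budget by a factor of ten.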

How teams use it:

When error budget is healthy → deploy freely, take risks, ship features.
When error budget is low → slow down deployments, prioritize reliability work.
When budget is exhausted → freeze all non-critical deployments until reliability improves.
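
One way to turn these rules into something a deploy pipeline can check is sketched below for a request-based SLO. The function names, the policy labels, and the 25% "low" threshold are hypothetical; real teams agree on their own thresholds in an error budget policy.

```python
def budget_remaining(slo_percent: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed_failures = total_requests * (100.0 - slo_percent) / 100.0
    if allowed_failures == 0:
        return 0.0  # no traffic or a 100% SLO leaves no budget to spend
    return 1.0 - failed_requests / allowed_failures


def deployment_policy(remaining: float, low_threshold: float = 0.25) -> str:
    """Map remaining budget to one of the three behaviours described above."""
    if remaining <= 0:
        return "freeze"   # budget exhausted: only critical changes ship
    if remaining < low_threshold:
        return "slow"     # budget low: prioritise reliability work
    return "deploy"       # budget healthy: ship features


if __name__ == "__main__":
    for failed in (200, 900, 1200):
        left = budget_remaining(99.9, total_requests=1_000_000, failed_requests=failed)
        print(f"{failed} failures -> {left:.0%} budget left -> {deployment_policy(left)}")
```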

Error budgets create a shared language between product (wants to ship) and SRE (wants reliability). They make the decision to ship or slow down objective rather than political.

Medium Senior Level Docker

What are dangling Docker images and how do you clean them up?

Dangling images are image layers that have no tag and are not referenced by any tagged image; they appear as <none>:<none> in docker images. They accumulate when a rebuild reuses an existing tag, which leaves the previous image untagged, and they waste disk space.

# List dangling images
docker images -f dangling=true

# Remove all dangling images
docker image prune

# Nuclear option: remove stopped containers plus all unused images, networks, build cache, and volumes
docker system prune -a --volumes

On long-lived CI/CD agents, run docker system prune -f as a post-step to keep disks clean. It also clears dangling build cache, so weigh the saved space against slower subsequent builds.

Hard Lead / Architect Level Kubernetes

How do you troubleshoot high memory usage causing OOMKilled events in production?

When a container exceeds its memory limit, the kernel OOM killer terminates it and Kubernetes records the termination reason as OOMKilled. Steps to resolve:

Identify: kubectl describe pod <pod> — look for Reason: OOMKilled in Last State.
Profile: Use kubectl top pod (requires metrics-server) or Prometheus/Grafana to understand actual memory usage patterns.
Fix: Either increase limits if the app genuinely needs more memory, or find and fix the memory leak in the application code.
Prevent: Set up PrometheusRule or Datadog alerts to notify before a pod hits its limit.
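
The "Prevent" step boils down to comparing current usage against the configured limit. The sketch below shows that comparison in isolation; parse_memory handles only a subset of Kubernetes quantity suffixes, and the function names and the 90% threshold are illustrative assumptions rather than part of any alerting tool.

```python
# Subset of Kubernetes memory quantity suffixes (binary and decimal).
UNITS = {
    "Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3, "Ti": 1024 ** 4,
    "k": 1000, "M": 1000 ** 2, "G": 1000 ** 3, "T": 1000 ** 4,
}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '256Mi' or '1G' to bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[:-len(suffix)]) * factor)
    return int(quantity)  # plain number of bytes


def near_limit(usage_bytes: int, limit: str, threshold: float = 0.9) -> bool:
    """True when usage has reached the given fraction of the memory limit."""
    return usage_bytes / parse_memory(limit) >= threshold


if __name__ == "__main__":
    usage = 250 * 1024 ** 2  # hypothetical working set of 250 MiB
    print(near_limit(usage, "256Mi"))   # above 90% of the limit
    print(near_limit(usage, "512Mi"))   # comfortably below
```

In practice the same ratio is usually expressed as an alert rule over container memory metrics, so that the team is notified before the kernel steps in.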

Easy Associate Level Kubernetes

What are resource requests and limits in Kubernetes, and why are they important?

Requests tell the Kubernetes scheduler how much CPU/memory to reserve for a pod when scheduling it onto a node. Limits are the hard caps — the container is throttled (CPU) or killed (memory) if it exceeds them.

resources:
  requests:
    memory: "128Mi"
    cpu: "250m"
  limits:
    memory: "256Mi"
    cpu: "500m"

Always set both. Without requests, the scheduler cannot make good placement decisions. Without limits, a runaway container can starve other workloads on the same node (the “noisy neighbor” problem).
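
To show why requests matter for placement, the sketch below checks whether a new pod's requests fit into what is left on a node. It is a heavy simplification of the scheduler, which applies many more filters; the function names and the node size are hypothetical, and only a few quantity formats are parsed.

```python
from typing import Dict, List

MEMORY_UNITS = {"Ki": 2 ** 10, "Mi": 2 ** 20, "Gi": 2 ** 30}


def cpu_millicores(quantity: str) -> int:
    """Convert '250m' or '1' (cores) to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def memory_bytes(quantity: str) -> int:
    """Convert '128Mi' style quantities to bytes (binary suffixes only)."""
    for suffix, factor in MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * factor
    return int(quantity)


def fits_on_node(allocatable: Dict[str, str],
                 scheduled: List[Dict[str, str]],
                 new_pod: Dict[str, str]) -> bool:
    """True if the sum of requests, including the new pod, fits on the node."""
    cpu = sum(cpu_millicores(r["cpu"]) for r in scheduled) + cpu_millicores(new_pod["cpu"])
    mem = sum(memory_bytes(r["memory"]) for r in scheduled) + memory_bytes(new_pod["memory"])
    return cpu <= cpu_millicores(allocatable["cpu"]) and mem <= memory_bytes(allocatable["memory"])


if __name__ == "__main__":
    node = {"cpu": "1", "memory": "512Mi"}          # hypothetical small node
    request = {"cpu": "250m", "memory": "128Mi"}    # requests from the example above
    print(fits_on_node(node, [request] * 3, request))  # fourth pod still fits
    print(fits_on_node(node, [request] * 4, request))  # fifth pod does not
```

Note that the check uses requests, not limits or live usage: a node can look idle in monitoring and still be "full" from the scheduler's point of view.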

Hard Lead / Architect Level Kubernetes

How do you debug a pod stuck in CrashLoopBackOff?

CrashLoopBackOff means the container starts but repeatedly crashes, and Kubernetes waits an increasing back-off delay before each restart. Use this systematic approach:

Check logs: kubectl logs <pod> --previous to see the crash output.
Describe the pod: kubectl describe pod <pod> to inspect Events, resource limits, and probe failures.
Check OOM: If you see OOMKilled, the container exceeded its memory limit.
Shell override: Override the entrypoint to keep the container alive for inspection: command: ["sleep", "3600"]
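
While working through these steps it helps to know how long the kubelet waits between restart attempts. The sketch below models the exponential back-off described in the Kubernetes documentation (10s, 20s, 40s, and so on, capped at five minutes); restart_delays is a hypothetical helper, and exact timings can vary with version and configuration.

```python
from typing import List


def restart_delays(restarts: int, base_seconds: int = 10, cap_seconds: int = 300) -> List[int]:
    """Delay before each restart attempt, doubling until the cap is reached."""
    return [min(base_seconds * 2 ** attempt, cap_seconds) for attempt in range(restarts)]


if __name__ == "__main__":
    delays = restart_delays(7)
    print(delays)       # back-off doubles, then stays at the cap
    print(sum(delays))  # total seconds spent waiting across seven restarts
```

This is why a fix can appear to take several minutes to show up: the pod may still be waiting out the capped delay, and deleting the pod is a common way to retry immediately.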
