System Design Interview Questions

Master System Design with these real-world interview questions and answers.

Beginner Questions

Core concepts, syntax, and foundational command-line knowledge.

Easy Associate Level System Design
Q:

What is the difference between authentication and authorization?

Authentication (AuthN): Verifying the identity of a user or service. “Who are you?” Authentication happens first — you prove your identity with a password, token, certificate, or biometric.

Authorization (AuthZ): Determining what an authenticated identity is allowed to do. “What can you do?” Authorization happens after authentication — once we know who you are, we check your permissions.

Example in AWS: You authenticate to AWS with your access key (AuthN). Then AWS checks your IAM policies to see if you’re authorized to call s3:PutObject (AuthZ). Both can fail independently.
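
The two checks can be sketched as separate functions. This is a minimal illustration, not how AWS implements it: the key table, the permission table, and the function names are all hypothetical.

```python
# Hypothetical in-memory data; a real system would use an identity provider
# and a policy engine instead of dictionaries.
API_KEYS = {"key-alice": "alice", "key-bob": "bob"}
PERMISSIONS = {
    "alice": {"s3:PutObject", "s3:GetObject"},
    "bob": {"s3:GetObject"},
}


def authenticate(api_key):
    """AuthN: map a credential to an identity, or None if it is unknown."""
    return API_KEYS.get(api_key)


def authorize(identity, action):
    """AuthZ: check whether an authenticated identity may perform the action."""
    return action in PERMISSIONS.get(identity, set())


def handle_request(api_key, action):
    identity = authenticate(api_key)
    if identity is None:
        return 401  # authentication failed: we do not know who this is
    if not authorize(identity, action):
        return 403  # authenticated, but not allowed to do this
    return 200
```

Here `handle_request("key-bob", "s3:PutObject")` passes authentication but fails authorization, while an unknown key fails before permissions are ever consulted.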

Easy Associate Level System Design
Q:

What is multi-factor authentication (MFA) and why should it be enforced for cloud accounts?

MFA requires two or more verification factors from different categories: something you know (password), something you have (TOTP app, hardware key), or something you are (biometric). If a password is compromised, the attacker still needs the second factor, which stops most password-only attacks such as credential stuffing.
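
To make the "something you have" factor concrete, here is a minimal sketch of how a TOTP app derives its code, following RFC 6238 with HMAC-SHA1. The function name and defaults are illustrative; production systems should rely on a vetted library rather than this sketch.

```python
import hashlib
import hmac
import struct
import time


def totp(secret, at_time=None, step=30, digits=6):
    """Compute an RFC 6238 TOTP code (HMAC-SHA1) for the given Unix time."""
    if at_time is None:
        at_time = time.time()
    counter = int(at_time) // step  # number of time steps since the epoch
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation, as defined in RFC 4226
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)
```

The server and the device share `secret` and each compute the code independently, so a stolen password alone is not enough to log in.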

For AWS/cloud accounts:

  • Enable MFA on the root account immediately, and avoid using the root account for routine work
  • Require MFA for IAM users with an IAM policy condition on aws:MultiFactorAuthPresent
  • Use hardware MFA keys (YubiKey) for privileged accounts
  • In AWS Organizations, use SCPs to deny API calls across member accounts unless MFA is present
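
A policy condition of this kind can be sketched as follows. The policy is built as a Python dictionary and printed as JSON; the list of exempted actions is an illustrative subset (an assumption, not a complete list) that lets a user enroll an MFA device before the deny takes effect.

```python
import json

# Self-service actions that stay allowed so a user can set up MFA.
# Illustrative subset only; adjust for your environment.
MFA_SETUP_ACTIONS = [
    "iam:CreateVirtualMFADevice",
    "iam:EnableMFADevice",
    "iam:ListMFADevices",
    "iam:ResyncMFADevice",
    "sts:GetSessionToken",
]


def deny_without_mfa_policy():
    """Build an IAM policy that denies everything else when MFA is absent."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAllExceptMfaSetupWithoutMfa",
                "Effect": "Deny",
                "NotAction": MFA_SETUP_ACTIONS,
                "Resource": "*",
                "Condition": {
                    "BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}
                },
            }
        ],
    }


print(json.dumps(deny_without_mfa_policy(), indent=2))
```

`BoolIfExists` is used so the deny also applies when the MFA key is missing from the request context, which is the case for long-term access keys.
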

Easy Associate Level System Design
Q:

What is TLS/SSL and why is it important for DevOps engineers to understand it?

TLS (Transport Layer Security) encrypts communication between clients and servers, preventing eavesdropping and man-in-the-middle attacks. It replaced the deprecated SSL protocol.

DevOps engineers encounter TLS in:

  • Configuring HTTPS for web services (Let’s Encrypt, ACM in AWS)
  • Kubernetes Ingress TLS termination
  • mTLS between microservices (Istio, Linkerd)
  • Certificate rotation — expired certs cause outages
  • Internal PKI for service-to-service auth

Automate certificate renewal with cert-manager in Kubernetes or AWS Certificate Manager. Never rely on manual renewal, and alert on certificates that are close to expiry.
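
As a complement to automated renewal, an expiry check can be sketched with the Python standard library. The function names and the idea of running this as a monitoring probe are assumptions for illustration.

```python
import socket
import ssl
import time


def fetch_not_after(host, port=443, timeout=5.0):
    """Return the peer certificate's notAfter string for a live endpoint."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now=None):
    """Whole days until the certificate expires (negative if already expired)."""
    if now is None:
        now = time.time()
    return int((ssl.cert_time_to_seconds(not_after) - now) // 86400)
```

A probe could call `days_until_expiry(fetch_not_after("example.com"))` and raise an alert when the result drops below a threshold such as 14 days; both the hostname and the threshold are placeholders.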

Intermediate Questions

Infrastructure management, deployment strategies, and delivery flows.

Medium Senior Level System Design
Q:

What is a WAF and when should you use AWS WAF vs Cloudflare?

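A Web Application Firewall (WAF) filters and monitors HTTP traffic to protect against common application-layer attacks: SQL injection, XSS, HTTP floods (layer 7 DDoS), and bad bots. Volumetric network-layer DDoS is handled by separate services such as AWS Shield or the CDN's own network.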

AWS WAF: Tight integration with CloudFront, ALB, API Gateway. Managed rule groups for OWASP, AWS managed rules. Good if you’re AWS-native. Can use IP reputation lists and rate-limiting rules.

Cloudflare: Operates at the DNS/edge level before traffic reaches AWS. Better DDoS mitigation due to Cloudflare’s massive global network. Simpler setup. Bot management is more mature.

In practice: Use Cloudflare as the outer layer for DDoS and global edge, then AWS WAF at the ALB for application-layer filtering. Defense in depth.
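
The rate-limiting rules mentioned above can be illustrated with a simplified per-IP sliding-window counter. This is a sketch of the general idea only; the class name is hypothetical, and real WAF products count and block requests with their own semantics.

```python
from collections import defaultdict, deque


class RateBasedRule:
    """Allow at most `limit` requests per client IP within a trailing window."""

    def __init__(self, limit, window_seconds=300):
        self.limit = limit
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client IP -> timestamps of allowed requests

    def allow(self, client_ip, now):
        hits = self._hits[client_ip]
        # Drop timestamps that have fallen out of the trailing window.
        while hits and hits[0] <= now - self.window_seconds:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the limit: block this request
        hits.append(now)
        return True
```

In this simplification, blocked requests are not counted toward the window, so a client is admitted again as soon as its older requests age out.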

Medium Senior Level System Design
Q:

What is a CVE, and how do you track and remediate vulnerabilities in your infrastructure?

A CVE (Common Vulnerabilities and Exposures) is a public identifier for a known security vulnerability. Severity is scored separately with CVSS (0.0 to 10.0), and most CVEs have a CVSS score published in databases such as the NVD.

Tracking and remediation workflow:

  1. Discovery: Continuous scanning with Trivy/Snyk in CI for container images, Dependabot for code dependencies, and Amazon Inspector for EC2.
  2. Triage: Not all CVEs require immediate action. Prioritize by CVSS score, exploitability, and whether the vulnerable code path is actually used.
  3. Remediation: Update base image, update dependency, or apply vendor patch.
  4. Tracking: Log CVEs in your ticketing system with SLA (e.g., Critical = 24h, High = 7 days).
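
The triage and tracking steps above can be sketched as a mapping from CVSS score to severity and remediation deadline. The Critical and High SLAs come from the text; the Medium and Low values are assumptions for illustration. This uses the CVSS score alone, whereas real triage would also weigh exploitability and reachability.

```python
from datetime import datetime, timedelta

# Critical and High follow the SLAs above; Medium and Low are assumed values.
REMEDIATION_SLA = {
    "CRITICAL": timedelta(hours=24),
    "HIGH": timedelta(days=7),
    "MEDIUM": timedelta(days=30),
    "LOW": timedelta(days=90),
}


def severity_from_cvss(score):
    """Map a CVSS v3.x base score to its qualitative severity rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("CVSS score must be between 0.0 and 10.0")
    if score >= 9.0:
        return "CRITICAL"
    if score >= 7.0:
        return "HIGH"
    if score >= 4.0:
        return "MEDIUM"
    if score > 0.0:
        return "LOW"
    return "NONE"


def remediation_deadline(score, discovered_at):
    """Return (severity, deadline); deadline is None when no SLA applies."""
    severity = severity_from_cvss(score)
    sla = REMEDIATION_SLA.get(severity)
    return severity, (discovered_at + sla if sla else None)
```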

Advanced Questions

Enterprise orchestration, deep architectural concepts, and scaling issues.

Hard Lead / Architect Level System Design
Q:

Explain the OWASP Top 10 and which items are most relevant to DevOps engineers.

The OWASP Top 10 is a periodically updated list of the most critical web application security risks. Using the 2021 numbering, the items most relevant to DevOps are:

  • A01: Broken Access Control. Enforce least privilege in IAM and K8s RBAC. Verify RBAC policies in code review.
  • A05: Security Misconfiguration. Public S3 buckets, default credentials, exposed management ports. Caught by infrastructure scanning tools like Checkov and tfsec.
  • A06: Vulnerable and Outdated Components. Use Dependabot and Trivy to catch outdated dependencies with known CVEs.
  • A09: Security Logging and Monitoring Failures. Ensure CloudTrail, K8s audit logs, and application audit logs are enabled and shipped to a SIEM.

Real Production Scenarios

Real-world architecture, system migration, and design challenges.

Hard Lead / Architect Level System Design
Q:

How do you implement security scanning in a GitHub Actions CI/CD pipeline?

A comprehensive security scanning pipeline:

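name: security-scan  # hypothetical workflow name
on: [push, pull_request]  # assumed triggers; adjust to your branching model

jobs:
  security:
    runs-on: ubuntu-latest
    steps:
      # Check out the repository so the scanners below can read it
      - uses: actions/checkout@v4

      # SAST: static code analysis
      - name: Run Semgrep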
        uses: returntocorp/semgrep-action@v1

      # Dependency scanning
      - name: Run Snyk
        uses: snyk/actions/node@master
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}

      # Container image scanning
      - name: Build image
        run: docker build -t myapp:${{ github.sha }} .
      - name: Run Trivy
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: myapp:${{ github.sha }}
          severity: CRITICAL,HIGH
          exit-code: 1

      # IaC scanning
      - name: Run tfsec
        uses: aquasecurity/tfsec-action@v1.0.0

Medium Senior Level System Design
Q:

What is Zero Trust Architecture and how does it apply to DevOps?

Zero Trust is a security model based on “never trust, always verify.” Traditional perimeter security trusted everything inside the network. Zero Trust assumes the network is already compromised, so every request is authenticated and authorized regardless of where it originates.

Zero Trust principles in DevOps:

  • Identity-based access: Every service authenticates. No implicit trust based on network location.
  • Least privilege: Minimal permissions for every identity, re-evaluated regularly.
  • Micro-segmentation: Kubernetes NetworkPolicies and service meshes with mTLS between every service.
  • Device trust: Verify developer machines with fleet management (Jamf, Intune) before allowing access to internal systems.
  • Continuous verification: Short-lived credentials. Re-authenticate frequently.
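
The short-lived credential idea can be sketched with an HMAC-signed token that carries its own expiry. The token format, key, and function names are hypothetical; real deployments would typically use an established mechanism such as OIDC tokens or cloud STS credentials.

```python
import base64
import hashlib
import hmac
import json
import time

SIGNING_KEY = b"demo-key-do-not-use"  # hypothetical key, for illustration only


def issue_token(identity, ttl_seconds=300, now=None):
    """Issue a signed token that expires ttl_seconds after `now`."""
    now = time.time() if now is None else now
    payload = json.dumps({"sub": identity, "exp": now + ttl_seconds}).encode()
    signature = hmac.new(SIGNING_KEY, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode()
            + "." + base64.urlsafe_b64encode(signature).decode())


def verify_token(token, now=None):
    """Return the identity if the token is authentic and unexpired, else None."""
    now = time.time() if now is None else now
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:
        return None  # malformed token
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None  # signature mismatch: forged or tampered
    claims = json.loads(payload)
    if now >= claims["exp"]:
        return None  # expired: the caller must re-authenticate
    return claims["sub"]
```

Because every request is verified and tokens expire after minutes, a leaked credential has a short useful lifetime.
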

Hard Lead / Architect Level System Design
Q:

How do you implement secrets rotation without downtime?

Secret rotation is a critical security practice. Zero-downtime rotation process:

  1. Generate new secret without invalidating the old one (e.g., create a new DB user, or generate a new API key that coexists with the old one).
  2. Update secret store (AWS Secrets Manager, Vault) with the new value.
  3. Rotate applications: Applications use External Secrets Operator or Vault Agent to pick up new values. Configure TTL on cached secrets so they refresh within minutes.
  4. Verify: Confirm all services are using the new secret.
  5. Revoke old secret.

AWS Secrets Manager supports native rotation through Lambda functions, including for RDS credentials, so the whole process can be fully automated.
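
The key to zero downtime is the overlap window in which both secrets are valid. A minimal sketch of that idea, using a hypothetical in-memory verifier rather than a real secret store:

```python
import hmac


class RotatingSecret:
    """Hypothetical verifier that accepts two secrets during a rotation window."""

    def __init__(self, initial):
        self.current = initial
        self.previous = None

    def rotate(self, new_secret):
        # Steps 1-2: introduce the new secret while the old one stays valid.
        self.previous = self.current
        self.current = new_secret

    def is_valid(self, presented):
        # Steps 3-4: clients on either secret keep working during rollout.
        candidates = [s for s in (self.current, self.previous) if s is not None]
        return any(
            hmac.compare_digest(presented.encode(), s.encode()) for s in candidates
        )

    def revoke_previous(self):
        # Step 5: once all clients use the new secret, drop the old one.
        self.previous = None
```

Clients still holding the old value keep authenticating until `revoke_previous()` is called, so no request fails while applications pick up the new secret.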

Medium Senior Level System Design
Q:

What is a bastion host (jump server) and what are the modern alternatives?

A bastion host is a dedicated, hardened server in a public subnet used as the only entry point for SSH/RDP into private subnet resources. All access is logged and audited.

Modern, better alternatives:

  • AWS Systems Manager Session Manager: Shell access to EC2 over HTTPS through the SSM agent and the AWS API, without SSH keys. No inbound ports (including port 22) need to be open. Sessions can be logged to CloudWatch/S3. IAM-controlled access.
  • Teleport: Open-source access platform with MFA, session recording, and role-based access for SSH, Kubernetes, databases, and web applications.
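  • Tailscale / WireGuard: A private VPN mesh that avoids exposing any servers publicly. Tailscale is a near zero-config service built on WireGuard; plain WireGuard requires manual key and peer configuration.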

Hard Lead / Architect Level System Design
Q:

How do you implement network segmentation for a microservices application?

Network segmentation limits the blast radius of a compromise. In a microservices context:

  1. AWS: Security Groups + VPC design: Place services in private subnets. Use security groups to only allow traffic between services that need to communicate (e.g., allow port 5432 only from the API service to the database SG).
  2. Kubernetes: NetworkPolicies: Default-deny all inter-pod traffic. Explicitly allow only required paths.
  3. Service Mesh (Istio/Linkerd): Mutual TLS (mTLS) between all services. The mesh's proxies encrypt traffic and authenticate both ends of every connection, so identity no longer depends on network location. Zero-trust networking.
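
The Kubernetes default-deny pattern from step 2 can be sketched by generating the two NetworkPolicy manifests as JSON, which `kubectl apply -f` accepts alongside YAML. The namespace, policy names, and `app` labels are assumptions for illustration.

```python
import json


def default_deny_ingress(namespace):
    """Select every pod in the namespace and allow no ingress traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress"]},
    }


def allow_ingress(namespace, name, to_app, from_app, port):
    """Allow TCP traffic on one port from pods labeled from_app to to_app."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": to_app}},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [{"podSelector": {"matchLabels": {"app": from_app}}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


policies = [
    default_deny_ingress("shop"),
    allow_ingress("shop", "api-to-db", to_app="db", from_app="api", port=5432),
]
print(json.dumps(policies[1], indent=2))
```

NetworkPolicies are additive, so the deny-all policy sets the baseline and each allow policy opens exactly one path. They are only enforced if the cluster's CNI plugin supports them.
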

Easy Associate Level System Design
Q:

What is the principle of least privilege and why is it critical in DevOps?

The principle of least privilege (PoLP) states that any user, process, or service should only have the minimum permissions necessary to perform its function — nothing more.

In DevOps this applies to:

  • IAM roles: A Lambda function that reads from S3 should only have s3:GetObject on that specific bucket, not full S3 access.
  • Kubernetes RBAC: A deployment automation service account only needs update permissions on Deployments, not cluster-admin.
  • CI/CD tokens: A build token should be able to push to a registry but not manage IAM users.

Blast radius reduction: if credentials are compromised, least privilege limits what an attacker can do.
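
The Lambda example above can be sketched as an IAM policy scoped to one action on one bucket, plus a simple check for wildcard grants. The bucket name and helper names are hypothetical, and the check is deliberately naive.

```python
def s3_read_only_policy(bucket):
    """IAM policy allowing only s3:GetObject on objects in a single bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            }
        ],
    }


def has_wildcard_grant(policy):
    """Naive check: flag Allow statements with a wildcard action or resource."""
    for statement in policy["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        actions = statement["Action"]
        resources = statement["Resource"]
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        if any(a == "*" or a.endswith(":*") for a in actions) or "*" in resources:
            return True
    return False
```

A check like `has_wildcard_grant` could run in code review or CI to catch policies such as `s3:*` on `*` before they are deployed.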

Medium Senior Level System Design
Q:

What is SAST vs DAST and where do they fit in a DevSecOps pipeline?

SAST (Static Application Security Testing): Analyzes source code without executing it. Runs early in CI (on every commit/PR). Tools: Semgrep, SonarQube, Bandit (Python), gosec (Go). Fast, no running application needed.

DAST (Dynamic Application Security Testing): Tests the running application by sending malicious inputs and analyzing responses. Runs against a deployed staging environment. Tools: OWASP ZAP, Burp Suite. Finds issues that only show up at runtime and that SAST tends to miss, such as authentication bypass, server misconfiguration, and injection flaws that are exploitable in the deployed application.

DevSecOps pipeline: SAST on PR → build image → Trivy scan → deploy to staging → DAST → promote to prod.
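
To show what "analyzes source code without executing it" means, here is a toy static check built on Python's `ast` module. It flags direct calls to `eval` and `exec`; the rule set and function name are illustrative, and real SAST tools such as Semgrep or Bandit apply far richer rules.

```python
import ast

DANGEROUS_CALLS = {"eval", "exec"}  # illustrative rule set


def find_dangerous_calls(source):
    """Return sorted (line, name) pairs for direct calls to flagged functions."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in DANGEROUS_CALLS
        ):
            findings.append((node.lineno, node.func.id))
    return sorted(findings)
```

The code under test is parsed but never run, which is why this kind of check is fast enough to execute on every commit.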