
DevOps Interview Questions: CI/CD, Containers, and Infrastructure as Code

Comprehensive DevOps interview preparation covering CI/CD pipelines, Kubernetes, Docker, Terraform, observability, and deployment strategies with real technical depth.

DevOps roles attract a wide range of interview styles, but the technical depth converges around a consistent set of topics: continuous integration and delivery pipelines, container orchestration, infrastructure as code, observability, and the cultural principles that underpin DevOps practice. This article covers the specific technical questions and topics that appear in DevOps engineer interviews, with the framing and depth that distinguishes candidates with real operational experience.

CI/CD Pipeline Questions

Pipeline Design and Deployment Strategies

"Walk me through a CI/CD pipeline you have built or maintained."

This is a common opening question that tests depth of practical experience. Strong answers describe a real pipeline with specific tools, explain the rationale for key decisions, and address how the pipeline handles failures.

A typical pipeline for a containerized application:

Code push to feature branch
    -> Lint and static analysis (eslint, pylint, gosec)
    -> Unit tests
    -> Build Docker image
    -> Push to container registry with commit SHA tag
    -> Integration tests against ephemeral environment
    -> Security scan of image (Trivy, Snyk)
    
Merge to main
    -> All above steps
    -> Tag image as "candidate"
    -> Deploy to staging environment
    -> Run smoke tests
    -> Manual approval gate (for production)
    -> Deploy to production with blue/green or rolling strategy

The interviewer is listening for whether you mention test coverage gates, artifact versioning, environment-specific configuration management, and a rollback strategy.
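The CI stage of a pipeline like this can be sketched in a CI configuration. The following is a minimal illustration assuming GitHub Actions; the registry name, make targets, and workflow layout are placeholders, not a prescribed implementation:

```yaml
# Hypothetical GitHub Actions workflow sketching the CI stage.
# Registry name and build/test commands are placeholders.
name: ci
on:
  push:
    branches: ['**']
jobs:
  build-test-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint and unit tests
        run: |
          make lint
          make test
      - name: Build and push image tagged with the commit SHA
        run: |
          docker build -t registry.example.com/web-app:${GITHUB_SHA} .
          docker push registry.example.com/web-app:${GITHUB_SHA}
      - name: Scan image for vulnerabilities
        run: trivy image --exit-code 1 --severity HIGH,CRITICAL registry.example.com/web-app:${GITHUB_SHA}
```

Failing the job on high-severity scan findings (the --exit-code 1 flag) is what turns the security scan from a report into a gate.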

"What is the difference between a rolling deployment, blue/green deployment, and canary release?"

  • Rolling: gradually replaces old instances with new ones. Benefit: no extra infrastructure cost. Risk: both versions serve traffic simultaneously during the update.
  • Blue/Green: maintains two identical environments and switches traffic between them. Benefit: instant rollback by switching back. Risk: doubles the infrastructure cost.
  • Canary: routes a small percentage of traffic to the new version. Benefit: validates the new version with real traffic before full rollout. Risk: requires traffic-splitting infrastructure.

In Kubernetes, rolling is the default Deployment strategy. Blue/green is commonly implemented by switching a Service selector or a load balancer target group between the two environments. Canary releases are implemented with Flagger, Argo Rollouts, or native service mesh traffic splitting.
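As one concrete illustration, Argo Rollouts replaces the Deployment's strategy with a canary step sequence; the weights and pause durations below are illustrative, not prescriptive:

```yaml
# Sketch of an Argo Rollouts canary strategy (weights and pauses are illustrative).
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web-app
spec:
  replicas: 5
  strategy:
    canary:
      steps:
      - setWeight: 10          # send 10% of traffic to the new version
      - pause: {duration: 5m}  # watch metrics before continuing
      - setWeight: 50
      - pause: {duration: 5m}
      - setWeight: 100
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: web-app
        image: registry.example.com/web-app:abc123
```

In mature setups the pause steps are replaced by automated analysis against metrics, so a failing canary rolls back without human intervention.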

"What is a pipeline artifact and how do you manage artifact versioning?"

An artifact is the output of a build step—a compiled binary, Docker image, JAR file, or zip package. Artifact versioning ensures that the exact build can be reproduced and traced. Common schemes include semantic versioning (1.4.2), build numbers, and git commit SHAs. Container images should be tagged with immutable identifiers (commit SHA) rather than mutable tags like "latest" in production systems.
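A common tagging pattern, shown here as a hypothetical CI step (the registry name and release tag are placeholders): build once, push under the immutable commit SHA, then add a human-readable release tag as an alias for the same image rather than rebuilding.

```yaml
# Hypothetical CI step: one build, an immutable SHA tag, plus a release alias.
- name: Build and tag artifact
  run: |
    IMAGE=registry.example.com/web-app
    docker build -t ${IMAGE}:${GITHUB_SHA} .
    docker push ${IMAGE}:${GITHUB_SHA}
    # The release tag points at the same image; it is never produced by a second build.
    docker tag ${IMAGE}:${GITHUB_SHA} ${IMAGE}:v1.4.2
    docker push ${IMAGE}:v1.4.2
```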

Container and Kubernetes Questions

Docker and Kubernetes Fundamentals

"Explain the difference between a Docker image and a Docker container."

An image is a read-only, layered filesystem snapshot defined by a Dockerfile. A container is a running instance of an image. Multiple containers can run from the same image simultaneously. Images are immutable; containers have a writable layer that is discarded when the container is removed (unless data is written to a mounted volume).

"What is a Kubernetes Pod and how is it different from a container?"

A Pod is the smallest deployable unit in Kubernetes—it contains one or more containers that share a network namespace and storage volumes. Containers within a Pod communicate over localhost. The main container runs the application; sidecar containers provide auxiliary functions (log shipping, service mesh proxy, credential injection). Pods are ephemeral—when a Pod dies, Kubernetes creates a new one with a different IP.
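A minimal sketch of a two-container Pod illustrates the shared storage and localhost networking described above; the names and the log-shipper image are placeholders for whatever sidecar the team actually runs:

```yaml
# Sketch: application container plus a log-shipping sidecar sharing a volume.
apiVersion: v1
kind: Pod
metadata:
  name: web-app
spec:
  containers:
  - name: app
    image: registry.example.com/web-app:abc123
    volumeMounts:
    - name: logs
      mountPath: /var/log/app     # the app writes log files here
  - name: log-shipper             # sidecar: reads what the app writes
    image: fluent/fluent-bit:2.2
    volumeMounts:
    - name: logs
      mountPath: /var/log/app
      readOnly: true
  volumes:
  - name: logs
    emptyDir: {}                  # lives and dies with the Pod
```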

"Explain the relationship between a Deployment, ReplicaSet, and Pod in Kubernetes."

A Deployment is the high-level object that defines the desired state: which container image to run and how many replicas. A Deployment manages a ReplicaSet, which maintains the specified number of Pod replicas. When you update a Deployment (change the image), Kubernetes creates a new ReplicaSet and gradually scales it up while scaling down the old one. This is the rolling update mechanism.

# Simplified Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: web-app
        image: registry.example.com/web-app:abc123
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "500m"
            memory: "256Mi"

Resource requests and limits appear in most Kubernetes interview discussions because they affect scheduling decisions and cluster stability.

"What happens when a Pod's memory limit is exceeded?"

When a container exceeds its memory limit, the kernel's OOM (out-of-memory) killer terminates a process inside it—typically the main process, which kills the container. Kubernetes records this as an OOMKilled event. If the Pod has a restart policy, Kubernetes restarts the container (with backoff). Without memory limits, a runaway container can consume all available node memory, causing node instability and cascading failures across unrelated workloads.

Infrastructure as Code Questions

Terraform State and Modules

"What is Terraform state and why is it important?"

Terraform state is a JSON file that records the mapping between your configuration and the real resources provisioned in the cloud. Terraform compares state against the configuration to determine which changes terraform plan proposes and terraform apply makes. Without state, Terraform cannot know which resources exist or which ones it manages.

Remote state storage (S3 + DynamoDB for AWS, GCS for Google Cloud) is essential for team workflows:

  • S3 stores the state file
  • DynamoDB provides state locking to prevent concurrent modifications
  • Encryption prevents credentials in state from being exposed
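That setup can be sketched as a backend block; the bucket, key, and table names below are placeholders:

```hcl
# Hypothetical remote backend: S3 stores state, DynamoDB provides locking.
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"   # lock table keyed by LockID
    encrypt        = true                # server-side encryption at rest
  }
}
```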

"What is a Terraform module and when would you create one?"

A module is a reusable, self-contained configuration that accepts inputs and produces outputs. Create a module when:

  • The same infrastructure pattern is needed in multiple places (e.g., a standard VPC layout)
  • You want to enforce organization standards (naming conventions, required tags, security defaults)
  • You want to hide complexity from teams consuming the infrastructure

Public modules from the Terraform Registry (official AWS, Azure, GCP modules) provide a starting point, but organizations typically maintain internal modules with their own conventions.

"What is configuration drift and how do you detect and prevent it?"

Configuration drift is the divergence between the desired state defined in code and the actual state of infrastructure—typically caused by manual changes made outside the IaC workflow. Detection: run terraform plan regularly (or in a scheduled pipeline) and alert on non-empty plans. Prevention: restrict who can modify infrastructure directly (IAM policies, cloud guardrails) and enforce all changes through the IaC workflow.
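One common detection setup is a scheduled pipeline that runs terraform plan with -detailed-exitcode, which returns exit code 2 when the plan is non-empty; the workflow below is a sketch assuming GitHub Actions, with the schedule as a placeholder:

```yaml
# Hypothetical scheduled job that fails (and can alert) when drift is detected.
name: drift-detection
on:
  schedule:
    - cron: '0 6 * * *'   # daily
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false
      - name: Fail on non-empty plan (exit code 2 means drift)
        run: terraform plan -input=false -detailed-exitcode
```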

Observability Questions

The Three Pillars and Golden Signals

"The four golden signals—latency, traffic, errors, saturation—are the minimum viable monitoring set for any service. If you cannot answer those four questions about a system, you cannot operate it reliably." — Betsy Beyer, editor of Site Reliability Engineering (O'Reilly Media), Google SRE team

"What is the difference between logging, metrics, and tracing? When do you use each?"

  • Logs: discrete events with context. Best for understanding what happened in a specific transaction.
  • Metrics: numeric measurements over time. Best for alerting on system health and for capacity planning.
  • Traces: end-to-end request flow across services. Best for diagnosing latency issues in distributed systems.

The three together constitute the "three pillars of observability." A production incident typically starts with a metric alert, is investigated using logs, and—for microservices—requires tracing to identify which service in the call chain is causing the problem.

"What are the golden signals of monitoring?"

From the Google Site Reliability Engineering book, the four golden signals are:

  1. Latency: how long requests take, distinguishing successful and failed requests
  2. Traffic: how much demand the system is handling (requests per second)
  3. Errors: rate of failed requests
  4. Saturation: how close the system is to capacity (CPU, memory, queue depth)

Interviewers for SRE-adjacent roles often reference this framework and expect candidates to be familiar with it.

DevOps Culture and Process Questions

Secrets Management and Team Practices

Senior DevOps interviews include questions about team practices:

"How do you manage secrets in a CI/CD pipeline?"

Never store secrets in version control. Common patterns:

  • Environment variables injected by the CI system at runtime (GitHub Actions secrets, GitLab CI variables)
  • Integration with a secrets manager (HashiCorp Vault, AWS Secrets Manager) called at runtime
  • Short-lived credentials via cloud identity (assuming an IAM role in AWS, using Workload Identity in GCP)

The worst pattern is hardcoding credentials in code or configuration files committed to the repository. Static analysis tools like Gitleaks and TruffleHog scan for committed secrets.
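For the first pattern, a hedged GitHub Actions sketch (the secret name and deploy script are placeholders): the CI system injects the secret as an environment variable at runtime, so it never appears in the repository or the workflow file itself.

```yaml
# Hypothetical step: the secret lives in repo/org settings, injected at runtime.
- name: Deploy
  env:
    DEPLOY_TOKEN: ${{ secrets.DEPLOY_TOKEN }}   # resolved by the CI system
  run: ./scripts/deploy.sh   # placeholder script reading DEPLOY_TOKEN from the environment
```

GitHub Actions also masks the secret's value in job logs, which limits accidental exposure through echoed output.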

See also: Technical Interview Formats Explained: What to Expect at Each Stage

References

  1. Kim, G., Humble, J., Debois, P., & Willis, J. (2016). The DevOps Handbook. IT Revolution Press. ISBN: 978-1942788003
  2. Humble, J., & Farley, D. (2010). Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation. Addison-Wesley. ISBN: 978-0321601919
  3. Burns, B., Grant, B., Oppenheimer, D., Brewer, E., & Wilkes, J. (2016). "Borg, Omega, and Kubernetes." ACM Queue, 14(1). https://queue.acm.org/detail.cfm?id=2898444
  4. HashiCorp. (2024). "Terraform Best Practices." https://developer.hashicorp.com/terraform/docs/cloud-docs/recommended-practices
  5. Beyer, B., Jones, C., Petoff, J., & Murphy, N. R. (Eds.). (2016). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media. ISBN: 978-1491929124
  6. Beyer, B., Murphy, N. R., Rensin, D. K., Kawahara, K., & Thorne, S. (2018). The Site Reliability Workbook. O'Reilly Media. ISBN: 978-1492029502
  7. Luksa, M. (2017). Kubernetes in Action. Manning Publications. ISBN: 978-1617293726

Frequently Asked Questions

What topics do DevOps engineer interviews typically cover?

DevOps interviews consistently cover CI/CD pipeline design and tooling (Jenkins, GitHub Actions, GitLab CI), container technology (Docker images, Kubernetes deployments), infrastructure as code (Terraform, CloudFormation), observability (logging, metrics, tracing), and deployment strategies. Behavioral questions about incident response and on-call experience also appear.

What is the difference between blue/green and canary deployments?

Blue/green maintains two identical environments and switches all traffic instantly from the old to the new version, enabling instant rollback by switching back. A canary release routes a small percentage of traffic to the new version, validates it with real traffic, then gradually increases the percentage. Blue/green requires double infrastructure; canary requires traffic splitting infrastructure.

What happens when a Kubernetes Pod exceeds its memory limit?

The Linux OOM killer terminates the container, and Kubernetes records an OOMKilled event. If the Pod has a restart policy (the default is Always), Kubernetes restarts the container with exponential backoff. Without memory limits, a runaway container can consume all available node memory and cause cascading failures across unrelated workloads on the same node.

What is Terraform state locking?

State locking prevents two Terraform operations from modifying state simultaneously, which could corrupt it. When using S3 as a remote backend, a DynamoDB table provides locking by recording an entry when state is in use. Any concurrent plan or apply will wait or fail rather than proceeding with potentially stale state.

What are the four golden signals for monitoring?

From the Google SRE book: Latency (how long requests take), Traffic (request rate), Errors (rate of failed requests), and Saturation (how close the system is to capacity). These four metrics provide a useful baseline for production monitoring and alerting because they cover the dimensions most likely to affect user experience.