DevOps Interview Questions
Practice 110 real interview questions with hidden answers. Think through each one, reveal the model answer, and share the tricky ones.
Bootstrapping Argo CD and Letting It Manage Itself
Argo CD manages your apps. Who manages Argo CD? Walk me through how you would bootstrap it from a fresh cluster and where its own config lives in your repo.
Junior | beginner | GitOps
Structuring a Git Repo for Argo CD Multi-Environment Deployments
How would you structure a Git repo for Argo CD when you have dev, staging, and prod environments?
Junior | beginner | GitOps
AWS VPC Networking Fundamentals
Explain the difference between public and private subnets in AWS VPC. How do instances in private subnets access the internet?
Junior | beginner | AWS
Bash Scripting Basics
What is the shebang line and how do you write a basic bash script?
Junior | beginner | Linux
CI/CD Pipeline Stages
What are the typical stages of a CI/CD pipeline and why is each stage important?
Junior | beginner | CI/CD
Cloud IAM Basics
What is IAM in cloud computing? Explain the concepts of users, roles, and policies.
Junior | beginner | Cloud
Cloud Regions and Availability Zones
What are cloud regions and availability zones? How do they affect application architecture?
Junior | beginner | Cloud
Configuration Management Basics
What is configuration management? Why is it important and what tools are commonly used?
Junior | beginner | Infrastructure
Container Orchestration Basics
What is container orchestration and why do we need it? Name some common orchestration platforms.
Junior | beginner | Kubernetes
Cost Allocation Tags Basics
Can you explain what cost allocation tags are and why teams use them across AWS, GCP, and Azure?
Junior | beginner | FinOps
Docker Container Basics
What is the difference between a Docker image and a container? How do they relate to each other?
Junior | beginner | Docker
Dockerfile Best Practices
What is the difference between a Docker image and a container? What are some Dockerfile best practices?
Junior | beginner | Docker
Environment Variables
What are environment variables and how do you use them in Linux? How do you make them persistent?
Junior | beginner | Linux
Git Staging and Committing
Explain the Git staging area. What is the difference between git add, git commit, and git push?
Junior | beginner | Git
Git Branching Strategies
What are common Git branching strategies? Describe GitFlow or trunk-based development.
Junior | beginner | Git
Git Rebase vs Merge
What is the difference between git rebase and git merge? When would you use each?
Junior | beginner | Git
Git Workflow Strategies
Describe some common Git workflows used in team environments. What are the pros and cons of each?
Junior | beginner | Git
HTTP Methods and Status Codes
What are the main HTTP methods and what do common status codes like 200, 404, and 500 mean?
Junior | beginner | Networking
Internal Developer Platform Purpose
Your team keeps filing tickets for things like creating new services, setting up databases, and getting access to staging environments. Your CTO asks you to fix this. What would you build, and why?
Junior | beginner | Platform Engineering
VirtualService vs DestinationRule
In Istio, what's the difference between a VirtualService and a DestinationRule? When would you use each?
Junior | beginner | Service Mesh
Linux File Permissions
Explain Linux file permissions. What does the permission 'rwxr-xr--' mean?
Junior | beginner | Linux
Using grep for Text Search
How do you use grep to search for text patterns in files? What are some useful flags?
Junior | beginner | Linux
Linux Package Management
How do you install and manage software packages in Linux? What's the difference between apt and yum?
Junior | beginner | Linux
Linux System Logs
Where are system logs stored in Linux and how do you view them?
Junior | beginner | Linux
Litmus Building Blocks: ChaosEngine vs ChaosExperiment
You install Litmus on a cluster and want to kill a pod to see what happens. Walk me through the pieces Litmus gives you, and what is the actual difference between a ChaosExperiment and a ChaosEngine?
Junior | beginner | Chaos Engineering
Four Golden Signals of Monitoring
What are the four golden signals of monitoring and why are they important?
Junior | beginner | Monitoring
DNS Basics
What is DNS and how does the DNS resolution process work step by step?
Junior | beginner | Networking
On-Call Rotation and Escalation Basics
You're about to go on-call for the first time. In your own words, what is an on-call rotation, and why do teams bother setting up a formal escalation policy instead of just pinging whoever happens to be online when something breaks?
Junior | beginner | Incident Management
Traces and Spans Explained
A request hits your API gateway, which calls two backend services, and one of those queries a database. Walk me through what that looks like as a distributed trace. What is a span, and how do spans connect to each other?
Junior | beginner | Observability
Progressive Delivery Basics
What is progressive delivery and how does it differ from traditional continuous delivery?
Junior | beginner | CI/CD
Shell Scripting Fundamentals
What are the essential components of a shell script? Explain variables, conditionals, and loops.
Junior | beginner | Linux
SLO vs SLI vs SLA Differences
Your team just launched a new API service. Your manager asks you to set up SLOs for it. Can you walk me through what SLOs, SLIs, and SLAs are, and how they relate to each other?
Junior | beginner | SRE
SSH Basics and Key Authentication
How does SSH key authentication work? How do you set it up?
Junior | beginner | Linux
DNS Resolution When You Type a URL
Walk me through what happens when you type a URL and press Enter, focusing specifically on the DNS resolution process.
Junior | beginner | System Design
Instant Credit Card Validation
How does a credit card form validate numbers instantly, before even contacting the bank?
Junior | beginner | System Design
TCP/IP Fundamentals
Explain the difference between TCP and UDP. When would you use each?
Junior | beginner | Networking
YAML and JSON Configuration Formats
What are YAML and JSON? When would you use each format in DevOps?
Junior | beginner | DevOps
Activating Tags in Billing Reports
Your team tagged every resource with team and environment, but the tags are not showing up in cost reports. What is going on, and how is this different on AWS, GCP, and Azure?
Mid | intermediate | FinOps
Application Code vs Kubernetes Manifests in Separate Repos
Should your application source code and Kubernetes manifests live in the same repo or in separate repos? What is your take?
Mid | intermediate | GitOps
Helm vs Kustomize for Per-Environment Config
Helm or Kustomize for handling environment differences in an Argo CD repo? Walk me through how you would pick and what the repo layout looks like for each.
Mid | intermediate | GitOps
Using Kustomize Overlays for Per-Environment Config
Walk me through how you would use Kustomize overlays to handle config that differs between dev and prod in an Argo CD setup.
Mid | intermediate | GitOps
Organizing a Shared Manifests Repo for Multiple Teams
Three teams share one Argo CD instance and one manifests repo. How do you lay it out so each team can ship without waiting on each other and can't accidentally deploy each other's stuff?
Mid | intermediate | GitOps
Handling Secrets in an Argo CD Manifests Repo
Secrets can't go into Git in plaintext. How do you handle them in an Argo CD setup, and what does that look like in your repo structure?
Mid | intermediate | GitOps
Blue-Green Deployment Strategy
What is a blue-green deployment, and what are its advantages and disadvantages compared to other deployment strategies?
Mid | intermediate | CI/CD
CI/CD Pipeline Design
How would you design a CI/CD pipeline for a microservices application? What stages would you include?
Mid | intermediate | CI/CD
Database Backup and Recovery
Describe database backup strategies and how you would design a recovery plan for production databases.
Mid | intermediate | Infrastructure
Designing an On-Call Schedule
You've got six engineers split across two time zones and you need 24/7 coverage. How would you actually design the rotation? Walk me through the trade-offs you'd weigh.
Mid | intermediate | Incident Management
Docker Image Layers and Caching
How do Docker image layers work, and how can you optimize your Dockerfile to take advantage of layer caching?
Mid | intermediate | Docker
Enforcing Tagging Policies Across Clouds
How do you actually enforce that every resource in AWS, GCP, and Azure gets the required tags? Walk me through what you would put in place.
Mid | intermediate | FinOps
Designing an Escalation Policy
An alert fires at 3am and pages the primary on-call. Walk me through what your escalation policy should do from that moment, step by step, and tell me what failure modes you're designing around.
Mid | intermediate | Incident Management
Essential Tags for Multi-Cloud Cost Allocation
If you were designing a tagging standard for a company running on AWS, GCP, and Azure, which tags would you require on every resource and why?
Mid | intermediate | FinOps
GitOps Principles and Implementation
What is GitOps and how does it differ from traditional CI/CD? Explain the pull-based deployment model.
Mid | intermediate | DevOps
Designing Golden Paths
We want our developers to follow best practices, but we do not want to slow them down with mandatory reviews for every infrastructure change. How would you design 'golden paths' in an internal developer platform?
Mid | intermediate | Platform Engineering
Service Catalog and Ownership
You are building a service catalog for your IDP. Your company has 200 microservices, and nobody knows who owns half of them. How do you fix this, and what does a good service catalog look like?
Mid | intermediate | Platform Engineering
Immutable Infrastructure
Explain immutable infrastructure and its benefits. How does it differ from traditional server management?
Mid | intermediate | Infrastructure
Infrastructure as Code Patterns
What are the key principles and patterns of Infrastructure as Code? How do you structure IaC for multiple environments?
Mid | intermediate | Infrastructure
Istio Retries and Retry Amplification
How do you configure retries in Istio, and what's the danger of being too aggressive with them?
Mid | intermediate | Service Mesh
Weighted Canary Rollout with Istio
Walk me through how you'd canary a new version of a service with Istio. Say you want to start at 5% traffic to v2 and ramp up.
Mid | intermediate | Service Mesh
Kubernetes Kubelet
What is the role of the kubelet in a Kubernetes cluster? How does it interact with the control plane?
Mid | intermediate | Kubernetes
Kubernetes Pod Lifecycle
Explain the different phases of a Kubernetes Pod lifecycle and what happens during each phase.
Mid | intermediate | Kubernetes
Kubernetes Services and Networking
Explain the different types of Kubernetes Services (ClusterIP, NodePort, LoadBalancer) and when to use each.
Mid | intermediate | Kubernetes
Running Your First Pod-Delete Experiment Safely
I hand you a fresh cluster with a demo nginx deployment. Take me from nothing to a controlled pod-delete experiment. What are the steps, and how do you keep it from turning into an outage?
Mid | intermediate | Chaos Engineering
How Litmus Decides Pass or Fail: Probes
Your pod-delete experiment shows Pass, but during the run users got 502s for about 20 seconds. How can the experiment pass while the service was actually down, and how do you fix that?
Mid | intermediate | Chaos Engineering
Log Aggregation Strategies
How do you implement centralized logging in a distributed system? What are the key components?
Mid | intermediate | Monitoring
Monitoring and Alerting Strategy
How do you design a monitoring and alerting strategy? What metrics would you track and how do you avoid alert fatigue?
Mid | intermediate | Observability
Auto vs Manual Instrumentation
You need to roll out tracing across 40 services owned by six different teams. Do you go with auto-instrumentation or manual instrumentation, and how do you decide?
Mid | intermediate | Observability
Context Propagation Across Services
Service A calls service B over HTTP. How does B know the request belongs to an existing trace? What actually travels on the wire?
Mid | intermediate | Observability
Application Performance Optimization
How do you identify and resolve performance bottlenecks in a production application?
Mid | intermediate | SRE
Canary Releases in Progressive Delivery
You're deploying a new version of a critical payment service. Walk me through how you'd set up a canary release for it.
Mid | intermediate | CI/CD
Feature Flag Types and Use Cases
Can you walk me through the different types of feature flags and when you'd use each one?
Mid | intermediate | CI/CD
Secrets Management
How do you securely manage secrets (passwords, API keys, certificates) in a DevOps environment?
Mid | intermediate | Security
Service Mesh Concepts
What is a service mesh and when would you implement one? Explain the sidecar pattern.
Mid | intermediate | Kubernetes
SLI, SLO, and SLA Definitions
Explain the difference between SLI, SLO, and SLA with examples.
Mid | intermediate | SRE
Choosing the Right SLIs
You're joining a team that runs a checkout service for an e-commerce platform. There are no SLOs yet. How would you decide which SLIs to track?
Mid | intermediate | SRE
Error Budget Management
Your service has a 99.9% availability SLO over a 30-day window. How much downtime does that give you, and what do you actually do with that error budget day-to-day?
Mid | intermediate | SRE
Username Availability with Bloom Filters
Explain how you'd check username availability for a service with billions of users without hitting the database on every keystroke.
Mid | intermediate | System Design
CDN Image Delivery Under 50ms
Explain how a CDN serves images to users worldwide in under 50ms.
Mid | intermediate | System Design
Terraform State Management
What is Terraform state, why is it important, and how do you manage state in a team environment?
Mid | intermediate | Terraform
App-of-Apps Pattern vs ApplicationSets
Explain the app-of-apps pattern in Argo CD. When would you pick it over ApplicationSets?
Senior | advanced | GitOps
Promoting Changes from Dev to Staging to Prod with Argo CD
Walk me through how a change moves from dev to staging to prod in an Argo CD setup. How do you stop something from slipping into prod that should not be there?
Senior | advanced | GitOps
Scaling a Manifests Repo Across Many Services, Environments, and Clusters
You have 15 microservices, 4 environments (dev, staging, preprod, prod), and prod runs in 3 regional clusters. That's potentially 90 Application CRs. How do you structure the manifests repo so it doesn't fall apart?
Senior | advanced | GitOps
Capacity Planning and Scaling
How do you approach capacity planning for a growing production system? What metrics and strategies do you use?
Senior | advanced | SRE
Chaos Engineering Practices
What is chaos engineering and how would you implement it safely in a production environment?
Senior | advanced | SRE
Chargeback Design for Shared and Untaggable Costs
You are designing a chargeback system for 200 teams running across AWS, GCP, and Azure. How do you handle the 15 to 25 percent of the bill that cannot be tagged to a single team, like inter-AZ data transfer, support plans, NAT gateways, and shared Kubernetes clusters?
Senior | advanced | FinOps
Cloud Cost Optimization
How do you approach cloud cost optimization? What strategies and tools would you use?
Senior | advanced | Cloud
Compliance and Governance in Cloud
How do you implement compliance and governance controls in a cloud-native environment?
Senior | advanced | Security
Disaster Recovery Planning
How do you design a disaster recovery strategy? Explain RPO, RTO, and different DR approaches.
Senior | advanced | Architecture
FinOps and Cloud Cost Management
How do you implement FinOps practices to optimize and manage cloud costs at scale?
Senior | advanced | Cloud
Measuring IDP Success and Adoption
You have spent six months building an internal developer platform. Your VP of Engineering asks: 'Is this thing actually working? How do we know it was worth the investment?' What do you show them?
Senior | advanced | Platform Engineering
Platform API and Orchestration Layer
You are designing the orchestration layer for your IDP. A developer clicks 'Create New Service' and behind the scenes it needs to create a GitHub repo, provision a database, set up a CI/CD pipeline, configure monitoring, and register the service in the catalog. How do you architect this?
Senior | advanced | Platform Engineering
Incident Postmortems
Describe a production incident you handled and how you structured the postmortem. What makes a good blameless postmortem?
Senior | advanced | SRE
Istio Circuit Breakers and Outlier Detection
How do you implement a circuit breaker in Istio? Explain the difference between the connection pool limits and outlier detection.
Senior | advanced | Service Mesh
Debugging an Istio Traffic Policy That Isn't Working
You applied a VirtualService that splits traffic 80/20 between v1 and v2 of a service, but in production all traffic still goes to v1. Walk me through how you'd debug it.
Senior | advanced | Service Mesh
Linux Process Debugging
A process is consuming 100% CPU on a Linux server. Walk me through how you would identify and troubleshoot this issue.
Senior | advanced | Linux
From One Experiment to Continuous Chaos at Scale
You have proven a single pod-delete works. Leadership now wants chaos as an ongoing practice across dozens of services and several clusters, not a one-off demo. What changes, and what would you do differently at scale?
Senior | advanced | Chaos Engineering
Scoping Litmus Safely: RBAC and Blast Radius
Your security team sees that Litmus can delete pods and inject network faults cluster-wide, and they want it gone. How do you scope Litmus so you can still run chaos in production without handing it the keys to the cluster?
Senior | advanced | Chaos Engineering
Multi-Cloud Architecture
When would you recommend a multi-cloud strategy? What are the challenges and how do you address them?
Senior | advanced | Architecture
Why Run an OpenTelemetry Collector
Why run an OpenTelemetry Collector at all instead of having every application export directly to your tracing backend? And how would you deploy it in Kubernetes?
Senior | advanced | Observability
Sampling Strategies at Scale
Your platform handles 50,000 requests per second and tracing every one of them is blowing up the observability bill. How do you approach sampling, and what is the tradeoff between head and tail sampling?
Senior | advanced | Observability
Platform Team Scaling and Processes
How do you scale a platform/DevOps team to support a growing engineering organization?
Senior | advanced | DevOps
Feature Flag Architecture at Scale
Your team has 200+ microservices and wants to adopt feature flags across all of them. How would you design the feature flag infrastructure?
Senior | advanced | CI/CD
Progressive Delivery Rollback Strategy
Your team just enabled a new feature for 20% of users via feature flags, and your monitoring shows a 3x increase in p99 latency for those users. Walk me through exactly what you'd do in the next 10 minutes.
Senior | advanced | CI/CD
Reducing On-Call Alert Fatigue
Your on-call engineers are burning out. They're getting 40 to 50 pages a shift and they tell you most of it is noise they just ack and ignore. How do you fix this?
Senior | advanced | Incident Management
Scaling an On-Call Program Across Many Teams
You've been asked to design the on-call program for an org that grew from one team to fifteen in a year. Right now it's a free-for-all. What does a healthy on-call program look like at that scale, and how would you measure whether it's working?
Senior | advanced | Incident Management
Security Architecture and DevSecOps
How do you integrate security into the DevOps pipeline? Describe the key components of a secure architecture.
Senior | advanced | Security
Error Budget Burn Investigation
It's Monday morning. You check the dashboard and see that your service burned 80% of its monthly error budget over the weekend. Walk me through how you'd investigate this and what you'd do next.
Senior | advanced | SRE
SLO-Based Alerting and Burn Rates
Traditional alerting fires when error rate crosses a static threshold, like 'alert if errors > 1%'. What's wrong with that approach, and how would you set up SLO-based alerting instead?
Senior | advanced | SRE
Search Autocomplete System Design
Design the backend for a search autocomplete system that returns suggestions within 100ms as the user types.
Senior | advanced | System Design
System Design for Reliability
How would you design a highly available web application? What components and patterns would you use?
Senior | advanced | Architecture
Zero Trust Architecture
What is Zero Trust Architecture and how do you implement it in a modern infrastructure?
Senior | advanced | Security