You've set up your Kubernetes deployments with care—resource requests, readiness probes, rolling update strategies. Everything looks good in the YAML. But weeks later, pods start failing in strange ways. Some are evicted, some restart randomly, and a few just hang. You check the manifests—they haven't changed. Yet the cluster behaves differently. What happened?
The answer is pod drift: the gradual, often invisible divergence between your declared configuration and what actually runs. It's not a single catastrophic failure but a series of small misalignments that compound over time. In this guide, we'll walk through three little-known causes of pod drift that break deployment peace, how to spot them, and how to fix them before they cause real damage.
1. The Hidden Cost of Resource Limits: When Requests and Limits Don't Match
One of the first things teams configure is CPU and memory requests and limits. The pattern is simple: set requests for guaranteed resources, set limits to cap usage. But the relationship between these values and how the scheduler, kubelet, and kernel treat them is often misunderstood.
How Resource Limits Actually Work
When you set a CPU limit, the kubelet translates it into a cgroup CPU quota, and the kernel throttles the container whenever it uses up that quota within a scheduling period (100 ms by default). Throttling is not eviction: the container keeps running but is paused until the next period begins. Because the quota is enforced per period rather than on average, a bursty workload can suffer severe latency spikes even when its average usage is well below the limit. This is especially problematic for latency-sensitive applications like web servers or real-time APIs.
The Drift Scenario: Unintended Throttling
Consider a team that sets CPU request to 500m and limit to 1000m for a web service. During a traffic spike, the container bursts to 800m—well within the limit. But the kernel's Completely Fair Scheduler (CFS) quota may throttle the container if it uses its quota in short bursts, leading to increased response times. The team sees degraded performance but no obvious resource exhaustion. They might scale up replicas, but the root cause is the limit itself, not the request.
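To make the burst effect concrete, here is a minimal simulation. It is a toy model, not the real kernel scheduler: it assumes the default 100 ms CFS period, a multi-threaded process (so it can burn more than 1 ms of CPU time per millisecond of wall time), and a hypothetical demand pattern. The name `simulate_cfs` is purely illustrative.

```python
def simulate_cfs(limit_millicores, demand_ms_per_period, period_ms=100):
    """Toy model of CFS bandwidth control: each period grants a fixed CPU-time quota."""
    quota_ms = limit_millicores * period_ms / 1000  # e.g. 1000m -> 100 ms per 100 ms period
    backlog_ms = 0.0        # work that has arrived but could not run yet
    used_ms = 0.0
    throttled_periods = 0
    for demand_ms in demand_ms_per_period:
        backlog_ms += demand_ms
        ran_ms = min(backlog_ms, quota_ms)  # cannot run past the quota within one period
        backlog_ms -= ran_ms
        used_ms += ran_ms
        if backlog_ms > 0:                  # quota exhausted while work is still pending
            throttled_periods += 1
    periods = len(demand_ms_per_period)
    return {
        "throttled_periods": throttled_periods,
        "avg_millicores": used_ms * 1000 / (periods * period_ms),
        "backlog_ms": backlog_ms,
    }


# Hypothetical bursty workload: 160 ms of CPU work arrives every other period.
result = simulate_cfs(1000, [160, 0] * 5)
print(result)
```

In this run the average usage is 800m, below the 1000m limit, yet half of the periods end with the container throttled. The work pushed into the following period is what surfaces as tail latency.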
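Memory limits are even trickier. When a container exceeds its memory limit, the kernel's OOM killer terminates a process inside the container's cgroup. If that is the main process, the container exits with reason OOMKilled and the kubelet restarts it; if it is a child worker, the container may keep running in a degraded state. Either way the result can look like a partial failure: some requests succeed, others get connection resets. The restarted container runs under the same limit, creating a cycle of OOM kills. The deployment appears to be running, but its availability is compromised.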
The fix is not to remove limits—that can lead to noisy neighbors. Instead, align limits with realistic burst patterns. Use tools like Vertical Pod Autoscaler (VPA) in recommendation mode to understand actual usage. Set CPU limits only when necessary for QoS guarantees, and consider using a higher limit for memory to avoid OOM kills. Also, monitor throttling metrics (container_cpu_cfs_throttled_seconds_total) to detect drift early.
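One way to turn throttling counters into an alertable number is the share of CFS periods in which the container was throttled. The sketch below assumes your metrics pipeline also exposes the companion cAdvisor counters `container_cpu_cfs_periods_total` and `container_cpu_cfs_throttled_periods_total`, and that you have two samples of each; the dictionary keys and the function name are hypothetical.

```python
def throttle_ratio(earlier, later):
    """Fraction of CFS periods that were throttled between two counter samples."""
    periods = later["periods"] - earlier["periods"]
    throttled = later["throttled_periods"] - earlier["throttled_periods"]
    if periods <= 0:
        return 0.0  # no elapsed periods, or a counter reset: nothing meaningful to report
    return throttled / periods


# Hypothetical samples taken a minute apart.
t0 = {"periods": 1000, "throttled_periods": 40}
t1 = {"periods": 1600, "throttled_periods": 190}
ratio = throttle_ratio(t0, t1)
print(f"throttled in {ratio:.0%} of periods")
```

A ratio that creeps upward while the manifest stays unchanged is a useful early signal that the workload has drifted away from the limit it was sized for.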
2. The Silent Default Override: Admission Controllers That Change Your Specs
You write a clean deployment YAML with minimal fields. But when you apply it, the actual pod spec includes extra sidecars, different resource limits, or modified security contexts. This is often the work of admission controllers—webhooks or policies that mutate or validate resources before they are persisted.
Common Culprits: MutatingWebhookConfigurations
Many clusters run admission controllers like Istio's sidecar injector, the Kyverno policy engine, or custom mutating webhooks. These tools are powerful but can silently change pod specs. For example, a mutating webhook might add a 512Mi memory limit to every container that omits one, which changes how far a container you sized around a 256Mi request can grow before it is killed. Or it might inject a sidecar that consumes CPU and memory you didn't account for.
The Drift Scenario: Unseen Sidecars
Imagine a team that deploys a service mesh sidecar (e.g., Envoy) via a mutating webhook. The sidecar requests 256Mi of memory per pod. The main container requests 128Mi with a 256Mi limit. The pod's total memory request becomes 384Mi, three times what the team planned for, while the only memory limit is still the 256Mi on the main container, because limits apply per container. The sidecar's limit is unset, so it can grow until the node itself runs short of memory. At that point the kubelet may evict the pod, or the kernel may OOM-kill processes on the node, including the main container. The team sees random OOM kills and assumes the application has a memory leak, but the real issue is the sidecar's unconstrained memory usage.
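The arithmetic in this scenario can be checked with a few lines of code. The sketch below assumes the sidecar's 256Mi overhead is expressed as a memory request with no limit, and it parses only a small subset of Kubernetes quantity suffixes; the container names and helper functions are hypothetical.

```python
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_memory(quantity):
    """Parse a small subset of Kubernetes memory quantities (Ki, Mi, Gi, or plain bytes)."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def memory_summary(containers):
    """Sum memory requests and list the containers that have no memory limit."""
    total_request = sum(parse_memory(c["request"]) for c in containers if c.get("request"))
    unlimited = [c["name"] for c in containers if not c.get("limit")]
    return {"request_mi": total_request // 2**20, "unlimited": unlimited}


pod = [
    {"name": "app", "request": "128Mi", "limit": "256Mi"},
    {"name": "envoy-sidecar", "request": "256Mi"},  # injected by the webhook; no limit set
]
summary = memory_summary(pod)
print(summary)
```

Any container that shows up in `unlimited` is bounded only by what the node has left, which is the condition that makes the main container's OOM kills look random.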
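To prevent this drift, audit all mutating webhooks in your cluster. Use `kubectl describe mutatingwebhookconfiguration` to see which resources and operations each one intercepts. Compare a server-side dry run (`kubectl apply --dry-run=server -f pod.yaml -o yaml`), which passes through admission webhooks that support dry-run, with the spec of the running pod. Keep in mind that a webhook registered for pods fires when the pod is created, so a dry run of a Deployment manifest will not show injected sidecars; dry-run a bare Pod manifest or inspect a live pod instead. Set resource limits on sidecars explicitly, either in the webhook configuration or by patching the deployment. Also, enable logging for admission webhooks (for example through API server audit logs) to track changes over time.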
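Once you have both specs as data, the comparison itself is simple. The following sketch assumes `declared` and `live` are lists shaped like `spec.containers` in the JSON form of a pod; the function name and the sample values are hypothetical.

```python
def diff_containers(declared, live):
    """Compare declared containers with those in the live pod spec, keyed by container name."""
    declared_by_name = {c["name"]: c for c in declared}
    live_by_name = {c["name"]: c for c in live}
    # Containers that exist only in the live pod were added by something else.
    injected = sorted(set(live_by_name) - set(declared_by_name))
    changed = {}
    for name in sorted(set(declared_by_name) & set(live_by_name)):
        want = declared_by_name[name].get("resources", {})
        got = live_by_name[name].get("resources", {})
        if want != got:
            changed[name] = {"declared": want, "live": got}
    return {"injected": injected, "resources_changed": changed}


declared = [{"name": "app", "resources": {"requests": {"memory": "128Mi"}}}]
live = [
    {"name": "app", "resources": {"requests": {"memory": "128Mi"},
                                  "limits": {"memory": "512Mi"}}},
    {"name": "istio-proxy", "resources": {}},
]
report = diff_containers(declared, live)
print(report)
```

The sketch deliberately looks only at container names and `resources`; a fuller check would also cover security contexts, volumes, and init containers.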
3. The Scheduling Drift: Pod Disruption Budgets That Don't Protect
Pod Disruption Budgets (PDBs) are meant to ensure a minimum number of pods are available during voluntary disruptions like node drains or cluster upgrades. But misconfigured PDBs can cause scheduling drift that makes deployments brittle.
How PDBs Interact with Scheduling
A PDB with `minAvailable: 2` tells the eviction API to refuse any voluntary eviction that would leave fewer than 2 healthy pods. It does not guarantee that 2 pods are always running: involuntary disruptions such as node crashes are outside its control. If the deployment has 3 replicas and a node drain evicts 1 pod, the PDB blocks eviction of the remaining 2 until a replacement becomes ready. The cluster autoscaler respects PDBs too: when it wants to remove an underused node, it will not evict a pod whose PDB is already at its minimum, so the scale-down is blocked. This can leave nodes with low utilization that cannot be removed, causing resource fragmentation.
The Drift Scenario: Stale PDBs
Consider a deployment that originally had 5 replicas with a PDB of `minAvailable: 3`. Over time, the team scales down to 3 replicas for cost savings. The PDB still requires 3 available pods, meaning no pod can be evicted voluntarily. If a node needs maintenance, the drain will hang because the PDB cannot be satisfied. The team might force-delete pods, causing downtime. Or the PDB might prevent the cluster autoscaler from removing a node, leading to higher costs.
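The stale-PDB arithmetic is easy to reproduce. This sketch handles integer values only (percentages are left out) and mirrors the idea behind the `disruptionsAllowed` field in a PDB's status; the function itself is hypothetical, not a Kubernetes API.

```python
def disruptions_allowed(healthy_pods, expected_pods, min_available=None, max_unavailable=None):
    """Integer-only sketch of how many voluntary evictions a PDB permits right now."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = min_available                 # fixed floor, ignores replica count
    else:
        desired_healthy = expected_pods - max_unavailable  # floor moves with replica count
    return max(0, healthy_pods - desired_healthy)


print(disruptions_allowed(5, 5, min_available=3))    # original sizing
print(disruptions_allowed(3, 3, min_available=3))    # after scaling down to 3 replicas
print(disruptions_allowed(3, 3, max_unavailable=1))  # same replicas, relative budget
```

With `minAvailable: 3` the budget drops from two allowed evictions to zero once the deployment is scaled to 3 replicas, while a `maxUnavailable: 1` budget still permits one.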
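Another subtle case involves rolling updates. The Deployment controller does not go through the eviction API, so a PDB never blocks a rollout directly. But pods that are unavailable because of a rollout still count against the budget. With a PDB of `maxUnavailable: 1` and a rolling update strategy of `maxSurge: 0` (which forces the strategy's own `maxUnavailable` to be at least 1), the rollout itself uses up the single allowed disruption. A node drain or autoscaler scale-down that overlaps with the rollout is refused until the rollout finishes, and if the new pods never become ready, the drain waits indefinitely. The team sees maintenance that appears stuck, with no errors in the application logs, only evictions being retried.
To avoid this, review PDBs whenever you change replica counts. Use `kubectl describe pdb` to check the current status, including how many disruptions are allowed. Consider using `maxUnavailable` instead of `minAvailable` for deployments with flexible scaling, since it tracks the replica count. Also, set `maxSurge` to at least 1 so that rollouts add new pods before removing old ones, which leaves headroom in the budget. Monitor cluster events for evictions blocked by a PDB.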
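The interplay between rollout settings and a disruption budget can be estimated ahead of time. The sketch below assumes the Deployment controller keeps at least `replicas - maxUnavailable` pods available during a rolling update and that the PDB is expressed as a minimum number of healthy pods; the function name and parameters are hypothetical.

```python
def eviction_headroom_during_rollout(replicas, max_surge, max_unavailable, pdb_min_available):
    """Evictions a PDB would still allow at the worst point of a rolling update."""
    if max_surge == 0 and max_unavailable == 0:
        # Kubernetes rejects this combination: the rollout could never make progress.
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    worst_case_available = replicas - max_unavailable
    return max(0, worst_case_available - pdb_min_available)


# 3 replicas, PDB equivalent to minAvailable: 2.
print(eviction_headroom_during_rollout(3, max_surge=0, max_unavailable=1, pdb_min_available=2))
print(eviction_headroom_during_rollout(3, max_surge=1, max_unavailable=0, pdb_min_available=2))
```

With no surge, the rollout alone can consume the whole budget, so any eviction requested at the same time has to wait. Surging by one pod keeps one eviction available throughout.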
4. Anti-Patterns That Make Drift Worse
When teams encounter drift, they often reach for quick fixes that compound the problem. Here are three anti-patterns to avoid.
Anti-Pattern 1: Overriding Limits with Namespace Defaults
Some teams set default resource limits at the namespace level using LimitRange resources. While this ensures every pod has limits, it silently fills in any value a manifest leaves out. If a developer sets a request of 100m CPU and omits the limit, and the namespace default limit is 500m, the container gets a limit of 500m, potentially allowing it to consume more than intended. The drift is silent: the manifest never mentions 500m, and the pod runs fine until it bursts and causes node pressure.
Better approach: Use LimitRange only for minimum requests, not maximum limits. Or use VPA to set recommendations, not hard limits. Always validate that the final pod spec matches expectations.
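A small model of the defaulting rules helps predict what a LimitRange will do to a given container. This is a sketch of the documented behavior for a single resource, with values in millicores; it ignores the LimitRange `min`, `max`, and ratio constraints, and the function name is hypothetical.

```python
def apply_limitrange_defaults(container, default_limit, default_request):
    """Sketch of LimitRange defaulting for one resource on one container."""
    request = container.get("request")
    limit = container.get("limit")
    if request is None and limit is None:
        # Nothing declared: both values come from the namespace defaults.
        return {"request": default_request, "limit": default_limit}
    if limit is None:
        # Only a request declared: the limit is filled in from the namespace default.
        return {"request": request, "limit": default_limit}
    if request is None:
        # Only a limit declared: the request falls back to the container's own limit.
        return {"request": limit, "limit": limit}
    # Both declared: explicit values are left alone.
    return {"request": request, "limit": limit}


effective = apply_limitrange_defaults({"request": 100}, default_limit=500, default_request=250)
print(effective)
```

Running every container in a manifest through a check like this before applying it makes the filled-in values visible in review instead of only in the live pod.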
Anti-Pattern 2: Relying on Cluster Autoscaler to Fix Overcommit
The cluster autoscaler reacts to pods that cannot be scheduled because their requests do not fit on any node; it does not look at actual usage. If requests are set well above what pods really use, the autoscaler keeps adding nodes while existing ones sit mostly idle, and the cluster grows larger than necessary. If requests are set too low, nodes end up overcommitted and adding nodes does nothing for the pods already packed together. Either way the drift is in resource utilization, not in pod specs. To detect it, compare node utilization metrics with pod requests. Use tools like Goldilocks to right-size requests.
Anti-Pattern 3: Manual Edits to Running Pods
When a pod fails, some engineers exec into the container and change configuration files or restart processes. This creates a divergence between the running pod and the deployment manifest. The next time the pod restarts (due to a node issue or update), the changes are lost—but the team might not realize it. This leads to inconsistent behavior across replicas. The fix is to never modify running pods directly; always update the deployment and let the controller handle the rollout.
5. Long-Term Costs of Ignoring Pod Drift
Pod drift doesn't just cause immediate failures; it erodes system reliability over time. Here are the hidden costs.
Increased Mean Time to Recovery (MTTR)
When a deployment fails unexpectedly, teams spend hours debugging because the running state differs from the declared state. They check manifests, find nothing wrong, and then dig into logs and metrics. If they don't suspect drift, they may chase red herrings—like assuming a memory leak when it's actually OOM due to a missing sidecar limit. This increases MTTR and incident fatigue.
Configuration Sprawl
As drift accumulates, teams create workarounds: extra init containers, custom scripts, or manual overrides. These become part of the deployment process but are not documented. New team members inherit a system that works for unknown reasons. Configuration becomes fragile, and any change risks breaking something.
Cost Inefficiency
Drift often leads to over-provisioning. For example, if a PDB prevents node scale-down, you pay for idle capacity. If resource requests are too high, nodes fill up faster, triggering autoscaling. Even a modest gap between requested and used resources compounds into a noticeable share of the cloud bill over a year. And because the drift is gradual, it's hard to attribute the cost increase to a specific change.
To quantify drift costs, track the difference between requested and actual resource usage per namespace. Use tools like Kubecost or OpenCost to identify namespaces with high over-provisioning. Set budget alerts for cost anomalies.
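If you export per-pod requests and usage, the per-namespace gap is a short aggregation. The sketch below assumes CPU values in millicores and a hypothetical record shape (`namespace`, `cpu_request_m`, `cpu_used_m`); the sample figures are made up for illustration.

```python
from collections import defaultdict


def overprovisioning_by_namespace(pods):
    """Return requested-minus-used CPU (millicores) and the unused share per namespace."""
    requested = defaultdict(int)
    used = defaultdict(int)
    for pod in pods:
        requested[pod["namespace"]] += pod["cpu_request_m"]
        used[pod["namespace"]] += pod["cpu_used_m"]
    report = {}
    for namespace, req in requested.items():
        slack = req - used[namespace]
        report[namespace] = {
            "slack_m": slack,
            "unused_share": slack / req if req else 0.0,
        }
    return report


pods = [
    {"namespace": "payments", "cpu_request_m": 500, "cpu_used_m": 100},
    {"namespace": "payments", "cpu_request_m": 500, "cpu_used_m": 150},
    {"namespace": "batch", "cpu_request_m": 200, "cpu_used_m": 180},
]
report = overprovisioning_by_namespace(pods)
print(report)
```

Tracking `unused_share` over time, rather than as a one-off snapshot, is what reveals gradual drift.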
6. When Not to Use These Fixes
Not every configuration should be hardened against drift. Some scenarios call for flexibility.
Short-Lived or Batch Workloads
For jobs that run for minutes and are idempotent, drift is less critical. A batch job that gets OOM-killed can be retried. The overhead of auditing admission controllers or setting precise limits may not be worth it. Focus on making these jobs restartable rather than drift-proof.
Development Clusters
In dev environments, rapid iteration and experimentation are more important than stability. Mutating webhooks that add debugging sidecars are fine. Resource limits can be loose. The priority is to reduce friction, not to enforce consistency. But be aware that patterns from dev can drift into production if not reviewed.
Clusters with Immutable Infrastructure
If your cluster is rebuilt from scratch for every deployment (e.g., using GitOps with ArgoCD and a fresh cluster per environment), drift is minimized because the entire state is recreated. In this model, you can afford to skip some drift detection—but still monitor for admission controller changes.
The key is to match your drift prevention effort to the criticality of the workload. For mission-critical services, invest in automated validation. For ephemeral experiments, accept some drift as a trade-off for speed.
7. Open Questions / FAQ
How do I detect pod drift early?
Use a combination of dry-run comparisons, admission webhook logging, and continuous validation tools like Conftest or OPA Gatekeeper. Compare the output of `kubectl get pod
Can I prevent drift with GitOps?
GitOps tools like ArgoCD or Flux keep the cluster state in sync with a Git repository. They can detect drift and revert changes. However, they don't prevent admission controllers from mutating resources. You still need to audit webhooks. GitOps is a safety net, not a cure-all.
What's the single most effective step to reduce drift?
Audit your mutating admission webhooks. Run `kubectl get mutatingwebhookconfigurations` and review each one. Disable any that are unnecessary. For essential webhooks, ensure they set resource limits on injected sidecars and log changes. This alone can eliminate a large class of drift issues.
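The review can be partly automated by parsing the JSON form of that list. The sketch below assumes the input has the shape of a `MutatingWebhookConfigurationList` (`items[].webhooks[].rules[]`) and matches only the exact resource name `pods` or the wildcard, so subresource patterns are not handled; the sample configuration names are hypothetical.

```python
def webhooks_touching_pods(webhook_list):
    """List mutating webhooks whose rules match pod creation or update."""
    findings = []
    for config in webhook_list.get("items", []):
        for hook in config.get("webhooks", []):
            for rule in hook.get("rules", []):
                resources = rule.get("resources", [])
                operations = rule.get("operations", [])
                matches_pods = any(r in ("pods", "*") for r in resources)
                matches_write = any(op in ("CREATE", "UPDATE", "*") for op in operations)
                if matches_pods and matches_write:
                    findings.append({
                        "config": config["metadata"]["name"],
                        "webhook": hook["name"],
                        "failurePolicy": hook.get("failurePolicy", "Fail"),
                    })
                    break  # one matching rule is enough for this webhook
    return findings


sample = {
    "items": [
        {"metadata": {"name": "mesh-injector"},
         "webhooks": [{"name": "sidecar-injector.example.com",
                       "failurePolicy": "Ignore",
                       "rules": [{"operations": ["CREATE"], "resources": ["pods"]}]}]},
        {"metadata": {"name": "deploy-labeler"},
         "webhooks": [{"name": "labels.example.com",
                       "rules": [{"operations": ["CREATE", "UPDATE"],
                                  "resources": ["deployments"]}]}]},
    ]
}
findings = webhooks_touching_pods(sample)
print(findings)
```

The output gives you a short list of webhooks to inspect by hand; whether each one sets limits on what it injects still has to be checked in the webhook's own configuration.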
Should I use PodSecurityPolicy (PSP) to prevent drift?
PSP was deprecated in Kubernetes 1.21 and removed in 1.25, and it never covered resource limits. Instead, use OPA Gatekeeper or Kyverno to enforce policies that prevent drift, such as requiring resource limits on all containers or flagging pods whose mutated spec no longer meets your rules.
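The kind of rule such a policy engine enforces can be sketched in plain code. This is an illustration of the check only, not Gatekeeper or Kyverno syntax; it assumes a pod spec in its JSON form and defaults to requiring memory limits, with the required resources left as a parameter.

```python
def containers_missing_limits(pod_spec, required=("memory",)):
    """Return {container name: [resources without a limit]} for a pod spec."""
    missing = {}
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    for container in containers:
        limits = container.get("resources", {}).get("limits", {})
        absent = [resource for resource in required if resource not in limits]
        if absent:
            missing[container["name"]] = absent
    return missing


spec = {
    "containers": [
        {"name": "app", "resources": {"limits": {"memory": "256Mi"}}},
        {"name": "envoy-sidecar", "resources": {"requests": {"memory": "256Mi"}}},
    ]
}
violations = containers_missing_limits(spec)
print(violations)
```

Because validating admission runs after mutation, a rule like this sees injected sidecars as well as the containers you declared.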
8. Summary + Next Experiments
Pod drift is a silent killer of deployment peace. The three pitfalls we covered—misaligned resource limits, silent admission controller overrides, and stale PDBs—are common but often overlooked. By understanding how they work, you can detect and prevent them before they cause outages.
Here are three specific actions to take this week:
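- Run a diff between your deployment manifests and the actual running pods for a critical service. `kubectl diff` compares a manifest with the live object of the same kind, so also compare the Deployment's pod template with the output of `kubectl get pod -o yaml` for one of its pods, or use a tool like kube-diff. Note any discrepancies and trace them to their source.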
- Audit all mutating webhooks in your cluster. For each, document what fields it modifies and whether it sets resource limits on sidecars. If a webhook is not essential, disable it.
- Review PDBs for deployments that have changed replica counts in the last month. Update `minAvailable` or `maxUnavailable` to match current replicas. Test rolling updates with a dry-run.
Deployment peace isn't about perfect configuration—it's about knowing that what you declared is what actually runs. With these checks, you'll catch drift early and keep your pods where they belong.