Stop Losing Sleep Over Orchestration: 6 Container Pitfalls to Fix Now

Why Your Container Orchestration Is Keeping You Up at Night

Container orchestration platforms like Kubernetes and Docker Swarm promise automated deployment, scaling, and management of containerized applications. Yet for many teams, the reality is a series of late-night incidents, mysterious crashes, and performance degradation. The promise of “set it and forget it” quickly becomes a burden of constant troubleshooting. You’ve probably experienced it: a deployment that worked in staging breaks in production, pods restarting without explanation, or a sudden spike in resource usage that brings the cluster to its knees. These issues are not random; they stem from common, avoidable pitfalls in how we configure and operate orchestration systems.

The Hidden Cost of Misconfiguration

Consider a typical scenario: a team launches a microservices architecture on Kubernetes. They skip setting resource requests and limits, assuming the cluster has enough capacity. During a traffic spike, a memory-hungry service consumes all available RAM on its node. The node comes under memory pressure, the kubelet starts evicting pods, and the kernel OOM killer may terminate processes, in the worst case leaving the node unresponsive. The result? Cascading failures across dependent services, lost revenue, and a frantic 2 a.m. rollback. This is not a hypothetical; it’s a pattern that repeats across organizations of all sizes. According to industry surveys, resource misallocation is among the top causes of container-related incidents, leading to both performance issues and unnecessary cloud costs.

Why Traditional Monitoring Falls Short

Traditional monitoring tools designed for virtual machines often fail in container environments. Containers are ephemeral, and their metrics change rapidly. Without proper instrumentation, you might miss early warning signs like gradual memory leaks or throttled CPU. Many teams rely on default dashboards that show cluster-level averages, hiding individual pod problems. This creates a false sense of security until a pod repeatedly fails health checks and gets evicted. The solution is not just better monitoring, but a shift in how you think about observability for dynamic workloads.

This article addresses six critical pitfalls that can turn your orchestration dream into a nightmare. Each section explains the mistake, why it happens, and how to fix it with concrete steps. By the end, you’ll have a clear action plan to stabilize your clusters and regain your nights. Let’s start with the foundation: resource management.

Pitfall #1: Ignoring Resource Requests and Limits

The most common and damaging mistake in container orchestration is failing to set CPU and memory requests and limits. Without these constraints, containers can consume as many resources as they want, leading to node instability, unfair scheduling, and unpredictable performance. This pitfall is especially dangerous in multi-tenant clusters where one noisy neighbor can starve others. Let’s explore why this happens and how to fix it.

What Are Requests and Limits?

In Kubernetes, a “request” is the minimum amount of a resource that a container is guaranteed. The scheduler uses requests to decide which node can fit the pod. A “limit” is the maximum a container can use; if it exceeds this, it may be throttled (CPU) or terminated (memory). Without setting these values, the scheduler has no guidance, and containers can burst uncontrollably. For example, a Java application with a default heap size of 2 GB might start using 4 GB under load, starving other pods on the same node. Setting a memory limit of 2 GB would cause the kernel OOM killer to terminate the container, but at least the node remains stable.

How to Set Appropriate Values

Setting requests and limits requires understanding your application’s resource profile. Start by running the container in isolation with realistic traffic and monitoring its peak CPU and memory usage. Tools like `kubectl top` or Prometheus can help. For production, a reasonable starting point is to set requests to the 95th percentile of observed usage and limits to 1.5x to 2x that value. For example, if your app’s 95th-percentile CPU usage is 500m, set requests to 500m and limits to between 750m and 1 CPU. This gives headroom for spikes while preventing runaway consumption. Remember that the two resources behave differently at the limit: CPU above the limit is throttled, but a container that exceeds its memory limit is killed. So set memory limits generously but realistically.
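The sizing rule above can be sketched in a few lines of Python. This is a minimal illustration, not a sizing tool: the sample values are made up, `suggest_resources` is a hypothetical helper name, and the 1.5x factor is simply the lower end of the range mentioned above.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that covers pct% of the data."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def suggest_resources(cpu_millicores, memory_mib, limit_factor=1.5):
    """Build a Kubernetes-style resources block from observed usage samples."""
    cpu_request = percentile(cpu_millicores, 95)
    mem_request = percentile(memory_mib, 95)
    return {
        "requests": {"cpu": f"{cpu_request}m", "memory": f"{mem_request}Mi"},
        "limits": {
            "cpu": f"{math.ceil(cpu_request * limit_factor)}m",
            "memory": f"{math.ceil(mem_request * limit_factor)}Mi",
        },
    }


# Hypothetical usage samples, e.g. exported from Prometheus or `kubectl top`.
cpu_samples = [220, 250, 310, 280, 500, 260, 240, 300, 270, 290]
mem_samples = [410, 420, 450, 430, 440, 512, 425, 435, 445, 415]

resources = suggest_resources(cpu_samples, mem_samples)
print(resources)
```

With so few samples the 95th percentile is just the maximum; in practice you would feed in days or weeks of measurements so that a single outlier does not dominate the result.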

Common Mistakes and Mitigations

One mistake is setting limits too low, causing frequent OOM kills. Another is setting requests too high, wasting resources. A balanced approach is to use Vertical Pod Autoscaler (VPA) to recommend values based on historical usage. VPA can also apply its recommendations automatically, but depending on the update mode this can mean evicting and recreating pods, so it requires careful testing to avoid service disruptions. For stateful applications, consider using Guaranteed QoS class by setting requests equal to limits for both CPU and memory in every container. Guaranteed pods are the last candidates for eviction under node resource pressure, although they are still killed if they exceed their own memory limit. For batch jobs, you might set lower requests to allow overcommitment and higher limits for bursts.
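To see which QoS class a given configuration lands in, the classification rules can be approximated as follows. This is a simplified sketch: `qos_class` is a hypothetical helper, it looks only at CPU and memory, and it compares quantities as plain strings (so `"1"` and `"1000m"` would count as different), which the real API server does not do.

```python
def qos_class(containers):
    """Approximate a pod's QoS class from its containers' resources blocks."""
    any_set = False
    guaranteed = True
    for resources in containers:
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        if requests or limits:
            any_set = True
        for name in ("cpu", "memory"):
            limit = limits.get(name)
            # Kubernetes defaults a missing request to the limit when a limit is set.
            request = requests.get(name, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


stateful = [{"requests": {"cpu": "500m", "memory": "1Gi"},
             "limits": {"cpu": "500m", "memory": "1Gi"}}]
batch = [{"requests": {"cpu": "100m", "memory": "128Mi"},
          "limits": {"cpu": "1", "memory": "512Mi"}}]

print(qos_class(stateful), qos_class(batch), qos_class([{}]))
```

Note that a pod is Guaranteed only if every container qualifies; a single sidecar without limits makes the whole pod Burstable.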

In summary, resource requests and limits are not optional. They are essential for cluster stability, fair scheduling, and cost control. Take the time to profile your applications and configure these values properly. Your future self—and your on-call team—will thank you.

Pitfall #2: Neglecting Liveness and Readiness Probes

Health checks are the eyes and ears of your orchestration system, yet many teams skip or misconfigure them. Liveness probes tell Kubernetes whether a container is still healthy or needs to be restarted; readiness probes tell it whether the container can serve traffic. Without proper probes, Kubernetes cannot react to failures, leading to downtime and degraded user experience. This section explains how to implement probes effectively and common traps to avoid.

Why Probes Matter

Imagine a web application that becomes unresponsive due to a deadlock. Without a liveness probe, Kubernetes assumes the container is healthy because the process is still running. Users see timeouts, but the orchestrator does nothing. With a liveness probe that checks an HTTP endpoint, Kubernetes detects the failure and restarts the container, restoring service. Similarly, a readiness probe prevents a newly started pod from receiving traffic until it has fully initialized, avoiding “503 Service Unavailable” errors during rolling updates. These probes are not optional; they are the foundation of self-healing.

How to Configure Probes Correctly

For a typical HTTP service, use an HTTP GET probe on a dedicated health endpoint (e.g., `/healthz`). This endpoint should perform a lightweight check of dependencies (database, cache) without doing heavy work. Avoid using the main application endpoint, as it might return a 200 even if the app is degraded. For example, if your app returns 200 for `/` but the database is down, a readiness probe on `/healthz` should return 503. Set the initial delay to account for startup time (e.g., 10 seconds for a Java app). Set period seconds to 10 for fast detection, and failure threshold to 3 to avoid flapping.
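A health endpoint of this kind usually boils down to running a few cheap checks and mapping the result to a status code. The sketch below shows only that mapping, assuming hypothetical check functions; wiring it to a `/healthz` route depends on your web framework and is left out.

```python
import json


def health_response(checks):
    """Run named dependency checks and return (http_status, json_body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failed dependency.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, json.dumps({"ready": status == 200, "checks": results})


def database_ok():
    # Hypothetical check that simulates a database outage.
    raise ConnectionError("database unreachable")


status, body = health_response({"database": database_ok, "cache": lambda: True})
print(status, body)
```

Keep each check fast and bounded by a short timeout, since the probe itself has a timeout and a slow health endpoint looks the same as a failing one.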

Common Misconfigurations

A frequent mistake is using the same endpoint for both liveness and readiness. This can cause unnecessary restarts during brief slowdowns. For instance, if a container is temporarily overloaded, the liveness probe might fail, causing a restart that makes things worse. Instead, use a liveness probe that only checks process health (e.g., a TCP socket check) and a readiness probe that checks application readiness. Another mistake is setting the initial delay too short, causing probes to fail before the app is ready, leading to a crash loop. Always test probes in staging with realistic load.
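As an illustration of separating the two probes, the container spec below uses a TCP check for liveness and the HTTP health endpoint for readiness. It is written as a Python dict and printed as JSON; the image name and port are placeholders, and the timing values are the ones suggested earlier rather than tuned numbers.

```python
import json

container = {
    "name": "web",
    "image": "registry.example.com/web:1.0",  # placeholder image
    "ports": [{"containerPort": 8080}],
    # Liveness: only "is the process accepting connections?"
    "livenessProbe": {
        "tcpSocket": {"port": 8080},
        "initialDelaySeconds": 10,
        "periodSeconds": 10,
        "failureThreshold": 3,
    },
    # Readiness: "can this instance serve traffic right now?"
    "readinessProbe": {
        "httpGet": {"path": "/healthz", "port": 8080},
        "initialDelaySeconds": 10,
        "periodSeconds": 10,
        "failureThreshold": 3,
    },
}


def seconds_to_act(probe):
    """Rough time of continuous failure before Kubernetes acts (ignores timeouts)."""
    return probe["periodSeconds"] * probe["failureThreshold"]


print(json.dumps(container, indent=2))
print("liveness reacts after about", seconds_to_act(container["livenessProbe"]), "seconds")
```

With these values a hung container is restarted after roughly 30 seconds of failed checks, while a briefly overloaded one is only taken out of rotation by the readiness probe.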

Probes are your first line of defense against silent failures. Invest time in configuring them correctly, and you’ll reduce mean time to recovery (MTTR) significantly. Your cluster will automatically heal, and you can sleep better knowing that common failures are handled without manual intervention.

Pitfall #3: Mismanaging Secrets and Configurations

Containerized applications need access to sensitive data like API keys, passwords, and certificates. A common pitfall is storing these secrets in plaintext inside container images or environment variables. This practice creates security vulnerabilities and makes rotation difficult. This section covers how to manage secrets securely using orchestration-native tools and best practices.

The Risks of Hard-Coded Secrets

If you bake a secret into a Docker image, anyone with access to the image can extract it. Even worse, if the image is pushed to a public registry, the secret is exposed globally. Environment variables are slightly better but still visible in the container runtime configuration and logs. A compromised container could leak these values. For example, a developer might hardcode a database password in a deployment YAML file committed to a Git repository. This password then persists in version history, accessible to anyone with repo access. The consequences can be severe, including data breaches and compliance violations.

Using Kubernetes Secrets

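Kubernetes provides a `Secret` resource to store sensitive data. Secret values are only base64-encoded, not encrypted, and by default they are stored unencrypted in etcd. Enable encryption at rest on the API server (an `EncryptionConfiguration`, ideally backed by a KMS provider) and restrict access to Secrets with RBAC. If you keep manifests in Git, a sealed secrets controller lets you commit only encrypted versions. Mount secrets as files or inject them as environment variables. File mounts are generally preferred: the kubelet eventually refreshes mounted files when the secret changes (except for `subPath` mounts), whereas environment variables are fixed when the container starts and only change after a restart. The application still has to re-read the files to pick up new values. For example, create a secret called `db-credentials` with keys `username` and `password` (for instance with `kubectl create secret generic`), then mount it at `/etc/secrets`. Your application reads the files at startup.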
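The `db-credentials` example can be sketched as follows. The credential values, container name, and image are placeholders; the point of the last lines is that base64 is trivially reversible, which is why it must not be mistaken for encryption.

```python
import base64
import json


def make_secret(name, values):
    """Build a Secret manifest; values are base64-encoded as the API expects."""
    data = {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in values.items()
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": data,
    }


# Placeholder credentials for illustration only.
secret = make_secret("db-credentials", {"username": "app", "password": "change-me"})

# Pod spec fragment that mounts the secret as files under /etc/secrets.
pod_spec = {
    "volumes": [{"name": "db-credentials", "secret": {"secretName": "db-credentials"}}],
    "containers": [{
        "name": "app",
        "image": "registry.example.com/app:1.0",  # placeholder image
        "volumeMounts": [
            {"name": "db-credentials", "mountPath": "/etc/secrets", "readOnly": True}
        ],
    }],
}

print(json.dumps(secret, indent=2))

# Anyone who can read the object can decode it again.
decoded = base64.b64decode(secret["data"]["password"]).decode("utf-8")
print(decoded)
```

Inside the container, the application would then read `/etc/secrets/username` and `/etc/secrets/password` as ordinary files.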

Best Practices for Secret Management

Use a secrets management tool like HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault integrated with your orchestrator via sidecar containers or CSI drivers. These tools provide dynamic secrets, audit logging, and automatic rotation. For Kubernetes, consider using External Secrets Operator to sync secrets from external stores into Kubernetes Secrets. Avoid storing plaintext secrets in version control; use tools like helm-secrets or SOPS (originally developed at Mozilla) to encrypt them before committing. Implement regular rotation policies and ensure that secrets are scoped to the minimum necessary permissions.

Proper secret management reduces the blast radius of a compromise and simplifies compliance with regulations like GDPR or SOC 2. Treat secrets as critical infrastructure, and use dedicated tools to handle them. Your security team will appreciate the effort.

Pitfall #4: Overlooking Network Policies and Security Contexts

Container orchestration platforms assume a flat network by default, meaning any pod can communicate with any other pod. This is convenient for development but dangerous in production. Without network policies, a compromised container can attack other services or exfiltrate data. Similarly, containers often run with unnecessary privileges, increasing the risk of host compromise. This section explains how to implement defense in depth.

The Need for Network Segmentation

In a microservices architecture, you typically have public-facing services, internal APIs, and databases. Ideally, only the frontend should talk to the backend, and only the backend should talk to the database. Without network policies, any pod can reach the database directly. If a frontend pod is compromised, an attacker can access sensitive data. Kubernetes Network Policies allow you to define ingress and egress rules using labels and port numbers. For example, allow traffic from `app: frontend` to `app: backend` on port 8080, and deny everything else.

Implementing Network Policies

Start by defining a default deny-all policy for the namespace. Then create policies that explicitly allow required traffic. Use pod selectors and namespace selectors to scope rules. For example, a policy that allows ingress to database pods only from backend pods with label `role: api`. Test policies in staging because overly restrictive rules can break connectivity; a deny-all egress policy, for example, also blocks DNS lookups unless you allow them explicitly. Keep in mind that network policies only take effect when the cluster’s network plugin enforces them. Plugins like Calico, Cilium, or Weave Net do, and some add capabilities beyond the standard API, such as cluster-wide policies.
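A default-deny policy plus one explicit allow rule might look like the sketch below, built as Python dicts and printed as JSON. The namespace, the `app: database` label, and port 5432 are assumptions made for the example; only the `role: api` label comes from the text above.

```python
import json

NAMESPACE = "shop"  # hypothetical namespace

# An empty podSelector matches every pod in the namespace.
default_deny = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "NetworkPolicy",
    "metadata": {"name": "default-deny-all", "namespace": NAMESPACE},
    "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
}

# Allow only pods labelled role=api to reach the database pods.
allow_api_to_db = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "NetworkPolicy",
    "metadata": {"name": "allow-api-to-database", "namespace": NAMESPACE},
    "spec": {
        "podSelector": {"matchLabels": {"app": "database"}},
        "policyTypes": ["Ingress"],
        "ingress": [{
            "from": [{"podSelector": {"matchLabels": {"role": "api"}}}],
            "ports": [{"protocol": "TCP", "port": 5432}],
        }],
    },
}

for policy in (default_deny, allow_api_to_db):
    print(json.dumps(policy, indent=2))
```

Because the deny-all policy also covers egress, the backend pods in this sketch would additionally need an egress rule toward the database and toward cluster DNS before anything works end to end.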

Securing Containers with Security Contexts

Security contexts control what a container can do. By default, containers run as root inside the container (though not on the host if user namespaces are used). Still, it’s safer to run as a non-root user. Set `runAsUser: 1000` and `runAsGroup: 3000` in the pod spec, and ensure the base image has a user with matching ID. Also, drop all Linux capabilities with `capabilities: drop: ["ALL"]` and add only those needed (e.g., `NET_BIND_SERVICE` for binding to privileged ports). Set `readOnlyRootFilesystem: true` to prevent writes to the container filesystem, reducing the attack surface for malware.
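The settings above can be collected into a pod-level and a container-level security context, along with a small check that flags missing hardening. This is an illustrative sketch: `hardening_gaps` is a hypothetical helper, and `runAsNonRoot` and `allowPrivilegeEscalation` are two further standard fields added here that the text does not mention.

```python
import json

pod_security_context = {"runAsNonRoot": True, "runAsUser": 1000, "runAsGroup": 3000}

container_security_context = {
    "allowPrivilegeEscalation": False,
    "readOnlyRootFilesystem": True,
    "capabilities": {"drop": ["ALL"], "add": ["NET_BIND_SERVICE"]},
}


def hardening_gaps(pod_ctx, container_ctx):
    """Return a list of hardening settings that are missing or disabled."""
    gaps = []
    # runAsUser 0 is root, so a falsy value here does not count as hardened.
    if not pod_ctx.get("runAsNonRoot") and not pod_ctx.get("runAsUser"):
        gaps.append("may run as root")
    if "ALL" not in container_ctx.get("capabilities", {}).get("drop", []):
        gaps.append("capabilities not dropped")
    if not container_ctx.get("readOnlyRootFilesystem"):
        gaps.append("root filesystem is writable")
    if container_ctx.get("allowPrivilegeEscalation", True):
        gaps.append("privilege escalation allowed")
    return gaps


print(json.dumps({"pod": pod_security_context, "container": container_security_context}, indent=2))
print(hardening_gaps(pod_security_context, container_security_context))
print(hardening_gaps({}, {}))
```

Applications that need scratch space with a read-only root filesystem typically get an `emptyDir` volume mounted at the path they write to.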

Security is not a one-time setup; it requires continuous review. Regularly audit your policies and security contexts using tools like kube-bench, kube-hunter, or commercial scanners. By layering network policies and security contexts, you create multiple barriers against attackers, making your cluster more resilient to breaches.

Pitfall #5: Underestimating Storage and State Management

Containers are ephemeral by design, but many applications require persistent storage. A common pitfall is assuming that container storage is reliable without proper configuration. This leads to data loss, corruption, and performance issues. This section covers how to manage stateful workloads in container orchestration.

Challenges with Stateful Containers

When a pod is rescheduled to a different node, its local storage disappears. For databases, message queues, or file storage, this is unacceptable. Kubernetes addresses this with PersistentVolumes (PV) and PersistentVolumeClaims (PVC). However, misconfiguring storage classes, access modes, or reclaim policies can cause problems. For example, a ReadWriteOnce volume can be mounted read-write by only one node at a time, so a multi-replica deployment that shares a single such claim leaves replicas scheduled on other nodes stuck, unable to attach the volume. Similarly, setting the reclaim policy to Delete can accidentally wipe important data when a PVC is deleted.
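The two misconfigurations mentioned above can be caught with a simple review of the manifests. In this sketch the provisioner, names, and size are placeholders, and `claim_warnings` is a hypothetical helper, not part of any Kubernetes tooling.

```python
import json

storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "db-retain"},
    "provisioner": "csi.example.com",  # placeholder CSI driver name
    "reclaimPolicy": "Retain",  # keep the volume when the claim is deleted
}

claim = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "orders-db-data"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "storageClassName": "db-retain",
        "resources": {"requests": {"storage": "20Gi"}},
    },
}


def claim_warnings(claim, storage_class, replicas_sharing_claim):
    """Flag the access-mode and reclaim-policy problems described above."""
    warnings = []
    modes = claim["spec"]["accessModes"]
    shared_modes = {"ReadWriteMany", "ReadOnlyMany"}
    if replicas_sharing_claim > 1 and not shared_modes.intersection(modes):
        warnings.append("claim cannot be mounted from several nodes at once")
    # Dynamically provisioned volumes default to Delete when nothing is set.
    if storage_class.get("reclaimPolicy", "Delete") == "Delete":
        warnings.append("deleting the claim also deletes the data")
    return warnings


print(json.dumps(claim, indent=2))
print(claim_warnings(claim, storage_class, replicas_sharing_claim=3))
```

For replicated stateful workloads the usual answer is not a shared claim at all but one claim per replica, which is what a StatefulSet's volume claim templates provide.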

Choosing the Right Storage Solution

Evaluate your workload’s requirements: throughput, IOPS, latency, and durability. For databases, use block storage like AWS EBS, GCE Persistent Disk, or Azure Disk with a CSI driver. These provide consistent performance and snapshots. For shared file access, use network file systems like AWS EFS or Azure Files, but be aware of performance limitations. For high-performance workloads, consider local SSDs with node-level affinity, but accept the risk of data loss on node failure unless you implement replication. Use StatefulSets for stateful applications, which provide stable network identities and ordered deployment.

Data Protection and Backup

Even with persistent storage, you need backups. Use volume snapshots via CSI drivers to create point-in-time copies. Automate backup schedules with tools like Velero (formerly Heptio Ark). Test restores regularly to ensure they work. Consider cross-region replication for disaster recovery. For databases, use native replication mechanisms (e.g., MySQL replication, MongoDB replica sets) alongside storage-level backups. Also, implement pod disruption budgets to prevent all replicas from being down simultaneously during maintenance.
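Two of the objects mentioned above, a pod disruption budget and a CSI volume snapshot, are sketched below as Python dicts printed as JSON. All names, labels, and the snapshot class are placeholders, and the snapshot assumes the cluster has the CSI snapshot CRDs and a snapshot-capable driver installed.

```python
import json

# Keep at least two database replicas running during voluntary disruptions
# such as node drains.
disruption_budget = {
    "apiVersion": "policy/v1",
    "kind": "PodDisruptionBudget",
    "metadata": {"name": "orders-db-pdb"},
    "spec": {
        "minAvailable": 2,
        "selector": {"matchLabels": {"app": "orders-db"}},
    },
}

# Point-in-time snapshot of an existing claim.
volume_snapshot = {
    "apiVersion": "snapshot.storage.k8s.io/v1",
    "kind": "VolumeSnapshot",
    "metadata": {"name": "orders-db-data-snap"},
    "spec": {
        "volumeSnapshotClassName": "csi-snapshots",  # placeholder class name
        "source": {"persistentVolumeClaimName": "orders-db-data"},
    },
}

for manifest in (disruption_budget, volume_snapshot):
    print(json.dumps(manifest, indent=2))
```

A budget only protects against voluntary disruptions; it does nothing for node crashes, which is why replication and tested restores are still needed.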

Stateful workloads require careful planning. Don’t treat them like stateless microservices. Invest in proper storage configuration, backup strategies, and monitoring to avoid data disasters. Your users depend on the data being safe and available.

Pitfall #6: Skipping Observability and Logging

Without proper observability, you’re flying blind. Many teams deploy containers without centralized logging, metrics, or tracing, making it nearly impossible to diagnose issues. This pitfall turns small problems into prolonged outages. This section explains how to build a comprehensive observability stack.

The Three Pillars of Observability

Logs, metrics, and traces each serve a different purpose. Logs provide detailed records of events, metrics give aggregated data over time, and traces show the flow of requests across services. In a containerized environment, logs are ephemeral: container logs live on the node and are lost when the pod is deleted or the node goes away, and after a restart only the previous container’s logs are kept. Centralized logging with a tool like Loki, Elasticsearch, or Splunk ensures logs are persisted and searchable. Metrics from Prometheus can alert on conditions like high CPU or error rates. Distributed tracing with Jaeger or Zipkin helps pinpoint latency bottlenecks in microservices.

Setting Up a Basic Stack

Start with Prometheus for metrics and Grafana for dashboards. Deploy the Prometheus Operator, which automates configuration. Use ServiceMonitor resources (or PodMonitor resources for pods without a Service) to select scrape targets by label. For logs, deploy Fluentd or Fluent Bit as a DaemonSet to collect logs from each node and forward them to a backend like Elasticsearch or Loki. For traces, instrument your applications with OpenTelemetry libraries and export to Jaeger or Tempo. Ensure that logs include structured fields (e.g., JSON format) for easier querying.
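Structured logging needs very little code. The following sketch uses only the Python standard library; the logger name and the `order_id` and `latency_ms` fields are made up, and passing extra fields under a `fields` key is a convention chosen for this example rather than a `logging` feature.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge structured fields passed via extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


stream = io.StringIO()  # in a container you would log to sys.stdout instead
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("order placed", extra={"fields": {"order_id": "A-1001", "latency_ms": 42}})

line = stream.getvalue().strip()
print(line)
```

Writing one JSON object per line to standard output lets the node-level collector parse fields without custom regular expressions.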

Common Mistakes

A common mistake is relying on default dashboards that show cluster-level metrics. Instead, create dashboards per service (e.g., request rate, error rate, latency). Another mistake is not setting up alerts for important signals like pod restarts, high memory usage, or error rate spikes. Use alerting rules in Prometheus and route them to PagerDuty or Slack. Also, avoid logging sensitive information like passwords, as logs may be accessible to many team members. Implement log rotation and retention policies to manage costs.
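As an example of alerting on pod restarts, the rule below is expressed as a Python dict and printed as JSON. It assumes kube-state-metrics is installed (it exports `kube_pod_container_status_restarts_total`) and that rules are loaded through the Prometheus Operator's `PrometheusRule` resource; the threshold, time windows, names, and severity label are arbitrary starting points.

```python
import json

restart_alert = {
    "alert": "PodRestartingFrequently",
    # More than 3 container restarts within 15 minutes.
    "expr": "increase(kube_pod_container_status_restarts_total[15m]) > 3",
    "for": "5m",
    "labels": {"severity": "page"},
    "annotations": {
        "summary": "Container {{ $labels.container }} in pod "
                   "{{ $labels.namespace }}/{{ $labels.pod }} keeps restarting",
    },
}

prometheus_rule = {
    "apiVersion": "monitoring.coreos.com/v1",
    "kind": "PrometheusRule",
    "metadata": {"name": "workload-alerts"},
    "spec": {"groups": [{"name": "workload.rules", "rules": [restart_alert]}]},
}

print(json.dumps(prometheus_rule, indent=2))
```

Routing to PagerDuty or Slack is configured separately in Alertmanager, usually keyed on labels such as `severity`.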

Observability is not optional. It’s the only way to understand what your cluster is doing, especially at scale. Invest in the tooling and practices from day one, and you’ll reduce time to resolution and improve reliability.

Frequently Asked Questions About Container Orchestration Pitfalls

This section addresses common questions that arise when teams try to fix the pitfalls described above. Understanding these nuances will help you apply the solutions more effectively.

Q1: Should I set resource limits equal to requests?

For critical services, setting requests equal to limits for both CPU and memory (Guaranteed QoS) makes the pod one of the last candidates for eviction under node resource pressure. However, this can waste resources if the application rarely peaks. For less critical batch jobs, you can set lower requests and higher limits (Burstable QoS) to allow overcommitment. The trade-off is a higher risk of eviction. For development clusters, you might omit requests and limits entirely (BestEffort QoS), but such pods are the first to be evicted, so this is not recommended for production. Evaluate each workload’s criticality and resource profile.

Q2: How do I test liveness probes without causing downtime?

Test probes in a staging environment that mirrors production. Start with generous initial delays and failure thresholds, then tighten them once you have seen how the app behaves under load. Monitor the number of restarts and adjust. Use the `kubectl describe pod` command to see probe failures in the pod’s events. For HTTP probes, create a dedicated health endpoint that returns 200 only when the app is truly ready. Avoid using the main application endpoint because it might hide issues. Consider using startup probes for applications with slow initialization, which delay liveness and readiness probes until startup is complete.
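For the startup probe, the main decision is how long the application may take to start; the probe allows roughly `periodSeconds` times `failureThreshold` seconds before the container is restarted. A small sketch, with `startup_probe` as a hypothetical helper and the path and port as placeholders:

```python
import json
import math


def startup_probe(max_startup_seconds, period_seconds=10, path="/healthz", port=8080):
    """Build a startupProbe that tolerates up to max_startup_seconds of startup."""
    failure_threshold = math.ceil(max_startup_seconds / period_seconds)
    return {
        "httpGet": {"path": path, "port": port},
        "periodSeconds": period_seconds,
        "failureThreshold": failure_threshold,
    }


# A slow-starting app that may need up to two minutes to initialize.
probe = startup_probe(120)
print(json.dumps(probe, indent=2))
```

Once the startup probe has succeeded, the liveness and readiness probes take over, so they can keep short delays and thresholds without killing the app while it boots.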

Q3: What is the best way to rotate secrets?

If you use Kubernetes Secrets, rotation means updating the secret object and, for pods that consume it through environment variables or read it only at startup, restarting those pods. Tools like Reloader can watch for secret changes and trigger rolling updates automatically. External secret stores like Vault can issue dynamic secrets with short lifetimes and automatically update them in the container via sidecar containers. For database credentials, consider using a proxy sidecar that handles authentication and rotation transparently. Always test rotation procedures in staging to ensure zero downtime.

Q4: Can I use the same network policy for all namespaces?

Network policies are namespace-scoped, so you need to define them per namespace. However, you can create a default deny policy in every namespace using a script or policy-as-code tool like Kyverno or OPA/Gatekeeper. For cluster-wide policies, some CNI plugins (e.g., Calico, Cilium) support cluster-wide network policies that can be applied globally. Use them sparingly to avoid overly permissive rules.

Q5: How do I back up persistent volumes?

Use volume snapshots via CSI drivers for point-in-time backups. For databases, use native backup tools (e.g., mysqldump, pg_dump) and store the backup in object storage. Velero can automate backup of both Kubernetes resources and persistent volumes. Schedule backups regularly and test restores. For critical data, replicate across regions. Remember that snapshots are not a substitute for application-level backups; they only capture the disk state at a point in time.

Building a Reliable Orchestration Strategy: Next Steps

By now, you’ve learned about six common container orchestration pitfalls and their fixes. But knowledge alone is not enough; you need a plan to implement these changes systematically. This final section provides a roadmap to transform your cluster from a source of stress to a well-oiled machine.

Prioritize and Roll Out Incrementally

Start with the highest-impact fix: set resource requests and limits for all production workloads. Use VPA to get recommendations if you lack historical data. Next, add liveness and readiness probes to critical services. Then, audit your secrets and implement a secure management process. After that, introduce network policies and security contexts. Finally, set up observability and storage best practices. Do not attempt to change everything at once; each change carries risk. Roll out in a staging cluster first, and use canary deployments to validate.

Create Runbooks and Train Your Team

Document the procedures for each fix, including how to verify success and how to roll back if something goes wrong. Create runbooks for common incident scenarios (e.g., pod crash loop, high memory usage). Conduct regular training sessions so that all team members understand the new configurations. Encourage a culture of blameless postmortems to learn from incidents without finger-pointing. Use infrastructure-as-code tools like Terraform or Pulumi to manage cluster resources declaratively, making changes auditable and reproducible.

Continuous Improvement

Orchestration is not a one-time project. As your applications evolve, so should your configurations. Regularly review resource usage, adjust limits, and update security policies. Keep up with Kubernetes releases and deprecations. Participate in community forums and read official documentation. Consider adopting service meshes like Istio for advanced traffic management and security, but only if your team has the expertise to manage the additional complexity.

The goal is not to eliminate all incidents but to reduce their frequency and impact. With the practices outlined in this guide, you’ll be able to sleep better knowing that your clusters are stable, secure, and observable. Start with one pitfall today and build momentum. Your future self will thank you.

About the Author

This article was prepared by the editorial team for this publication. We focus on practical explanations and update articles when major practices change.

Last reviewed: May 2026
