Kubernetes Rightsizing Is a Trust Problem, Not Just a Metrics Problem
Kubernetes rightsizing works when platform teams combine resource data with rollout discipline, clear failure signals, and a path application teams can trust.
Every Kubernetes platform eventually reaches the same awkward moment. The dashboards show that many workloads request far more CPU and memory than they usually consume. Finance wants the cloud bill down. Platform engineers see wasted node capacity. Application teams hear something else: "We are taking away the buffer that kept production alive."
That is why rightsizing fails so often. Not because teams lack Prometheus, Grafana, Kubecost, VPA recommendations, or spreadsheets. It fails because resource changes are treated like a math exercise when they are really a trust exercise. A developer who has been paged for a midnight restart will not be convinced by a single p95 memory chart.

Figure: Rightsizing works best when teams move from noisy utilization signals to staged, observable changes. Panels: Dashboard (average memory usage), Runtime (exit code last week), Team (confidence in the change).
The Optimization That Sounds Like a Threat
Requests and limits are not just cost controls. In Kubernetes, requests influence scheduling, while limits are enforced by the kubelet and the kernel. The Kubernetes resource management docs are explicit about this distinction: requests reserve capacity for placement decisions, while memory limits are enforced reactively through OOM kills. That difference matters. Lowering a memory request usually changes placement and density. Lowering a memory limit can change whether a workload survives a spike.
This is where many rightsizing projects go wrong. They start with a report that says "service A requests 20 GiB but usually uses 8 GiB." That may be true and still incomplete. Maybe the service has a weekly import job. Maybe startup warms a cache. Maybe a JVM heap grows differently after a release. Maybe the application was killed by the kernel, evicted by node pressure, restarted by a probe, or terminated by a human during an incident. Those are different stories with the same superficial symptom: the pod restarted.
Memory shape, not just memory average
The first useful stance is this: do not debate whether developers are "overprovisioning." Ask what evidence would make a smaller setting feel safe. That usually means restart history, p95 and p99 usage, deploy-time behavior, queue depth, latency, garbage collection pauses, node pressure events, and a rollback path that is boring enough to use under stress.
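To make the gap between average and tail usage concrete, here is a minimal Python sketch. The sample values are invented for illustration (a service that sits near 8 GiB with a short import spike), and `percentile` and `summarize` are hypothetical helpers, not part of any monitoring tool.

```python
import math
import statistics

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(samples_gib):
    """Summarize memory samples (GiB) so the tail is visible next to the mean."""
    return {
        "mean": round(statistics.mean(samples_gib), 2),
        "p95": percentile(samples_gib, 95),
        "p99": percentile(samples_gib, 99),
        "max": max(samples_gib),
    }

# Hypothetical samples: 95 readings of steady serving, 5 readings during an import.
samples = [8.0] * 95 + [14.0] * 5
summary = summarize(samples)
print(summary)
```

With these made-up numbers the mean and the p95 both sit close to 8 GiB, while the p99 and the max expose the 14 GiB spike. That is the kind of shape a single average or p95 chart hides.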
Exit Code 137 Is Not a Diagnosis
Exit code 137 often gets translated to "OOMKilled" in hallway conversations. Sometimes that is correct. Sometimes it is not. The number only says the process died from SIGKILL (128 + signal 9); it does not say who sent the signal. A process can receive SIGKILL for multiple reasons, and Kubernetes state should be checked before the team rewrites resource policy around a false conclusion. Look at lastState.terminated.reason, events, node pressure, probe failures, rollout timing, and application logs around the same timestamp.
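# Full pod detail: container states, restart counts, and recent events
kubectl describe pod checkout-api-7f8c9
# Last terminated state per container (reason, exit code, timestamps)
kubectl get pod checkout-api-7f8c9 -o jsonpath='{.status.containerStatuses[*].lastState}'
# Namespace events in time order, to line up evictions and probe failures
kubectl get events --sort-by=.lastTimestamp
# Current per-container usage (requires the Metrics API, e.g. metrics-server)
kubectl top pod checkout-api-7f8c9 --containers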
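As a rough illustration of why the exit code alone is not enough, the sketch below sorts a container's last termination into coarse buckets. It assumes the input is one entry of `.status.containerStatuses` from `kubectl get pod <name> -o json`, already parsed into a dict. The function name and the bucket labels are hypothetical, and the buckets are a starting point for investigation, not a verdict.

```python
def classify_last_termination(container_status):
    """Return a coarse, hypothetical label for a container's last termination."""
    terminated = container_status.get("lastState", {}).get("terminated")
    if not terminated:
        return "no-previous-termination"

    reason = terminated.get("reason", "")
    exit_code = terminated.get("exitCode")

    if reason == "OOMKilled":
        return "oom-kill"
    if exit_code == 137:
        # SIGKILL without an OOMKilled reason: check events, probes, and node pressure.
        return "sigkill-cause-unknown"
    if exit_code == 143:
        # 128 + 15: the process ended on SIGTERM.
        return "sigterm"
    if exit_code == 0:
        return "clean-exit"
    return "application-error"

# Hypothetical status: exit code 137, but the recorded reason is not OOMKilled.
status = {
    "name": "checkout-api",
    "restartCount": 2,
    "lastState": {"terminated": {"exitCode": 137, "reason": "Error"}},
}
label = classify_last_termination(status)
print(label)
```

The point of the separate `sigkill-cause-unknown` bucket is to force a second look at events and probe history before anyone raises a memory limit.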
The second useful stance is that rightsizing should be staged. Lower requests first when the data supports it, because requests affect scheduling and bin packing. Be more careful with limits, especially memory limits, because they define the boundary where the kernel may end the process. For many latency-sensitive services, a CPU limit can also create throttling that looks like an application performance problem.
Autoscaling Helps, But It Does Not Remove Judgment
The Horizontal Pod Autoscaler can scale deployments and stateful sets based on CPU, memory, custom metrics, or external metrics. KEDA extends that model with event-driven triggers: it handles activation from and to zero itself and exposes trigger metrics to the HPA, which drives the 1-to-N phase. These are good tools. They are not a substitute for understanding the workload.
Recent research on predictive autoscaling for Node.js on Kubernetes makes the point well: CPU-based HPA can miss event-loop saturation, and reactive scaling can arrive after latency has already degraded. That does not mean every team needs a research-grade predictive scaler. It means "we have HPA" is not a complete answer if the user-facing symptom is queueing, tail latency, or slow startup.
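The HPA's documented scaling rule is desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), and when several metrics are configured the largest proposal wins. The sketch below applies that rule to show how a CPU-only target can miss saturation. It is a simplification: it ignores pod readiness, missing metrics, and stabilization windows. The event-loop lag metric and its 100 ms target are hypothetical.

```python
import math

def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Core HPA ratio rule, with the default 10% tolerance band around the target."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

def combine(current_replicas, metrics, min_replicas, max_replicas):
    """Take the largest per-metric proposal, then clamp to the replica bounds."""
    proposals = [
        desired_replicas(current_replicas, current, target)
        for current, target in metrics
    ]
    return max(min_replicas, min(max_replicas, max(proposals)))

# 4 replicas at 40% CPU against a 65% target: CPU alone suggests scaling in.
cpu_only = combine(4, [(40, 65)], min_replicas=3, max_replicas=12)

# Same CPU reading, plus a hypothetical event-loop lag of 180 ms against a 100 ms target.
with_lag = combine(4, [(40, 65), (180, 100)], min_replicas=3, max_replicas=12)

print(cpu_only, with_lag)
```

Under these assumed readings, the CPU-only configuration would shrink the deployment while the lag metric asks for roughly double the replicas. Which metric reflects the user-facing symptom is the judgment call the tooling cannot make for you.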
Before lowering a production memory setting
- Check p95 and p99 memory over a full business cycle, not only yesterday.
- Separate startup spikes, batch windows, and steady-state serving traffic.
- Confirm whether previous restarts were OOM kills, probe failures, evictions, or manual actions.
- Canary one deployment or namespace before changing the default template.
- Agree on rollback ownership and the exact metric that triggers rollback.
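One way to make the last item of the checklist concrete is to write the rollback triggers down as data before the canary starts. The sketch below is a hypothetical illustration: the metric names, thresholds, and team names are invented, and treating a missing metric as a breach is a design assumption, not a standard.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RollbackRule:
    metric: str       # name of the signal being watched
    threshold: float  # roll back when the observed value exceeds this
    owner: str        # who is expected to act on the breach

def breached_rules(rules, observed):
    """Return the rules whose observed value exceeds its threshold."""
    hits = []
    for rule in rules:
        value = observed.get(rule.metric)
        # A missing metric counts as a breach: no signal means no evidence of safety.
        if value is None or value > rule.threshold:
            hits.append(rule)
    return hits

# Hypothetical rules agreed before the canary.
rules = [
    RollbackRule("p99_latency_ms", 400, "checkout-team"),
    RollbackRule("oom_kills", 0, "platform-team"),
]

# Hypothetical readings during the canary window.
observed = {"p99_latency_ms": 310, "oom_kills": 1}
hits = breached_rules(rules, observed)
print([(rule.metric, rule.owner) for rule in hits])
```

Keeping the owner next to the threshold means that when a rule fires, nobody has to negotiate during the incident about who rolls back.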
The rightsizing path I trust is boring. Measure for long enough to include real behavior. Propose a smaller request with a visible safety margin. Canary it during a sane window. Watch the signals application owners care about, not only cluster utilization. If the service stays healthy, make the change permanent and update the template. If it fails, roll back and keep the evidence.
resources:
  requests:
    cpu: "500m"      # before: "1000m"
    memory: "10Gi"   # before: "20Gi"
  limits:
    memory: "16Gi"   # keep headroom while confidence grows
---
autoscaling:
  minReplicas: 3
  maxReplicas: 12
  targetCPUUtilizationPercentage: 65
  # Add queue depth, latency, or event-loop metrics when CPU is not the bottleneck.
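A "visible safety margin" is easier to discuss when the arithmetic is explicit. The sketch below shows one possible way to arrive at a 10Gi request like the one above, assuming an observed p99 of 8 GiB and a 25% margin. Both numbers, and the helper name, are assumptions for illustration, not values the example configuration was necessarily derived from.

```python
import math

GIB = 1024 ** 3

def propose_memory_request(p99_bytes, margin=0.25, step_bytes=GIB):
    """Add a safety margin to the p99 and round up to the next step (whole GiB by default)."""
    target = p99_bytes * (1 + margin)
    return math.ceil(target / step_bytes) * step_bytes

# Assumed p99 of 8 GiB with the default 25% margin.
proposal = propose_memory_request(8 * GIB)
proposal_gi = f"{proposal // GIB}Gi"
print(proposal_gi)
```

Rounding up to a whole step keeps the proposed value readable in a manifest and errs on the side of headroom. The margin itself is the part application owners should be invited to argue about.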
The payoff is bigger than a lower cloud bill. Good rightsizing creates a shared language between platform and application teams: what the workload usually needs, what it needs during stress, which signals prove it is safe, and which exceptions are real. That is platform engineering. Not squeezing pods into smaller boxes, but making safer defaults possible without turning every change into a fight.
My opinion is simple: a platform team that cuts requests without building confidence is just moving risk from the bill to the pager. A platform team that rightsizes with evidence, canaries, and rollback discipline gives developers a reason to accept leaner defaults. That is the version of optimization production teams can actually live with.