A VPA robotic arm removes excess material from an oversized grey suit representing over-provisioned GKE Autopilot resource requests, revealing a well-fitted suit sized to actual pod usage. Coins labelled with pound and dollar signs fall into a piggy bank labelled Savings. Labels identify the old requests, actual usage, and the GKE Autopilot right-sizing concept.

GKE Autopilot Right-Sizing: Use VPA to Stop Paying for Resources You Don’t Use

Most teams migrating from GKE Standard to Autopilot discover the billing model has fundamentally changed, and then carry on deploying workloads exactly as before. On Standard, over-provisioned requests are absorbed into node-level billing and largely invisible. On Autopilot, every millicore and mebibyte of pod request is a direct line item. Padding requests “just in case” is no longer a harmless habit; it is a recurring charge with no ceiling.

The fix is not to guess more accurately. Treat resource requests as something you measure and converge on, using the Vertical Pod Autoscaler as your instrument. VPA is enabled by default on all Autopilot clusters; what most teams are missing is a structured workflow to actually use it.

A side-by-side diagram comparing GKE Standard and GKE Autopilot billing 
models in a data centre setting. On the left, multiple container boxes 
sit inside a single transparent node boundary with one shared price tag 
labelled Billing, and excess empty space is labelled Waste absorbed. On 
the right, individual container boxes sit outside any enclosing boundary, 
each with its own price tag, and unused space per container is labelled 
Billed. A caption reads: On Standard, waste hides in node headroom. On 
Autopilot, every request is a line item.

Why the Default Approach Is Expensive

When you deploy to Autopilot without explicit resource requests, GKE injects defaults of 0.5 vCPU and 2 GiB of memory per container. That is conservative enough to run most containers, but almost certainly wrong for production workloads with established usage profiles. More importantly, Autopilot’s CPU-to-memory ratio enforcement can silently inflate your requests further. If you specify only CPU, GKE calculates the missing memory value based on the compute class ratio. If your specified values land outside the allowed ratio (1:1 to 1:6.5 for general-purpose, a hard 1:4 for Scale-Out), Autopilot bumps the lower dimension upward to comply. The adjusted values appear in the running pod spec, and the change is recorded in the autopilot.gke.io/resource-adjustment annotation, but unless you are actively checking, you will not notice.
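The ratio rule is easier to reason about with numbers. The sketch below approximates the general-purpose behaviour described above, using the 1:1 to 1:6.5 range from this article. The function name adjust_to_ratio is hypothetical, and the sketch ignores the minimums and rounding increments that Autopilot also applies, so treat its output as an estimate rather than what the cluster will report.

```python
# Illustrative only: approximates the general-purpose ratio rule.
# Ratios are expressed as GiB of memory per vCPU.
GP_MIN_RATIO = 1.0   # 1:1
GP_MAX_RATIO = 6.5   # 1:6.5


def adjust_to_ratio(cpu_vcpu: float, memory_gib: float,
                    min_ratio: float = GP_MIN_RATIO,
                    max_ratio: float = GP_MAX_RATIO) -> tuple:
    """Return (cpu, memory) after raising the lower dimension into range."""
    ratio = memory_gib / cpu_vcpu
    if ratio < min_ratio:
        # Too little memory for the CPU requested: memory is raised.
        memory_gib = cpu_vcpu * min_ratio
    elif ratio > max_ratio:
        # Too little CPU for the memory requested: CPU is raised.
        cpu_vcpu = memory_gib / max_ratio
    return cpu_vcpu, memory_gib


print(adjust_to_ratio(2.0, 0.5))   # memory raised to 2.0 GiB
print(adjust_to_ratio(0.25, 4.0))  # CPU raised to roughly 0.62 vCPU
print(adjust_to_ratio(1.0, 4.0))   # already compliant, unchanged
```

In both adjusted cases the billed request grows even though the manifest never asked for more, which is why the annotation is worth checking.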

The consequence is straightforward. According to Google’s internal workload research, a significant proportion of GKE workloads request far more than they consume, by an order of magnitude or more in the worst cases. On Standard, this waste sits idle in node headroom. On Autopilot, you pay for it continuously.
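To make the standing cost concrete, here is a rough estimate of what idle requests cost per month. Everything numeric in it is an assumption: the two prices are placeholders rather than real Autopilot rates, and the usage figures are invented for illustration. Substitute the current rates for your region and your own measured usage.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def monthly_waste(requested_vcpu: float, used_vcpu: float,
                  requested_gib: float, used_gib: float,
                  replicas: int,
                  price_vcpu_hour: float, price_gib_hour: float) -> float:
    """Monthly cost of the requested-but-unused CPU and memory."""
    idle_vcpu = max(requested_vcpu - used_vcpu, 0.0)
    idle_gib = max(requested_gib - used_gib, 0.0)
    hourly = idle_vcpu * price_vcpu_hour + idle_gib * price_gib_hour
    return hourly * replicas * HOURS_PER_MONTH


# Autopilot defaults (0.5 vCPU, 2 GiB) against invented usage figures.
waste = monthly_waste(
    requested_vcpu=0.5, used_vcpu=0.05,
    requested_gib=2.0, used_gib=0.25,
    replicas=20,
    price_vcpu_hour=0.04,   # placeholder, not a real price
    price_gib_hour=0.005,   # placeholder, not a real price
)
print(f"Estimated idle spend per month: {waste:.2f}")
```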

This is the same class of problem addressed in FinOps thinking applied at the infrastructure layer: cost is a product of allocation decisions made at deployment time, not just runtime behaviour.

The Three-Phase Right-Sizing Workflow

Phase 1: Deploy with explicit, verified requests

Before any VPA involvement, deploy with deliberately set requests and confirm Autopilot has not mutated them. Use server-side dry-run to preview what Autopilot will actually apply:

kubectl apply -f deployment.yaml --dry-run=server -o yaml | grep -A 10 resources

If the output differs from your manifest, Autopilot has adjusted values to satisfy ratio or minimum constraints. Fix the manifest so it round-trips cleanly. On bursting clusters (GKE 1.32.3+), the general-purpose minimum is 50m CPU / 52 MiB memory, with increments of 1m. Older clusters enforce a 250m CPU minimum, so check your cluster version before calibrating lower bounds.
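Comparing the two by eye is error-prone because Kubernetes may print the same quantity in a different form (1Gi versus 1024Mi, for example). A minimal sketch of a comparison that normalises quantities first follows. The helper names parse_quantity and find_mutations are hypothetical, the suffix table covers only a common subset, and the applied values are made up to stand in for dry-run output.

```python
from decimal import Decimal

# Common Kubernetes quantity suffixes (subset; not exhaustive).
_SUFFIXES = {
    "Ki": Decimal(2) ** 10, "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30, "Ti": Decimal(2) ** 40,
    "m": Decimal("0.001"),
    "k": Decimal(10) ** 3, "M": Decimal(10) ** 6, "G": Decimal(10) ** 9,
}


def parse_quantity(q: str) -> Decimal:
    """Convert '250m', '512Mi' or '0.5' to a plain number (cores or bytes)."""
    # Check two-character suffixes before one-character ones.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if q.endswith(suffix):
            return Decimal(q[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(q)


def find_mutations(declared: dict, applied: dict) -> dict:
    """Return {resource: (declared, applied)} for each request that differs."""
    changed = {}
    for name in sorted(set(declared) | set(applied)):
        before, after = declared.get(name), applied.get(name)
        if (before is None or after is None
                or parse_quantity(before) != parse_quantity(after)):
            changed[name] = (before, after)
    return changed


declared = {"cpu": "100m", "memory": "1Gi"}     # from the manifest
applied = {"cpu": "0.25", "memory": "1024Mi"}   # hypothetical dry-run output
print(find_mutations(declared, applied))        # {'cpu': ('100m', '0.25')}
```

Memory is reported as unchanged because 1Gi and 1024Mi are the same quantity; only the CPU request is flagged.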

Set memory limits equal to memory requests. CPU can safely be left without a limit or set higher than the request. The distinction matters because the two resources fail differently. Memory cannot be throttled: a container that bursts beyond what the node can reclaim is OOMKilled, and pods with limits greater than requests run at Burstable QoS, so they are evicted before Guaranteed pods under node memory pressure. CPU bursting is safe because the container is throttled and degrades gracefully rather than being killed.

Phase 2: Observe with VPA in Off mode

Deploy a VPA object targeting each Deployment in recommendation-only mode:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"          # recommendations only; no evictions
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      controlledResources: [cpu, memory]
      minAllowed:
        cpu: 50m               # set to Autopilot's minimum for your compute class
        memory: 128Mi
      maxAllowed:
        cpu: 4
        memory: 8Gi

The VPA recommender uses an approximately eight-day rolling window of historical data. Wait seven to fourteen days before acting on recommendations; this ensures weekly traffic patterns are captured and the LowConfidence condition has cleared.

Read recommendations with:

# Summary across all VPAs in a namespace
kubectl get vpa -n production

# Full recommendation fields (target, lowerBound, upperBound, uncappedTarget)
kubectl describe vpa my-app-vpa -n production

The target field is the primary signal. lowerBound and upperBound define the envelope outside which VPA will trigger evictions in Auto mode, so understanding these thresholds before enabling Auto is essential.
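One way to use the envelope before enabling Auto is to check whether your current request already sits inside it. The sketch below does this for CPU against a hand-written sample shaped like one entry of status.recommendation.containerRecommendations. The sample values and the helper names are hypothetical, and the real updater considers more than this single comparison, so read the result as a first indication only.

```python
def cpu_to_cores(q: str) -> float:
    """Convert a CPU quantity such as '250m' or '1' to cores."""
    return float(q[:-1]) / 1000 if q.endswith("m") else float(q)


def outside_bounds(current_cpu: str, recommendation: dict) -> bool:
    """True when the current CPU request is outside [lowerBound, upperBound]."""
    current = cpu_to_cores(current_cpu)
    lower = cpu_to_cores(recommendation["lowerBound"]["cpu"])
    upper = cpu_to_cores(recommendation["upperBound"]["cpu"])
    return current < lower or current > upper


# Hypothetical values, shaped like one containerRecommendations entry.
sample = {
    "containerName": "my-app",
    "lowerBound": {"cpu": "60m", "memory": "180Mi"},
    "target": {"cpu": "90m", "memory": "256Mi"},
    "upperBound": {"cpu": "300m", "memory": "512Mi"},
}

print(outside_bounds("500m", sample))  # True: above upperBound
print(outside_bounds("100m", sample))  # False: inside the envelope
```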

You can also surface recommendations without deploying any VPA object at all, via the Cloud Monitoring metrics kubernetes.io/autoscaler/container/cpu/per_replica_recommended_request_cores and kubernetes.io/autoscaler/container/memory/per_replica_recommended_request_bytes. The GKE Console’s Workloads view surfaces these as suggested values alongside historical usage graphs once a workload is at least twenty-four hours old.

Phase 3: Tune requests and optionally enable VPA Auto

Apply the target values from the VPA recommendation to your Deployment manifest, rounding up slightly for a safety margin. Verify that the resulting CPU:memory ratio is within Autopilot’s allowed range for your compute class before applying.
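The rounding and the ratio check can be done in one pass. The following sketch assumes a 15% margin, which is an arbitrary illustrative choice, and uses the general-purpose 1:1 to 1:6.5 range quoted earlier. The function names are hypothetical, and integer arithmetic is used so that rounding up is exact.

```python
def padded_request(target_millicores: int, target_mib: int,
                   margin_pct: int = 15) -> tuple:
    """Round VPA target values up by margin_pct (whole millicores and MiB)."""
    cpu_m = (target_millicores * (100 + margin_pct) + 99) // 100
    mem_mib = (target_mib * (100 + margin_pct) + 99) // 100
    return cpu_m, mem_mib


def ratio_ok(cpu_m: int, mem_mib: int,
             min_ratio: float = 1.0, max_ratio: float = 6.5) -> bool:
    """Check GiB-per-vCPU against the allowed range (general-purpose default)."""
    gib_per_vcpu = (mem_mib / 1024) / (cpu_m / 1000)
    return min_ratio <= gib_per_vcpu <= max_ratio


cpu_m, mem_mib = padded_request(200, 400)
print(cpu_m, mem_mib, ratio_ok(cpu_m, mem_mib))   # 230 460 True

cpu_m2, mem_mib2 = padded_request(1000, 256)
print(cpu_m2, mem_mib2, ratio_ok(cpu_m2, mem_mib2))  # 1150 295 False
```

The second case is the one to watch for: a CPU-heavy recommendation whose memory would be raised by Autopilot to satisfy the ratio, so the manifest should state the higher memory value explicitly.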

For workloads where pod eviction during resizing is acceptable, you can enable Auto mode. This mirrors the approach used in Cloud Build pipeline optimisation: measure first, then automate the adjustment. Keep the minAllowed and maxAllowed bounds from Phase 2 in place to prevent VPA from over-correcting, and change only the update policy:

  updatePolicy:
    updateMode: "Auto"
    minReplicas: 2             # VPA evicts only while at least this many replicas are alive

Always pair Auto mode with a PodDisruptionBudget that allows at least one eviction. A PDB with maxUnavailable: 0, or with minAvailable set to the total replica count, blocks every voluntary eviction, so it stalls GKE node maintenance as well as VPA updates.
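Whether a given PDB leaves room for an eviction is simple arithmetic, sketched below for the case where all replicas are healthy. The function name is hypothetical and only integer values are handled; percentage values for minAvailable or maxUnavailable are deliberately left out of this sketch.

```python
from typing import Optional


def allowed_disruptions(replicas: int,
                        min_available: Optional[int] = None,
                        max_unavailable: Optional[int] = None) -> int:
    """Evictions a PDB permits when every replica is healthy (integers only)."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if max_unavailable is not None:
        return max(min(max_unavailable, replicas), 0)
    return max(replicas - min_available, 0)


print(allowed_disruptions(3, max_unavailable=0))  # 0: blocks every eviction
print(allowed_disruptions(3, min_available=3))    # 0: blocks every eviction
print(allowed_disruptions(3, min_available=2))    # 1: VPA can proceed
```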

A three-panel workflow diagram titled "GKE Autopilot Right-Sizing: Three-Phase 
VPA Workflow." Phase 1, Deploy and Verify, shows a document with a green 
checkmark and the instruction to set explicit requests and check for mutations. 
Phase 2, Observe, shows a magnifying glass over a line graph with a VPA cog 
icon and the instruction to run VPA in Off mode for 7 to 14 days. Phase 3, 
Tune and Automate, shows a gauge moving from red to green with an AUTO toggle 
switch and the instruction to apply target values and optionally enable Auto. 
A caption reads: Measure first. Automate second.

Enterprise Considerations

The most consequential operational pitfall on Autopilot is running VPA and HPA simultaneously on the same metrics. When both target CPU or memory, they create a feedback loop: HPA adds pods in response to high utilisation, VPA observes lower per-pod usage and shrinks requests, HPA calculates higher utilisation and adds more pods. Use VPA on CPU and memory combined with HPA on custom or external metrics (queue depth, request rate). If you need combined horizontal and vertical scaling on the same signals, use GKE’s MultidimPodAutoscaler (MPA, autoscaling.gke.io/v1beta1).
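The feedback loop can be reproduced with a toy model. The sketch below is a deliberate simplification, not how either controller is implemented: it assumes a fixed total load, an HPA that applies the standard desired-replicas formula each round, and a VPA that instantly resets the request to observed per-pod usage plus 15% headroom. All parameter values are invented for illustration.

```python
import math


def simulate(total_load_cores: float, replicas: int, request_cores: float,
             hpa_target: float = 0.5, vpa_headroom: float = 1.15,
             max_replicas: int = 50, rounds: int = 6) -> list:
    """Return (replicas, request) after each HPA-then-VPA round."""
    history = []
    for _ in range(rounds):
        per_pod_usage = total_load_cores / replicas
        utilisation = per_pod_usage / request_cores
        # HPA: desired = ceil(current * currentUtilisation / targetUtilisation)
        desired = math.ceil(replicas * utilisation / hpa_target)
        replicas = min(max_replicas, max(1, desired))
        # Toy VPA: request tracks the new per-pod usage plus headroom.
        request_cores = (total_load_cores / replicas) * vpa_headroom
        history.append((replicas, round(request_cores, 3)))
    return history


history = simulate(total_load_cores=4.0, replicas=4, request_cores=1.0)
print([r for r, _ in history])  # [8, 14, 25, 44, 50, 50]
```

Because the toy VPA always leaves utilisation near 87%, well above the 50% HPA target, the replica count climbs every round until it hits max_replicas while the load never changes.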

Compute class selection is a separate right-sizing lever that teams frequently ignore. Scale-Out is worth evaluating for any horizontally-scaled stateless workload: its SMT-disabled architecture delivers better single-thread performance per vCPU and, for batch, pairing it with Spot scheduling reduces costs substantially. This complements the platform-level cost work described in GCP Committed Use Discount analysis. Apply compute class selection via the cloud.google.com/compute-class nodeSelector, and note that workload separation rules (node selectors and tolerations) trigger higher minimum resource requirements; preview these with --dry-run=server before committing.

For teams running init containers, verify that each one has explicit resource requests. If init container requests are absent or set to zero, Autopilot assigns each one resources equal to the sum of all application container requests, inflating the effective pod request substantially.

Alternative Approaches

In-place pod resizing via InPlaceOrRecreate VPA mode (Public Preview, GKE 1.34.0+) reduces disruption by attempting to resize without eviction, falling back to recreate only when necessary. This is promising for stateful workloads but carries an Autopilot-specific caveat: even with a resizePolicy whose restartPolicy is NotRequired, Autopilot may still evict pods to enforce minimum resources or ratio constraints. Treat it as a useful reduction in eviction frequency rather than a way to eliminate evictions.

Key Takeaways

On Autopilot, pod requests are billing units. Every unreviewed default or padded estimate is a standing cost. Deploying VPA in Off mode costs nothing and starts collecting the data behind its recommendations immediately. The operational window between deploying VPA and having production-ready request values is two to three weeks, after which you have data-driven requests rather than guesses. Auto mode is optional; the measurement benefit of Off mode is not.

Useful Links