GKE Autopilot Right-Sizing: VPA Configuration Guide

Most teams migrating from GKE Standard to Autopilot discover the billing model has fundamentally changed, and then carry on deploying workloads exactly as before. On Standard, over-provisioned requests are absorbed into node-level billing and largely invisible. On Autopilot, every millicore and mebibyte of pod request is a direct line item. Padding requests “just in case” is no longer a harmless habit; it is a recurring charge with no ceiling.

The fix is not to guess more accurately. Treat resource requests as something you measure and converge on, using the Vertical Pod Autoscaler as your instrument. VPA is enabled by default on all Autopilot clusters; what most teams are missing is a structured workflow to actually use it.

Figure: A side-by-side diagram comparing GKE Standard and GKE Autopilot billing models in a data centre setting. On the left, multiple container boxes sit inside a single transparent node boundary with one shared price tag labelled Billing, and excess empty space is labelled Waste absorbed. On the right, individual container boxes sit outside any enclosing boundary, each with its own price tag, and unused space per container is labelled Billed. Caption: On Standard, waste hides in node headroom. On Autopilot, every request is a line item.

Why the Default Approach Is Expensive

When you deploy to Autopilot without explicit resource requests, GKE injects defaults of 0.5 vCPU and 2 GiB of memory per container. That is conservative enough to run most containers, but almost certainly wrong for production workloads with established usage profiles. More importantly, Autopilot’s CPU-to-memory ratio enforcement can silently inflate your requests further. If you specify only CPU, GKE calculates the missing memory value based on the compute class ratio. If your specified values land outside the allowed ratio (1:1 to 1:6.5 for general-purpose, a hard 1:4 for Scale-Out), Autopilot bumps the lower dimension upward to comply. The adjusted values appear in the running pod spec under the autopilot.gke.io/resource-adjustment annotation, but unless you are actively checking, you will not notice.
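
To make the ratio rule concrete, here is a minimal Python sketch of the adjustment described above. It is a simplified model, not Autopilot's actual admission logic: it assumes ratios are expressed as GiB of memory per vCPU, models only the ratio rule (minimums and rounding increments are ignored), and the function and constant names are hypothetical.

```python
# Simplified model of Autopilot's CPU:memory ratio enforcement.
# Assumption: only the ratio rule is modelled; resource minimums and
# rounding increments are deliberately left out.

GENERAL_PURPOSE = (1.0, 6.5)   # allowed GiB per vCPU, from the text: 1:1 to 1:6.5
SCALE_OUT = (4.0, 4.0)         # fixed 1:4


def adjust_for_ratio(cpu_vcpu, memory_gib, ratio_range=GENERAL_PURPOSE):
    """Return (cpu, memory) after bumping the lower dimension upward."""
    if cpu_vcpu <= 0 or memory_gib <= 0:
        raise ValueError("cpu and memory must be positive")
    min_ratio, max_ratio = ratio_range
    ratio = memory_gib / cpu_vcpu
    if ratio < min_ratio:
        # Too little memory for the CPU requested: raise memory.
        memory_gib = cpu_vcpu * min_ratio
    elif ratio > max_ratio:
        # Too little CPU for the memory requested: raise CPU.
        cpu_vcpu = memory_gib / max_ratio
    return cpu_vcpu, memory_gib
```

Under this model, a request of 0.5 vCPU with 6.5 GiB on general-purpose comes back as 1 vCPU with 6.5 GiB: the CPU request, and the bill, doubles without any change to the manifest.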

The consequence is straightforward. According to Google’s internal workload research, a significant proportion of over-provisioned GKE workloads request far more than they consume, by an order of magnitude or more in the worst cases. On Standard, this waste sits idle in node headroom. On Autopilot, you pay for it continuously.

This is the same class of problem addressed in FinOps thinking applied at the infrastructure layer: cost is a product of allocation decisions made at deployment time, not just runtime behaviour.

The Three-Phase Right-Sizing Workflow

Phase 1: Deploy with explicit, verified requests

Before any VPA involvement, deploy with deliberately set requests and confirm Autopilot has not mutated them. Use server-side dry-run to preview what Autopilot will actually apply:

kubectl apply -f deployment.yaml --dry-run=server -o yaml | grep -A 10 resources

If the output differs from your manifest, Autopilot has adjusted values to satisfy ratio or minimum constraints. Fix the manifest so it round-trips cleanly. On bursting clusters (GKE 1.32.3+), the general-purpose minimum is 50m CPU / 52 MiB memory, with increments of 1m. Older clusters enforce a 250m CPU minimum, so check your cluster version before calibrating lower bounds.
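
Comparing the dry-run output with the manifest by eye is error-prone, because the same quantity can be written in different units. The Python sketch below compares two requests maps by value. It is an illustration only: the helper names are hypothetical, the inputs are assumed to be dicts already parsed from the two YAML documents, and the parser handles only a common subset of Kubernetes quantity suffixes.

```python
from decimal import Decimal

# Subset of Kubernetes quantity suffixes (assumption: enough for typical
# CPU and memory requests; exponent forms and larger suffixes are omitted).
_SUFFIXES = {
    "m": Decimal("0.001"),
    "k": Decimal(10) ** 3, "M": Decimal(10) ** 6, "G": Decimal(10) ** 9,
    "Ki": Decimal(2) ** 10, "Mi": Decimal(2) ** 20, "Gi": Decimal(2) ** 30,
}


def parse_quantity(text):
    """Convert a quantity such as '250m' or '512Mi' to a Decimal base value."""
    text = str(text).strip()
    # Check two-character suffixes first so 'Mi' is not mistaken for 'M'.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)


def find_mutations(declared, applied):
    """Return {resource: (declared, applied)} for requests whose values differ."""
    changed = {}
    for name in sorted(set(declared) | set(applied)):
        before, after = declared.get(name), applied.get(name)
        if before is None or after is None or parse_quantity(before) != parse_quantity(after):
            changed[name] = (before, after)
    return changed
```

An empty result means the manifest round-trips cleanly; any entry shows the value you declared next to the value Autopilot would apply.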

Set memory limits equal to memory requests. CPU can safely be left without a limit or set higher than the request. The distinction matters: pods with limits greater than requests run at Burstable QoS and are evicted first under node memory pressure. Memory bursting beyond what the node can reclaim triggers OOMKill. CPU bursting is safe because it degrades gracefully rather than killing the process.
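
These Phase 1 rules are easy to check mechanically before a manifest is applied. The following Python sketch lints a single container spec against them. The function name is hypothetical, the input is assumed to be a dict parsed from the container entry in the manifest, and memory values are compared as strings, so it assumes the request and limit are written in the same unit.

```python
def check_container_resources(container):
    """Return a list of problems for one container spec (a dict parsed from YAML)."""
    resources = container.get("resources", {})
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    problems = []

    # Explicit requests avoid Autopilot's injected defaults.
    for name in ("cpu", "memory"):
        if name not in requests:
            problems.append(f"missing {name} request")

    # Memory limit should equal the memory request; CPU limits are not checked.
    if "memory" not in limits:
        problems.append("missing memory limit")
    elif "memory" in requests and limits["memory"] != requests["memory"]:
        problems.append("memory limit differs from memory request")
    return problems
```

A container that passes returns an empty list, which makes the function simple to wire into a pre-commit hook or CI step.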

Phase 2: Observe with VPA in Off mode

Deploy a VPA object targeting each Deployment in recommendation-only mode:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"          # recommendations only, no evictions
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      controlledResources: [cpu, memory]
      minAllowed:
        cpu: 50m               # set to Autopilot's minimum for your compute class
        memory: 128Mi
      maxAllowed:
        cpu: 4
        memory: 8Gi

The VPA recommender uses an approximately eight-day rolling window of historical data. Wait seven to fourteen days before acting on recommendations: this ensures weekly traffic patterns are captured and the LowConfidence condition has cleared.

Read recommendations with:

# Summary across all VPAs in a namespace
kubectl get vpa -n production

# Full recommendation fields (target, lowerBound, upperBound, uncappedTarget)
kubectl describe vpa my-app-vpa -n production

The target field is the primary signal. lowerBound and upperBound define the envelope outside which VPA will trigger evictions in Auto mode; understanding these thresholds before enabling Auto is essential.
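
To see where a current request sits relative to that envelope, the recommendation can be read from the VPA object's status. The Python sketch below works on a dict such as the one produced by parsing `kubectl get vpa my-app-vpa -n production -o json`. The helper names are hypothetical, only CPU is handled, and the check is a simplified reading of the envelope rather than the updater's full eviction logic.

```python
def to_millicores(cpu):
    """Convert a CPU quantity such as '250m' or '0.25' to integer millicores."""
    cpu = str(cpu)
    if cpu.endswith("m"):
        return int(cpu[:-1])
    return round(float(cpu) * 1000)


def cpu_outside_envelope(vpa, container_name, current_cpu):
    """True if current_cpu lies outside [lowerBound, upperBound] for the container."""
    recommendations = vpa["status"]["recommendation"]["containerRecommendations"]
    for rec in recommendations:
        if rec["containerName"] == container_name:
            low = to_millicores(rec["lowerBound"]["cpu"])
            high = to_millicores(rec["upperBound"]["cpu"])
            return not (low <= to_millicores(current_cpu) <= high)
    raise KeyError(f"no recommendation for container {container_name!r}")
```

A True result for a workload you plan to move to Auto mode is a signal that pods would likely be replaced soon after the switch.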

You can also surface recommendations without deploying any VPA object at all, via Cloud Monitoring metrics autoscaler/recommended_per_replica_request_cores and autoscaler/recommended_per_replica_request_bytes. The GKE Console’s Workloads view surfaces these as suggested values alongside historical usage graphs once a workload is at least twenty-four hours old.

Phase 3: Tune requests and optionally enable VPA Auto

Apply the target values from the VPA recommendation to your Deployment manifest, rounding up slightly for safety margin. Verify the resulting CPU:memory ratio is within Autopilot’s allowed range for your compute class before applying.
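
The rounding and ratio check can be scripted so that every workload gets the same treatment. This Python sketch assumes CPU in millicores and memory in MiB as plain integers, a default 10% margin chosen purely for illustration, and the general-purpose ratio range quoted earlier; the function names are hypothetical.

```python
def padded_request(target, margin_percent=10, step=1):
    """Add a percentage margin to target and round up to a multiple of step.

    Integer arithmetic is used throughout to avoid floating-point surprises.
    """
    scaled = target * (100 + margin_percent)
    return -(-scaled // (100 * step)) * step   # ceiling division


def ratio_ok(cpu_millicores, memory_mib, min_ratio=1.0, max_ratio=6.5):
    """Check GiB-per-vCPU against the allowed range for the compute class."""
    gib_per_vcpu = (memory_mib / 1024) / (cpu_millicores / 1000)
    return min_ratio <= gib_per_vcpu <= max_ratio
```

For example, a CPU target of 137m becomes a 151m request with the default margin. If `ratio_ok` returns False for the padded pair, raise the lower dimension yourself so the choice is deliberate rather than left to Autopilot.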

For workloads where pod eviction during resizing is acceptable, you can enable Auto mode. This mirrors the approach used in Cloud Build pipeline optimisation: measure first, then automate the adjustment. Keep the minAllowed and maxAllowed bounds from Phase 2 so VPA cannot over-correct, and set minReplicas in the update policy:

  updatePolicy:
    updateMode: "Auto"
    minReplicas: 2             # VPA will not evict below this replica count

Always pair Auto mode with a PodDisruptionBudget that allows at least one eviction. A PDB with maxUnavailable: 0 or minAvailable set to the total replica count will block all GKE maintenance, not just VPA evictions.
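
The two blocking configurations named above can be detected before Auto mode is enabled. The Python sketch below takes the spec of a PodDisruptionBudget as a dict plus the Deployment's replica count. The function name is hypothetical, and it handles only integer values and the exact percentages 0% and 100%; other percentages depend on rounding and are left out.

```python
def pdb_blocks_all_evictions(pdb_spec, replicas):
    """True if the PDB leaves no room for a single voluntary eviction."""
    if "maxUnavailable" in pdb_spec:
        return str(pdb_spec["maxUnavailable"]) in ("0", "0%")
    if "minAvailable" in pdb_spec:
        value = str(pdb_spec["minAvailable"])
        if value.endswith("%"):
            if value == "100%":
                return True
            raise ValueError("percentages other than 100% are not handled in this sketch")
        # Requiring every replica to stay available forbids any eviction.
        return int(value) >= replicas
    return False
```

With three replicas, `minAvailable: 2` passes the check while `minAvailable: 3` is flagged.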

Figure: A three-panel workflow diagram titled "GKE Autopilot Right-Sizing: Three-Phase VPA Workflow." Phase 1, Deploy and Verify, shows a document with a green checkmark and the instruction to set explicit requests and check for mutations. Phase 2, Observe, shows a magnifying glass over a line graph with a VPA cog icon and the instruction to run VPA in Off mode for 7 to 14 days. Phase 3, Tune and Automate, shows a gauge moving from red to green with an AUTO toggle switch and the instruction to apply target values and optionally enable Auto. Caption: Measure first. Automate second.

Enterprise Considerations

The most consequential operational pitfall on Autopilot is running VPA and HPA simultaneously on the same metrics. When both target CPU or memory, they create a feedback loop: HPA adds pods in response to high utilisation, VPA observes lower per-pod usage and shrinks requests, HPA calculates higher utilisation and adds more pods. Use VPA on CPU and memory combined with HPA on custom or external metrics (queue depth, request rate). If you need combined horizontal and vertical scaling on the same signals, use GKE’s MultidimPodAutoscaler (MPA, autoscaling.gke.io/v1beta1).
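
A quick audit can flag this conflict across a cluster. The Python sketch below compares one VPA object with one HorizontalPodAutoscaler (autoscaling/v2), both as dicts parsed from their manifests, and reports the resources they both act on. The function name is hypothetical, and it assumes that a VPA with no controlledResources set controls both CPU and memory.

```python
def overlapping_resources(vpa, hpa):
    """Return the sorted resource names that both autoscalers act on."""
    vpa_target = vpa["spec"]["targetRef"]
    hpa_target = hpa["spec"]["scaleTargetRef"]
    if (vpa_target["kind"], vpa_target["name"]) != (hpa_target["kind"], hpa_target["name"]):
        return []   # different workloads, no conflict

    controlled = set()
    policies = vpa["spec"].get("resourcePolicy", {}).get("containerPolicies", [])
    for policy in policies:
        controlled.update(policy.get("controlledResources", ["cpu", "memory"]))
    if not policies:
        controlled = {"cpu", "memory"}   # assumed default when no policy is set

    scaled_on = set()
    for metric in hpa["spec"].get("metrics", []):
        if metric["type"] == "Resource":
            scaled_on.add(metric["resource"]["name"])
        elif metric["type"] == "ContainerResource":
            scaled_on.add(metric["containerResource"]["name"])
    return sorted(controlled & scaled_on)
```

An empty list means the HPA scales only on custom or external metrics for that workload, which is the combination recommended above.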

Compute class selection is a separate right-sizing lever that teams frequently ignore. Scale-Out is worth evaluating for any horizontally-scaled stateless workload: its SMT-disabled architecture delivers better single-thread performance per vCPU and, for batch, pairing it with Spot scheduling reduces costs substantially. This complements the platform-level cost work described in GCP Committed Use Discount analysis. Apply compute class selection via the cloud.google.com/compute-class nodeSelector, and note that workload separation (node selectors with tolerations) triggers higher minimum resource requirements; preview these with --dry-run=server before committing.

For teams running init containers, verify that each one has explicit resource requests. If init container requests are absent or set to zero, Autopilot assigns each one resources equal to the sum of all application container requests, inflating the effective pod request substantially.

Alternative Approaches

In-place pod resizing via InPlaceOrRecreate VPA mode (Public Preview, GKE 1.34.0+) reduces disruption by attempting to resize without eviction, falling back to recreate only when necessary. This is promising for stateful workloads but carries an Autopilot-specific caveat: even with resizePolicy: NotRequired, Autopilot may still evict pods to enforce minimum resources or ratio constraints. Treat it as a useful reduction in eviction frequency rather than elimination.

Key Takeaways

On Autopilot, pod requests are billing units. Every unreviewed default or padded estimate is a standing cost. Deploying VPA in Off mode costs nothing and starts gathering the usage data behind its recommendations immediately. The operational window between deploying VPA and having production-ready request values is two to three weeks, after which you have data-driven requests rather than guesses. Auto mode is optional; the measurement benefit of Off mode is not.
