
GPU Memory Optimization
Available in v1.30.9+

By default, vLLM reserves ~90% of available GPU memory at startup, regardless of actual usage. Left unmanaged, this reservation prevents ScaleOps from reclaiming unused GPU memory and from bin-packing vLLM workloads efficiently across nodes.

ScaleOps supports vLLM-aware GPU memory optimization with full automation. ScaleOps observes the real GPU memory usage of each vLLM instance and automatically manages its GPU memory allocation — ensuring each workload gets exactly what it needs, no more.

How It Works

Visibility — Real GPU Memory Usage

ScaleOps collects the real GPU memory usage of each vLLM instance, taking into account:

  • Model weights — the fixed memory footprint of the loaded model
  • KV cache for active inference — memory consumed by currently in-flight requests
  • Prefix cache headroom — memory needed to maintain a high prefix cache hit rate

This real usage data replaces the misleading ~90% allocation figure, giving ScaleOps an accurate picture of actual GPU memory demand.
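
To make the gap concrete, the sketch below adds up the three components listed above for one instance and compares the total with the ~90% startup reservation. It is an illustration only: the class and function names are hypothetical (not part of ScaleOps or vLLM), and the figures (an 80 GiB GPU, 16 GiB of weights, and so on) are made up.

```python
from dataclasses import dataclass


@dataclass
class VllmMemoryUsage:
    """Hypothetical breakdown of one vLLM instance's GPU memory, in GiB."""
    model_weights_gib: float          # fixed footprint of the loaded model
    active_kv_cache_gib: float        # KV cache for in-flight requests
    prefix_cache_headroom_gib: float  # headroom kept for prefix cache hits

    def real_usage_gib(self) -> float:
        # Real usage is modeled here as the plain sum of the three components.
        return (self.model_weights_gib
                + self.active_kv_cache_gib
                + self.prefix_cache_headroom_gib)


def reserved_at_startup_gib(gpu_total_gib: float, fraction: float = 0.90) -> float:
    """Memory reserved at startup; 0.90 mirrors the ~90% default described above."""
    return gpu_total_gib * fraction


# Made-up example values for an 80 GiB GPU.
usage = VllmMemoryUsage(model_weights_gib=16.0,
                        active_kv_cache_gib=6.0,
                        prefix_cache_headroom_gib=4.0)
reserved = reserved_at_startup_gib(80.0)
unused = reserved - usage.real_usage_gib()

print(f"reserved: {reserved:.1f} GiB, real usage: {usage.real_usage_gib():.1f} GiB, "
      f"unused: {unused:.1f} GiB")
```

With these example numbers, most of the reserved memory is never touched, which is the difference the real usage figure is meant to expose.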

Automation — Automatic GPU Memory Management

Once real usage data is available, ScaleOps automatically manages the GPU memory allocation for each vLLM workload. It calculates the optimal allocation from the real usage observed over the configured observation window. New pods receive this allocation from the start; running pods are updated only when the gap between the current and recommended allocation is significant enough to justify a restart. If demand increases and the current allocation becomes insufficient, ScaleOps detects this and increases the recommendation accordingly.
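
The apply-or-wait behavior described above can be pictured with a small sketch. The documentation does not say what counts as a "significant" gap, so the 15% threshold, the function name `plan_action`, and the returned labels are all assumptions made for illustration, not ScaleOps internals.

```python
def plan_action(pod_is_new: bool,
                current_gib: float,
                recommended_gib: float,
                min_gap_fraction: float = 0.15) -> str:
    """Decide how a recommendation would be applied to one pod.

    min_gap_fraction is a hypothetical threshold: the relative gap between
    the current and recommended allocation that would justify a restart.
    """
    if current_gib <= 0:
        raise ValueError("current_gib must be positive")

    if pod_is_new:
        # New pods get the recommended allocation from the start.
        return "apply_at_creation"

    # Relative gap between the current and the recommended allocation.
    gap = abs(recommended_gib - current_gib) / current_gib
    return "restart" if gap >= min_gap_fraction else "keep"


print(plan_action(True, 72.0, 30.0))    # new pod
print(plan_action(False, 72.0, 30.0))   # running pod, large gap
print(plan_action(False, 30.0, 31.0))   # running pod, small gap
```

The point of a threshold of this kind is to avoid restarting a running inference server for a marginal change in allocation.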

Compatibility

Component | Supported versions
vLLM | > v0.7.0 (v0 and v1 engines)
Triton with vLLM backend | Official image versions > v25.10
Workload types | Deployments, StatefulSets, Argo Rollouts, COGs

GitOps Support

vLLM automation can be configured through GitOps using workload annotations, AutomatedNamespace (ANS), and ClusterAutomation Helm values.

Cluster-wide injection — A cluster-level toggle (vllm.observability.enabled, also configurable via Helm values) controls whether ScaleOps collects real GPU memory usage data from vLLM pods. When enabled, ScaleOps begins collecting real memory usage data across all detected vLLM workloads. Plugin injection can also be disabled per-workload via annotation (equivalent to “Exclude from vLLM observability” in the UI).

Workload annotations:

Annotation | Type | Description
scaleops.sh/default-vllm-auto | boolean | Enable or disable vLLM automation for the workload
scaleops.sh/default-vllm-policy | string | Assign a vLLM policy to the workload
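
As a rough illustration of the annotations above, the sketch below builds a patch body that sets both of them on a workload. The annotation keys come from the table; placing them under the workload's `metadata.annotations`, the helper name `vllm_annotation_patch`, and the policy name `my-vllm-policy` are assumptions. Kubernetes annotation values are always strings, so the boolean is written as `"true"` or `"false"`.

```python
import json


def vllm_annotation_patch(automate: bool, policy: str) -> dict:
    """Build a merge-patch style body carrying the two vLLM annotations.

    Assumes the annotations live under the workload's metadata.annotations.
    """
    return {
        "metadata": {
            "annotations": {
                # Annotation values must be strings in Kubernetes.
                "scaleops.sh/default-vllm-auto": "true" if automate else "false",
                "scaleops.sh/default-vllm-policy": policy,
            }
        }
    }


# "my-vllm-policy" is a placeholder policy name.
patch = vllm_annotation_patch(automate=True, policy="my-vllm-policy")
print(json.dumps(patch, indent=2))
```

The same two keys could equally be written straight into the workload manifest kept in Git; the patch form is only one way to show their shape.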

AutomatedNamespace fields:

Field | Type | Description
vllmOptimize | boolean | Enable or disable vLLM optimization for all workloads in the namespace
defaultVllmPolicy | string | Set the default vLLM policy for the namespace

ClusterAutomation Helm fields (see Helm reference):

Field | Type | Description
clusterAutomation.vllm.optimize | boolean | Enable or disable vLLM automation cluster-wide
clusterAutomation.vllm.defaultPolicy | string | Set the default vLLM policy cluster-wide
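
Dotted field names like these conventionally map to nested keys in a Helm values file. The sketch below expands the two fields into that nested structure so the hierarchy is visible. The helper `nest` and the policy name `my-vllm-policy` are illustrative assumptions; the Helm reference remains the authority on the actual values layout.

```python
def nest(dotted: dict) -> dict:
    """Expand {"a.b.c": value} style keys into nested dictionaries."""
    root: dict = {}
    for path, value in dotted.items():
        *parents, leaf = path.split(".")
        node = root
        for key in parents:
            # Walk down, creating intermediate levels as needed.
            node = node.setdefault(key, {})
        node[leaf] = value
    return root


# Field names are from the table above; the values are example settings.
values = nest({
    "clusterAutomation.vllm.optimize": True,
    "clusterAutomation.vllm.defaultPolicy": "my-vllm-policy",
})
print(values)
```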

vLLM Policy

A dedicated vLLM Policy controls the optimization behavior per workload:

Parameter | Description
Observation window | Time window used for history-based recommendations (default: 2 days)
Window coverage | Minimum data coverage required within the observation window
Percentile | Usage percentile to base the recommendation on: 100%, 99%, 95%, 90%, 85%, or 80%
Automation optimization strategy | Controls when recommendations are applied: upon pod creation only, or ongoing (also restarts running pods when the allocation gap is significant)
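
The way the window coverage and percentile parameters could interact is sketched below. This is not ScaleOps' actual algorithm: the nearest-rank percentile method, the 0.8 coverage default, and the function name `recommend_gib` are assumptions chosen to keep the example small.

```python
import math
from typing import Optional, Sequence


def recommend_gib(samples: Sequence[float],
                  expected_samples: int,
                  percentile: int = 95,
                  min_coverage: float = 0.8) -> Optional[float]:
    """Return a percentile-based recommendation, or None if data is too sparse.

    samples          -- usage samples (GiB) collected inside the observation window
    expected_samples -- how many samples a fully covered window would contain
    percentile       -- one of the policy options: 100, 99, 95, 90, 85, 80
    min_coverage     -- hypothetical minimum fraction of the window with data
    """
    if expected_samples <= 0:
        raise ValueError("expected_samples must be positive")
    if not samples or len(samples) / expected_samples < min_coverage:
        return None  # not enough history to recommend anything yet

    ordered = sorted(samples)
    # Nearest-rank percentile: smallest value covering `percentile`% of samples.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


history = [20, 22, 21, 30, 25, 23, 24, 26, 27, 28]  # made-up GiB samples
print(recommend_gib(history, expected_samples=10, percentile=90))
print(recommend_gib(history, expected_samples=10, percentile=100))
print(recommend_gib(history[:5], expected_samples=10, percentile=90))
```

A lower percentile ignores short usage spikes and yields a tighter allocation, while 100% sizes for the highest usage seen in the window.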

Dashboard

vLLM Workloads Overview

Navigate to GPU Rightsizing → vLLM Workloads in the left sidebar for a full view of all vLLM workloads. The page includes:

  • Cost summary — total monthly GPU cost and percentage of wasted spend
  • Over-provisioned workloads — count of workloads with optimization opportunities
  • Automation status — pie chart showing automated vs. un-automated workloads, with an Automate All action
  • GPU Memory over time — aggregated 7-day and 30-day chart showing real usage, optimized memory usage, waste, request, and allocatable
  • Workloads table — per-workload breakdown with savings available, GPU compute and memory requests, vLLM memory usage (real vs. optimized), replicas, assigned policy, and automation toggle


Workload Overview — vLLM Tab

The workload overview page includes a vLLM tab for any vLLM workload. This tab shows:

  • GPU Memory and real vLLM usage over time — real memory usage, optimized memory usage, current request, and original request
