GPU Memory Optimization
Available in v1.30.9+

By default, vLLM reserves about 90% of available GPU memory at startup, regardless of actual usage (its `--gpu-memory-utilization` engine argument defaults to 0.9). Because the reservation hides real demand, ScaleOps cannot reclaim the unused GPU memory or bin-pack vLLM workloads efficiently across nodes.

ScaleOps supports fully automated, vLLM-aware GPU memory optimization. It observes the real GPU memory usage of each vLLM instance and automatically manages its GPU memory allocation, so that each workload gets what it needs and no more.

How It Works

Visibility — Real GPU Memory Usage

ScaleOps collects the real GPU memory usage of each vLLM instance, taking into account:

Model weights — the fixed memory footprint of the loaded model
KV cache for active inference — memory consumed by currently in-flight requests
Prefix cache headroom — memory needed to maintain a high prefix cache hit rate

This real usage data replaces the misleading ~90% allocation figure, giving ScaleOps an accurate picture of actual GPU memory demand.
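As a rough illustration of why the reserved figure overstates demand, the sketch below adds up the three components for one hypothetical instance and compares the total with a 90% reservation. The class name `VllmMemoryUsage`, the helper `reserved_gib`, and all numbers are assumptions made up for this example; it does not describe how ScaleOps measures usage internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VllmMemoryUsage:
    """Hypothetical breakdown of one vLLM instance's GPU memory, in GiB."""

    model_weights_gib: float          # fixed footprint of the loaded model
    active_kv_cache_gib: float        # KV cache for in-flight requests
    prefix_cache_headroom_gib: float  # headroom kept for prefix cache hits

    @property
    def real_usage_gib(self) -> float:
        # Real usage is the sum of the three components listed above.
        return (
            self.model_weights_gib
            + self.active_kv_cache_gib
            + self.prefix_cache_headroom_gib
        )


def reserved_gib(gpu_total_gib: float, utilization: float = 0.9) -> float:
    """Memory reserved at startup when a fixed fraction of the GPU is claimed."""
    return gpu_total_gib * utilization


# Illustrative numbers only: an 80 GiB GPU and a mid-sized model.
usage = VllmMemoryUsage(
    model_weights_gib=16.0,
    active_kv_cache_gib=6.0,
    prefix_cache_headroom_gib=4.0,
)
reserved = reserved_gib(80.0)
idle = reserved - usage.real_usage_gib

print(f"reserved: {reserved:.1f} GiB")
print(f"real usage: {usage.real_usage_gib:.1f} GiB")
print(f"reserved but idle: {idle:.1f} GiB")
```

In this made-up case, most of the reserved memory is never touched, which is the gap the real usage data is meant to expose.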

Automation — Automatic GPU Memory Management

Once real usage data is available, ScaleOps automatically manages the GPU memory allocation for each vLLM workload. It calculates the optimal allocation from the real usage observed over the configured observation window. New pods receive this allocation from the start; running pods are updated only when the gap between the current and recommended allocation is large enough to justify a restart. If demand increases and the current allocation becomes insufficient, ScaleOps detects this and raises the recommendation accordingly.
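The apply-or-restart decision described above can be pictured with a small sketch. The threshold `min_gap_ratio`, the `Action` enum, and the function `plan_action` are hypothetical; the actual criteria ScaleOps uses to decide that a gap is "significant" are not documented here.

```python
from enum import Enum


class Action(Enum):
    APPLY_ON_CREATE = "apply_on_create"  # new pod: start with the recommendation
    RESTART = "restart"                  # running pod: gap justifies a restart
    KEEP = "keep"                        # running pod: gap too small to act on


def plan_action(
    current_gib: float,
    recommended_gib: float,
    pod_is_new: bool,
    min_gap_ratio: float = 0.2,  # assumed threshold, not a ScaleOps default
) -> Action:
    """Decide what to do with one pod given its current and recommended allocation."""
    if pod_is_new:
        return Action.APPLY_ON_CREATE
    if current_gib <= 0:
        raise ValueError("current_gib must be positive")
    # Relative gap, in either direction: over-allocation wastes memory,
    # under-allocation means demand has outgrown the current allocation.
    gap_ratio = abs(current_gib - recommended_gib) / current_gib
    return Action.RESTART if gap_ratio >= min_gap_ratio else Action.KEEP


print(plan_action(72.0, 30.0, pod_is_new=False))  # large over-allocation
print(plan_action(32.0, 30.0, pod_is_new=False))  # small gap
print(plan_action(24.0, 30.0, pod_is_new=False))  # demand grew past allocation
print(plan_action(72.0, 30.0, pod_is_new=True))   # new pod
```

Using a relative rather than absolute gap is only one possible design; it keeps the same threshold meaningful for both small and large allocations.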

Compatibility

Component	Supported versions
vLLM	> v0.7.0 (v0 and v1 engines)
Triton with vLLM backend	Official image versions > v25.10
Workload types	Deployments, StatefulSets, Argo Rollouts, COGs

GitOps Support

vLLM automation can be configured through GitOps using workload annotations, AutomatedNamespace (ANS), and ClusterAutomation Helm values.

Cluster-wide injection: A cluster-level toggle (vllm.observability.enabled, also configurable via Helm values) controls whether ScaleOps collects real GPU memory usage data from vLLM pods. When it is enabled, collection starts for all detected vLLM workloads. Plugin injection can also be disabled per workload via annotation (equivalent to “Exclude from vLLM observability” in the UI).

Workload annotations:

Annotation	Type	Description
`scaleops.sh/default-vllm-auto`	boolean	Enable or disable vLLM automation for the workload
`scaleops.sh/default-vllm-policy`	string	Assign a vLLM policy to the workload
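For illustration, the two annotations can be assembled into a JSON merge patch as sketched below. The policy name `my-vllm-policy` and the helper `vllm_annotation_patch` are hypothetical, and placing the annotations on the workload object's own `metadata` is an assumption. Kubernetes annotation values must be strings, so the boolean is written as `"true"` or `"false"`.

```python
import json


def vllm_annotation_patch(automate: bool, policy: str) -> dict:
    """Build a merge patch that sets both vLLM annotations on a workload."""
    return {
        "metadata": {
            "annotations": {
                # Annotation values are always strings in Kubernetes.
                "scaleops.sh/default-vllm-auto": "true" if automate else "false",
                "scaleops.sh/default-vllm-policy": policy,
            }
        }
    }


# "my-vllm-policy" is a placeholder; use the name of an existing vLLM policy.
patch = vllm_annotation_patch(automate=True, policy="my-vllm-policy")
print(json.dumps(patch, indent=2))
```

The printed JSON could then be committed alongside the workload manifest or passed to `kubectl patch` with `--type merge`.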

AutomatedNamespace fields:

Field	Type	Description
`vllmOptimize`	boolean	Enable or disable vLLM optimization for all workloads in the namespace
`defaultVllmPolicy`	string	Set the default vLLM policy for the namespace

ClusterAutomation Helm fields (see Helm reference):

Field	Type	Description
`clusterAutomation.vllm.optimize`	boolean	Enable or disable vLLM automation cluster-wide
`clusterAutomation.vllm.defaultPolicy`	string	Set the default vLLM policy cluster-wide
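The dotted field names above correspond to nested keys in a Helm values file. The sketch below expands them into that nested shape; the helper `expand_dotted` and the policy name `my-vllm-policy` are hypothetical and only serve the example.

```python
import json


def expand_dotted(flat: dict) -> dict:
    """Turn {"a.b.c": v} style keys into nested dictionaries."""
    nested: dict = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for key in parents:
            # Create intermediate levels on demand.
            node = node.setdefault(key, {})
        node[leaf] = value
    return nested


values = expand_dotted(
    {
        "clusterAutomation.vllm.optimize": True,
        "clusterAutomation.vllm.defaultPolicy": "my-vllm-policy",  # placeholder
    }
)
print(json.dumps(values, indent=2))
```

Since JSON is also valid YAML, the printed structure shows the nesting a values file would need.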

vLLM Policy

A dedicated vLLM Policy controls the optimization behavior per workload:

Parameter	Description
Observation window	Time window used for history-based recommendations (default: 2 days)
Window coverage	Minimum data coverage required within the observation window
Percentile	Usage percentile to base the recommendation on: 100%, 99%, 95%, 90%, 85%, or 80%
Automation optimization strategy	Controls when recommendations are applied: upon pod creation only, or ongoing (also restarts running pods when the allocation gap is significant)
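To show how the percentile and window coverage parameters might interact, here is a minimal sketch of a percentile-based recommendation. It uses the nearest-rank percentile method, which is an assumption; the class `VllmPolicy`, the function `recommend`, the coverage value, and the sample data are all hypothetical and do not reflect the ScaleOps implementation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

ALLOWED_PERCENTILES = (100, 99, 95, 90, 85, 80)


@dataclass(frozen=True)
class VllmPolicy:
    """Hypothetical in-memory model of the policy parameters listed above."""

    percentile: int
    min_coverage: float               # fraction of the window that must have data
    observation_window_days: int = 2  # documented default

    def __post_init__(self) -> None:
        if self.percentile not in ALLOWED_PERCENTILES:
            raise ValueError(f"percentile must be one of {ALLOWED_PERCENTILES}")
        if not 0.0 <= self.min_coverage <= 1.0:
            raise ValueError("min_coverage must be between 0 and 1")


def recommend(
    samples_gib: Sequence[float],
    expected_samples: int,
    policy: VllmPolicy,
) -> Optional[float]:
    """Return the usage percentile, or None when the window lacks coverage."""
    if expected_samples <= 0:
        raise ValueError("expected_samples must be positive")
    coverage = len(samples_gib) / expected_samples
    if not samples_gib or coverage < policy.min_coverage:
        return None
    ordered = sorted(samples_gib)
    # Nearest-rank percentile: ceil(p / 100 * n), computed with integers.
    rank = (policy.percentile * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


policy = VllmPolicy(percentile=90, min_coverage=0.5)
samples = [20.0, 22.0, 21.0, 30.0, 25.0, 24.0, 23.0, 26.0, 27.0, 28.0]
print(recommend(samples, expected_samples=10, policy=policy))
print(recommend(samples[:3], expected_samples=10, policy=policy))
```

A lower percentile trims short spikes out of the recommendation, while 100% sizes for the peak observed in the window.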

Dashboard

vLLM Workloads Overview

Navigate to GPU Rightsizing → vLLM Workloads in the left sidebar for a full view of all vLLM workloads. The page includes:

Cost summary — total monthly GPU cost and percentage of wasted spend
Over-provisioned workloads — count of workloads with optimization opportunities
Automation status — pie chart showing automated vs. un-automated workloads, with an Automate All action
GPU Memory over time — aggregated 7-day and 30-day chart showing real usage, optimized memory usage, waste, request, and allocatable
Workloads table — per-workload breakdown with savings available, GPU compute and memory requests, vLLM memory usage (real vs. optimized), replicas, assigned policy, and automation toggle

Workload Overview — vLLM Tab

The workload overview page includes a vLLM tab for any vLLM workload. This tab shows:

GPU Memory and real vLLM usage over time — real memory usage, optimized memory usage, current request, and original request
