GPU Memory Optimization
Available in v1.30.9+
vLLM reserves ~90% of available GPU memory at startup by default, regardless of actual usage. This prevents ScaleOps from reclaiming unused GPU memory and efficiently bin-packing vLLM workloads across nodes.
ScaleOps supports fully automated, vLLM-aware GPU memory optimization. It observes the real GPU memory usage of each vLLM instance and automatically manages that instance's GPU memory allocation, so each workload is allocated what it actually needs and no more.
How It Works
Visibility — Real GPU Memory Usage
ScaleOps collects the real GPU memory usage of each vLLM instance, taking into account:
- Model weights — the fixed memory footprint of the loaded model
- KV cache for active inference — memory consumed by currently in-flight requests
- Prefix cache headroom — memory needed to maintain a high prefix cache hit rate
This real usage data replaces the misleading ~90% allocation figure, giving ScaleOps an accurate picture of actual GPU memory demand.
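To make the gap between reserved and real memory concrete, here is a minimal sketch that adds up the three components listed above and compares the total with the startup reservation. The `VllmMemorySample` class, the 80 GiB GPU, and all component sizes are hypothetical values chosen for illustration; the 0.9 factor mirrors vLLM's default `gpu_memory_utilization`.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class VllmMemorySample:
    """One observation of a vLLM instance's GPU memory, in bytes (hypothetical shape)."""
    model_weights: int
    active_kv_cache: int
    prefix_cache_headroom: int

    def real_usage(self) -> int:
        # Real usage is the sum of the three components listed above.
        return self.model_weights + self.active_kv_cache + self.prefix_cache_headroom


def reserved_at_startup(gpu_total: int, gpu_memory_utilization: float = 0.9) -> int:
    """Memory vLLM reserves up front, given the fraction it is configured to claim."""
    return round(gpu_total * gpu_memory_utilization)


# Illustrative numbers only; they are not measurements.
sample = VllmMemorySample(
    model_weights=16 * GIB,
    active_kv_cache=6 * GIB,
    prefix_cache_headroom=4 * GIB,
)
reserved = reserved_at_startup(80 * GIB)

print(f"reserved at startup: {reserved / GIB:.1f} GiB")
print(f"real usage:          {sample.real_usage() / GIB:.1f} GiB")
print(f"reclaimable:         {(reserved - sample.real_usage()) / GIB:.1f} GiB")
```

In this made-up case the reservation is far larger than the sum of the three components, which is the difference the real usage figure is meant to expose.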
Automation — Automatic GPU Memory Management
Once real usage data is available, ScaleOps automatically manages the GPU memory allocation for each vLLM workload. It calculates the optimal allocation from the real usage observed over the configured observation window. New pods receive this allocation from the start; running pods are updated only when the gap between the current and recommended allocation is significant enough to justify a restart. If demand increases and the current allocation becomes insufficient, ScaleOps detects this and increases the recommendation accordingly.
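The apply-or-wait behavior described above can be sketched as a small decision function. This is an illustration only, not ScaleOps' implementation: the function name `should_apply` and the 20% relative-gap threshold are assumptions, since the text does not state how "significant enough" is measured.

```python
def should_apply(
    current_gib: float,
    recommended_gib: float,
    is_new_pod: bool,
    restart_gap_ratio: float = 0.2,  # hypothetical threshold, not a documented default
) -> bool:
    """Decide whether a recommendation should be applied to a pod."""
    if is_new_pod:
        # New pods start with the recommended allocation; no restart is involved.
        return True
    if current_gib <= 0:
        # No meaningful current allocation to compare against.
        return True
    # Running pods are only restarted when the relative gap is large enough.
    gap = abs(current_gib - recommended_gib) / current_gib
    return gap >= restart_gap_ratio


print(should_apply(72.0, 26.0, is_new_pod=False))  # large gap on a running pod
print(should_apply(40.0, 38.0, is_new_pod=False))  # small gap on a running pod
print(should_apply(40.0, 38.0, is_new_pod=True))   # new pod
```

Using a relative gap rather than an absolute one keeps the same threshold meaningful for both small and large models; an absolute threshold in GiB would be an equally plausible reading.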
Compatibility
| Component | Supported versions |
|---|---|
| vLLM | > v0.7.0 (v0 and v1 engines) |
| Triton with vLLM backend | Official image versions > v25.10 |
| Workload types | Deployments, StatefulSets, Argo Rollouts, COGs |
GitOps Support
vLLM automation can be configured through GitOps using workload annotations, AutomatedNamespace (ANS), and ClusterAutomation Helm values.
Cluster-wide injection: A cluster-level toggle (vllm.observability.enabled, also configurable via Helm values) controls whether ScaleOps collects real GPU memory usage data from vLLM pods. When enabled, collection covers all detected vLLM workloads. Plugin injection can also be disabled per workload via annotation (equivalent to “Exclude from vLLM observability” in the UI).
Workload annotations:
| Annotation | Type | Description |
|---|---|---|
| scaleops.sh/default-vllm-auto | boolean | Enable or disable vLLM automation for the workload |
| scaleops.sh/default-vllm-policy | string | Assign a vLLM policy to the workload |
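As an illustration of how these annotations might be set from a GitOps pipeline, the sketch below builds a merge-patch body for a workload. Several details are assumptions: that the annotations live on the workload object's own metadata, that the boolean is encoded as the string "true" or "false" (Kubernetes annotation values are always strings), and the policy name `my-vllm-policy`, which is hypothetical.

```python
import json


def vllm_annotation_patch(enabled: bool, policy: str) -> dict:
    """Build a merge-patch body that sets the two vLLM workload annotations."""
    return {
        "metadata": {
            "annotations": {
                # Assumed encoding: annotation values must be strings.
                "scaleops.sh/default-vllm-auto": "true" if enabled else "false",
                "scaleops.sh/default-vllm-policy": policy,
            }
        }
    }


# "my-vllm-policy" is a placeholder; use the name of an existing vLLM policy.
patch = vllm_annotation_patch(enabled=True, policy="my-vllm-policy")
print(json.dumps(patch, indent=2))
```

The printed JSON could be committed as part of a manifest overlay or passed to `kubectl patch` with `--type merge`, depending on how the repository is organized.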
AutomatedNamespace fields:
| Field | Type | Description |
|---|---|---|
| vllmOptimize | boolean | Enable or disable vLLM optimization for all workloads in the namespace |
| defaultVllmPolicy | string | Set the default vLLM policy for the namespace |
ClusterAutomation Helm fields (see Helm reference):
| Field | Type | Description |
|---|---|---|
| clusterAutomation.vllm.optimize | boolean | Enable or disable vLLM automation cluster-wide |
| clusterAutomation.vllm.defaultPolicy | string | Set the default vLLM policy cluster-wide |
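The dotted field names in the table correspond to nested keys in a Helm values file. The sketch below expands them into that nested structure; the helper name `nest` and the policy name `my-vllm-policy` are hypothetical, and the exact placement of these keys within the ScaleOps chart should be confirmed against the Helm reference.

```python
import json


def nest(dotted: dict) -> dict:
    """Expand dotted Helm field paths into the nested mapping a values file uses."""
    root: dict = {}
    for path, value in dotted.items():
        node = root
        *parents, leaf = path.split(".")
        for key in parents:
            # Create intermediate mappings on demand.
            node = node.setdefault(key, {})
        node[leaf] = value
    return root


values = nest({
    "clusterAutomation.vllm.optimize": True,
    "clusterAutomation.vllm.defaultPolicy": "my-vllm-policy",  # placeholder name
})
print(json.dumps(values, indent=2))
```

Because JSON is also valid YAML, the printed output can be saved as a values file as-is.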
vLLM Policy
A dedicated vLLM Policy controls the optimization behavior per workload:
| Parameter | Description |
|---|---|
| Observation window | Time window used for history-based recommendations (default: 2 days) |
| Window coverage | Minimum data coverage required within the observation window |
| Percentile | Usage percentile to base the recommendation on: 100%, 99%, 95%, 90%, 85%, or 80% |
| Automation optimization strategy | Controls when recommendations are applied: upon pod creation only, or ongoing (also restarts running pods when the allocation gap is significant) |
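To show how the observation window, window coverage, and percentile parameters could interact, here is a minimal sketch of a history-based recommendation. It is not ScaleOps' algorithm: it uses the nearest-rank percentile method, interprets window coverage as the fraction of expected samples actually present, and the function name `recommend` and all sample values are made up for illustration.

```python
from typing import Optional, Sequence

# Percentile choices listed in the policy table.
ALLOWED_PERCENTILES = (100, 99, 95, 90, 85, 80)


def recommend(
    samples_gib: Sequence[float],
    expected_samples: int,
    percentile: int,
    min_coverage: float,
) -> Optional[float]:
    """Return a recommendation in GiB, or None when the window has too little data."""
    if percentile not in ALLOWED_PERCENTILES:
        raise ValueError(f"percentile must be one of {ALLOWED_PERCENTILES}")
    if expected_samples <= 0 or not samples_gib:
        return None
    # Assumed meaning of window coverage: share of expected samples present.
    if len(samples_gib) / expected_samples < min_coverage:
        return None
    ordered = sorted(samples_gib)
    # Nearest-rank percentile using integer ceiling division.
    rank = (percentile * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


usage = [21.0, 22.5, 23.0, 23.5, 24.0, 24.0, 24.5, 25.0, 26.0, 31.0]
print(recommend(usage, expected_samples=10, percentile=100, min_coverage=0.8))
print(recommend(usage, expected_samples=10, percentile=90, min_coverage=0.8))
print(recommend(usage[:3], expected_samples=10, percentile=90, min_coverage=0.8))
```

In this example the 100th percentile follows the single 31.0 GiB spike, the 90th percentile ignores it, and the last call returns `None` because only 3 of 10 expected samples are present. A lower percentile trades tolerance of rare spikes for a tighter allocation.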
Dashboard
vLLM Workloads Overview
Navigate to GPU Rightsizing → vLLM Workloads in the left sidebar for a full view of all vLLM workloads. The page includes:
- Cost summary — total monthly GPU cost and percentage of wasted spend
- Over-provisioned workloads — count of workloads with optimization opportunities
- Automation status — pie chart showing automated vs. un-automated workloads, with an Automate All action
- GPU Memory over time — aggregated 7-day and 30-day chart showing real usage, optimized memory usage, waste, request, and allocatable
- Workloads table — per-workload breakdown with savings available, GPU compute and memory requests, vLLM memory usage (real vs. optimized), replicas, assigned policy, and automation toggle
Workload Overview — vLLM Tab
The workload overview page includes a vLLM tab for any vLLM workload. This tab shows:
- GPU Memory and real vLLM usage over time — real memory usage, optimized memory usage, current request, and original request
