Overview

ScaleOps GPU Optimization combines fractional GPU allocation with automated GPU rightsizing to maximize GPU efficiency and significantly reduce GPU costs.

The Problem: GPU Waste in Kubernetes

By default, Kubernetes allocates GPUs as whole, indivisible units. A pod requesting any amount of GPU receives an entire device — even if it only uses a fraction of its compute or memory capacity. This leads to significant waste, especially for inference workloads that often utilize only a small fraction of the GPU’s capacity.

While NVIDIA provides several GPU sharing solutions for Kubernetes, each comes with significant trade-offs that limit its effectiveness in dynamic production environments:

Approach	How It Works	Limitations
Time-Slicing	Multiple pods take turns using the GPU	No resource isolation, no fine-grained fractions, static slices, manual configuration
MPS (Multi-Process Service)	Allows concurrent GPU access from multiple processes	Limited error isolation, complex configuration
MIG (Multi-Instance GPU)	Physically partitions a GPU into isolated instances, each with dedicated compute and memory	Static and limited partitions, not supported in all GPUs, manual configuration

How ScaleOps Approaches GPU Optimization

ScaleOps provides an integrated, automated solution that combines fractional GPU allocation with continuous rightsizing and GPU-aware scheduling — all working together out of the box.

Allocates fractional GPU shares — workloads receive precise fractions of GPU compute and memory (e.g., 0.25 GPU, 8 GiB), rather than entire devices
Continuously rightsizes — GPU requests are automatically adjusted based on real-time and historical usage, driven by production-ready policies
Schedules GPU-aware — a custom scheduler intelligently places fractional workloads to maximize GPU utilization across the cluster, integrated with node autoscaling

This means multiple workloads can safely share the same physical GPU with proper resource boundaries, while ScaleOps continuously ensures each workload has the right amount of GPU resources — no more, no less.

Fractional GPU Allocation

Kubernetes assigns entire GPUs to pods, even if they only need a fraction of GPU compute or memory. ScaleOps supports automated fractional allocation, allowing multiple pods to share the same GPU. Instead of allocating an entire GPU, workloads are automatically assigned fractions of GPU compute and memory (e.g., 0.25 GPU, 8 GiB memory). These fractions ensure that each pod gets exactly the share of GPU it requires without waste.
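As a rough illustration of what sharing by fractions implies, the sketch below checks whether a set of fractional requests fits on one GPU. The `FractionalRequest` type and `fits_on_one_gpu` function are hypothetical names invented for this example, not part of ScaleOps; the only assumption is that both compute fractions and memory must fit within a single device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FractionalRequest:
    """Hypothetical fractional GPU request: a compute share and a memory amount."""
    name: str
    compute_fraction: float  # share of one GPU's compute, 0 < x <= 1
    memory_gib: float        # GPU memory in GiB


def fits_on_one_gpu(requests, gpu_memory_gib):
    """Return True if all requests can share a single GPU.

    Both dimensions must fit: compute fractions sum to at most 1.0 and
    memory amounts sum to at most the device's total memory.
    """
    total_compute = sum(r.compute_fraction for r in requests)
    total_memory = sum(r.memory_gib for r in requests)
    # Small tolerance guards against floating-point rounding in the sums.
    return total_compute <= 1.0 + 1e-9 and total_memory <= gpu_memory_gib + 1e-9


requests = [
    FractionalRequest("inference-a", 0.25, 8),
    FractionalRequest("inference-b", 0.25, 8),
    FractionalRequest("inference-c", 0.50, 16),
]
# Three pods that would otherwise hold three whole GPUs share one 40 GiB device.
print(fits_on_one_gpu(requests, gpu_memory_gib=40))
```

With whole-device allocation these three pods would occupy three GPUs; expressed as fractions, they fit on one.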

Dynamic GPU Sharing

How it works:

ScaleOps introduces the concept of capacity pods, which act as GPU abstraction layers. A capacity pod claims a full GPU device from Kubernetes and then manages the fractional allocation of that GPU’s compute and memory to multiple workload pods. This approach is transparent to the Kubernetes scheduler — it sees standard resource requests — while ScaleOps handles the fine-grained sharing underneath.
When a workload is automated, ScaleOps replaces the pod’s whole-GPU request with fractional annotations specifying the exact GPU compute fraction and memory amount the workload needs. The workload pod itself no longer requests nvidia.com/gpu: 1 — instead, it is associated with a capacity pod that holds the physical GPU.
Multiple workload pods can share the same capacity pod (and therefore the same physical GPU), each receiving an isolated slice of compute and memory according to its annotations.
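The rewrite described above can be sketched on a pod manifest held as a plain dictionary. This is a minimal sketch, not the ScaleOps implementation: the annotation keys under `example.com/` are placeholders chosen for illustration, since the real keys are not given here, and `to_fractional` is a hypothetical helper.

```python
import copy

# Placeholder annotation keys for illustration only; the real keys are not given here.
COMPUTE_ANNOTATION = "example.com/gpu-compute-fraction"
MEMORY_ANNOTATION = "example.com/gpu-memory-gib"
GPU_RESOURCE = "nvidia.com/gpu"


def to_fractional(pod, compute_fraction, memory_gib):
    """Return a copy of a pod manifest with whole-GPU requests replaced by annotations."""
    new_pod = copy.deepcopy(pod)
    for container in new_pod["spec"]["containers"]:
        resources = container.get("resources", {})
        for section in ("requests", "limits"):
            # Drop the whole-device request; the capacity pod holds the GPU instead.
            resources.get(section, {}).pop(GPU_RESOURCE, None)
    annotations = new_pod.setdefault("metadata", {}).setdefault("annotations", {})
    # Kubernetes annotation values are strings.
    annotations[COMPUTE_ANNOTATION] = str(compute_fraction)
    annotations[MEMORY_ANNOTATION] = str(memory_gib)
    return new_pod


pod = {
    "metadata": {"name": "inference-a"},
    "spec": {"containers": [{
        "name": "model",
        "resources": {"limits": {"nvidia.com/gpu": 1, "cpu": "2"}},
    }]},
}
fractional_pod = to_fractional(pod, compute_fraction=0.25, memory_gib=8)
print(fractional_pod["metadata"]["annotations"])
```

Non-GPU resources such as CPU are left untouched, and the original manifest is not modified.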

Learn more about fractional GPU allocation in the Automated Fractional GPUs page.

Automated Rightsizing

ScaleOps monitors real-time GPU usage metrics at the pod level and continuously optimizes GPU compute and memory requests. Driven by rightsizing policies, it automates workload rightsizing so that each workload receives the GPU resources it needs without manual intervention.

GPU Automated Workload

How it works:

GPU rightsizing is governed by policies, either built-in or custom. Each policy defines the rightsizing behavior for a type of workload, such as inference, training, or build, and ScaleOps automatically applies the matching policy to each workload.
ScaleOps collects pod-level GPU metrics (compute utilization and memory usage) through its enhanced DCGM exporter. Unlike standard DCGM, which only reports device-level metrics, ScaleOps provides per-pod visibility even when multiple pods share the same GPU.
Based on real-time usage, historical data, and the rightsizing policy, ScaleOps automatically adjusts each workload’s GPU compute fraction and memory allocation on an ongoing basis — scaling up when a workload needs more resources and scaling down when it’s over-provisioned.
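One way to picture a usage-driven recommendation is a high percentile of observed usage plus some headroom. The sketch below is an assumption made for illustration: the 95th percentile, the 15% headroom, and the `recommend` function are not taken from ScaleOps policies, which may work differently.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def recommend(compute_samples, memory_samples_gib, pct=95, headroom=0.15):
    """Recommend a compute fraction and memory amount from usage samples.

    The percentile and headroom stand in for a rightsizing policy; both
    values are assumptions for this sketch.
    """
    compute = percentile(compute_samples, pct) * (1 + headroom)
    memory = percentile(memory_samples_gib, pct) * (1 + headroom)
    # Round before taking the ceiling so float noise (e.g. 23.000000000000004)
    # does not push the result up a step; then cap at one whole GPU.
    compute_fraction = min(1.0, math.ceil(round(compute * 100, 6)) / 100)
    memory_gib = math.ceil(round(memory, 6))
    return compute_fraction, memory_gib


# Per-pod samples: compute as a fraction of one GPU, memory in GiB.
compute_usage = [0.10, 0.12, 0.18, 0.20, 0.15]
memory_usage = [5.0, 5.5, 6.0, 5.8, 5.2]
print(recommend(compute_usage, memory_usage))
```

Rounding up rather than to the nearest step keeps the recommendation from undershooting observed peaks.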

Learn more in the Policies page.

Scheduling

Even with fractional allocation, pods need to be placed carefully to avoid GPU contention and maximize utilization. ScaleOps addresses this with a GPU-Aware Scheduler that understands fractional GPU requests and makes intelligent placement decisions.

How it works:

The ScaleOps scheduler tracks the available compute and memory capacity on each GPU across the cluster. When a new fractional workload needs scheduling, it selects the optimal GPU and capacity pod based on available headroom.
The scheduler uses a bin-packing strategy to maximize GPU utilization — it prefers placing workloads on GPUs that already have other fractional workloads, rather than spreading them across underutilized devices. This reduces the total number of GPUs (and GPU nodes) needed.
When no existing GPU has sufficient remaining capacity, the scheduler integrates with the Cluster Autoscaler to trigger provisioning of new GPU nodes as needed.
Scheduling works out of the box for any automated workload, with no additional configuration.
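The placement behavior described above resembles best-fit bin packing. The following is a minimal sketch of that idea under simplifying assumptions, not the ScaleOps scheduler: `GpuState` and `pick_gpu` are hypothetical names, and a return value of `None` merely stands in for the point at which new GPU nodes would be requested.

```python
from dataclasses import dataclass


@dataclass
class GpuState:
    """Free capacity on one physical GPU (hypothetical bookkeeping record)."""
    gpu_id: str
    free_compute: float      # unallocated compute fraction, 0..1
    free_memory_gib: float   # unallocated GPU memory in GiB


def pick_gpu(gpus, compute_fraction, memory_gib):
    """Best-fit placement: choose the fitting GPU with the least free compute.

    Returns None when nothing fits, which a caller could treat as the
    signal to request a new GPU node.
    """
    candidates = [
        g for g in gpus
        if g.free_compute >= compute_fraction and g.free_memory_gib >= memory_gib
    ]
    if not candidates:
        return None
    # Preferring the fullest GPU packs workloads together and keeps
    # emptier GPUs available for larger requests.
    chosen = min(candidates, key=lambda g: (g.free_compute, g.free_memory_gib))
    chosen.free_compute -= compute_fraction
    chosen.free_memory_gib -= memory_gib
    return chosen.gpu_id


gpus = [
    GpuState("gpu-0", free_compute=1.0, free_memory_gib=40),
    GpuState("gpu-1", free_compute=0.5, free_memory_gib=24),
]
print(pick_gpu(gpus, 0.25, 8))   # lands on the partially used gpu-1
print(pick_gpu(gpus, 1.0, 40))   # only the empty gpu-0 can hold a full GPU
print(pick_gpu(gpus, 0.5, 8))    # None: gpu-1 has 0.25 left and gpu-0 is full
```

A spreading strategy would have placed the first workload on the empty `gpu-0`, leaving no device able to accept the full-GPU request that follows.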

Auto-Healing

ScaleOps provides GPU-specific auto-healing that detects and responds to GPU resource pressure conditions:

GPU Memory Pressure — When a pod experiences GPU memory pressure, ScaleOps automatically increases the GPU memory allocation to prevent failures and restore stability.
GPU Compute Pressure — When a pod is consistently under GPU compute pressure, ScaleOps increases the GPU compute fraction to maintain performance.

When auto-healing is active, ScaleOps temporarily blocks scale-down operations on the affected workload to prevent the healing recommendation from being overridden. Auto-healing is enabled by default and can be configured per policy.
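The interaction between healing and scale-down can be sketched as a small state machine for the memory case. Everything here is an assumption for illustration: the `WorkloadState` record, the 25% increase, and the one-hour block window are invented values, not ScaleOps defaults.

```python
from dataclasses import dataclass


@dataclass
class WorkloadState:
    """Hypothetical per-workload record of the current GPU memory allocation."""
    memory_gib: float
    scale_down_blocked_until: float = 0.0  # timestamp in seconds


def heal_memory_pressure(state, now, bump=1.25, block_seconds=3600):
    """Raise the memory allocation and block scale-down for a while.

    The 25% bump and one-hour block are assumed values for this sketch.
    """
    state.memory_gib = state.memory_gib * bump
    state.scale_down_blocked_until = now + block_seconds


def apply_recommendation(state, recommended_gib, now):
    """Apply a rightsizing recommendation unless it is a blocked scale-down."""
    is_scale_down = recommended_gib < state.memory_gib
    if is_scale_down and now < state.scale_down_blocked_until:
        return False  # keep the healed allocation
    state.memory_gib = recommended_gib
    return True


state = WorkloadState(memory_gib=8.0)
heal_memory_pressure(state, now=0)                 # allocation raised to 10 GiB
print(apply_recommendation(state, 6.0, now=600))   # scale-down during the block is refused
print(apply_recommendation(state, 12.0, now=600))  # scale-up is still allowed
print(apply_recommendation(state, 6.0, now=4000))  # block has expired, scale-down applies
```

Only scale-down is blocked, so a workload that needs even more memory during the window can still grow.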

Supported Workload Types

GPU optimization supports the same Kubernetes workload types as standard workload rightsizing, including:

Deployment
StatefulSet
DaemonSet
Job
ArgoRollout
DeploymentConfig
Custom workloads — workload types you define and group with the custom-owner-grouping CRD (beyond the built-in owners above). See Custom Workloads.

The automation strategy varies by workload type: Deployments use ongoing updates (continuous optimization), while StatefulSets, DaemonSets, Jobs, and ArgoRollouts are updated upon pod creation (optimization is applied when new pods are created).
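For tooling that needs to reason about this distinction, the mapping in the paragraph above can be written as a lookup table. This is only a restatement of the text: `UpdateStrategy` and `strategy_for` are hypothetical names, and workload kinds the paragraph does not mention are deliberately left out rather than guessed.

```python
from enum import Enum


class UpdateStrategy(Enum):
    ONGOING = "ongoing"
    UPON_POD_CREATION = "upon pod creation"


# Mapping taken from the paragraph above; kinds it does not mention are omitted.
STRATEGY_BY_KIND = {
    "Deployment": UpdateStrategy.ONGOING,
    "StatefulSet": UpdateStrategy.UPON_POD_CREATION,
    "DaemonSet": UpdateStrategy.UPON_POD_CREATION,
    "Job": UpdateStrategy.UPON_POD_CREATION,
    "ArgoRollout": UpdateStrategy.UPON_POD_CREATION,
}


def strategy_for(kind):
    """Return the update strategy for a workload kind, or None if unspecified."""
    return STRATEGY_BY_KIND.get(kind)


print(strategy_for("Deployment").value)
print(strategy_for("StatefulSet").value)
```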