Automated Fractional GPUs
ScaleOps Automated Fractional GPUs combines fractional GPU allocation with continuous rightsizing, allowing multiple pods to share a single physical GPU with fine-grained resource allocation, node autoscaling support, and GPU-level interoperability. ScaleOps automatically adjusts GPU compute and memory allocations in real time based on actual pod-level usage.
How Fractional Allocation Works
In standard Kubernetes, GPU resources are allocated as whole units (nvidia.com/gpu: 1). Even if a workload uses only 20% of a GPU’s compute capacity, the entire device is reserved and unavailable to other pods.
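For reference, a whole-GPU request in a standard pod spec might look like the minimal sketch below. The pod name and image are hypothetical placeholders; only the `nvidia.com/gpu` resource name comes from the text above.

```yaml
# Standard whole-GPU request: the pod reserves an entire device.
# Pod name and image are hypothetical placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: inference-whole-gpu
spec:
  containers:
    - name: inference
      image: registry.example.com/inference:latest
      resources:
        limits:
          nvidia.com/gpu: 1   # extended resources accept integer quantities only
```

Because the quantity must be an integer, a workload that needs only a small share of the device still reserves all of it.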
ScaleOps changes this by introducing a capacity pod abstraction. A capacity pod claims the full GPU device from Kubernetes and manages the sharing of that GPU across multiple workload pods:
- Capacity pod claims the GPU: a ScaleOps-managed capacity pod requests the full nvidia.com/gpu: 1 resource, holding the physical GPU device
- Workload pods receive fractional shares: instead of requesting a whole GPU, automated workload pods are assigned a specific fraction of GPU compute and a specific amount of GPU memory via annotations
- Multiple pods share one GPU: several workload pods can be associated with the same capacity pod, each receiving its defined share of the GPU
This model is transparent to Kubernetes — the cluster scheduler sees standard resource requests and doesn’t need to understand fractional GPUs. ScaleOps handles the fractional management layer.
Fractional GPU Annotations
ScaleOps automatically adds annotations to automated workloads to define their fine-grained fractional GPU allocation:
metadata:
  annotations:
    scaleops.sh/gpu-compute-fraction: "0.25"
    scaleops.sh/gpu-memory: "1024"

| Annotation | Description |
|---|---|
| scaleops.sh/gpu-compute-fraction | The fraction of GPU compute allocated to this pod (e.g., 0.25 = 25% of the GPU's compute capacity) |
| scaleops.sh/gpu-memory | The amount of GPU memory allocated to this pod, in MiB |
These annotations are managed automatically by ScaleOps and are updated as rightsizing adjustments are made.
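As an illustration of how these annotations might appear on pods sharing a GPU, the sketch below shows two workload pods after annotation. The pod names, images, and values are hypothetical, and the pods' own resource requests are omitted because the source does not describe them. Whether these two pods would actually be placed on the same GPU is an assumption.

```yaml
# Illustrative only: two workload pods as they might appear after ScaleOps
# has annotated them. Pod names, images, and values are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: embedding-worker
  annotations:
    scaleops.sh/gpu-compute-fraction: "0.25"   # 25% of the GPU's compute
    scaleops.sh/gpu-memory: "1024"             # 1024 MiB of GPU memory
spec:
  containers:
    - name: worker
      image: registry.example.com/embedding-worker:latest
---
apiVersion: v1
kind: Pod
metadata:
  name: reranker
  annotations:
    scaleops.sh/gpu-compute-fraction: "0.5"    # 50% of the GPU's compute
    scaleops.sh/gpu-memory: "4096"             # 4096 MiB of GPU memory
spec:
  containers:
    - name: reranker
      image: registry.example.com/reranker:latest
```

In this example the compute fractions total 0.75 and the memory allocations total 5120 MiB. Annotation values are quoted because Kubernetes annotations must be strings.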
Fractional GPU Usage Monitoring
The ScaleOps proprietary DCGM exporter provides pod-level GPU usage monitoring, even when multiple pods share a single physical GPU. This is a key differentiator: standard DCGM reports metrics only at the device level, so it cannot show which pod is consuming what. ScaleOps provides per-pod visibility into:
- GPU compute utilization — how much of the allocated compute fraction each pod is using
- GPU memory usage — actual memory consumption per pod
This pod-level visibility is what powers the automated rightsizing — without it, there’s no data to drive accurate resource adjustments.

Continuous Rightsizing
ScaleOps continuously rightsizes GPU compute and memory in real time, based on pod-level GPU usage metrics, with out-of-the-box production-ready policies. Rightsizing is applied to automated workloads — automating a workload enables ScaleOps to continuously manage its GPU compute fraction and memory allocation based on actual usage.
GPU Workloads
The GPU Workloads page offers a complete view of the GPU workloads within the cluster, including GPU usage, fractional requests, potential cost savings, and more.
GPU Workload Overview
The GPU Workload Overview visualizes resources over time.
