
Automated Fractional GPUs

ScaleOps Automated Fractional GPUs combines fractional GPU allocation with continuous rightsizing, allowing multiple pods to share a single physical GPU with fine-grained resource allocation, node autoscaling support, and GPU-level interoperability. ScaleOps automatically adjusts GPU compute and memory allocations in real time based on actual pod-level usage.

How Fractional Allocation Works

In standard Kubernetes, GPU resources are allocated as whole units (nvidia.com/gpu: 1). Even if a workload uses only 20% of a GPU’s compute capacity, the entire device is reserved and unavailable to other pods.

ScaleOps changes this by introducing a capacity pod abstraction. A capacity pod claims the full GPU device from Kubernetes and manages the sharing of that GPU across multiple workload pods:

  1. Capacity pod claims the GPU — a ScaleOps-managed capacity pod requests the full nvidia.com/gpu: 1 resource, holding the physical GPU device
  2. Workload pods receive fractional shares — instead of requesting a whole GPU, automated workload pods are assigned a specific fraction of GPU compute and a specific amount of GPU memory via annotations
  3. Multiple pods share one GPU — several workload pods can be associated with the same capacity pod, each receiving its defined share of the GPU

This model is transparent to Kubernetes — the cluster scheduler sees standard resource requests and doesn’t need to understand fractional GPUs. ScaleOps handles the fractional management layer.
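The sharing model above can be sketched as a small bookkeeping exercise. This is an illustrative model only, not ScaleOps code: the `CapacityPod` and `WorkloadPod` classes, the 16384 MiB GPU size, and the rule that shares are never oversubscribed are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class WorkloadPod:
    name: str
    compute_fraction: float  # share of the GPU's compute, e.g. 0.25
    memory_mib: int          # GPU memory in MiB


@dataclass
class CapacityPod:
    """Hypothetical stand-in for the pod that holds nvidia.com/gpu: 1."""
    gpu_memory_mib: int
    workloads: list = field(default_factory=list)

    def free_compute(self) -> float:
        return 1.0 - sum(w.compute_fraction for w in self.workloads)

    def free_memory_mib(self) -> int:
        return self.gpu_memory_mib - sum(w.memory_mib for w in self.workloads)

    def try_assign(self, pod: WorkloadPod) -> bool:
        # Assumption for this sketch: shares are never oversubscribed.
        fits = (pod.compute_fraction <= self.free_compute() + 1e-9
                and pod.memory_mib <= self.free_memory_mib())
        if fits:
            self.workloads.append(pod)
        return fits


gpu = CapacityPod(gpu_memory_mib=16384)  # assumed GPU memory size
results = [
    gpu.try_assign(WorkloadPod("inference-a", 0.25, 1024)),
    gpu.try_assign(WorkloadPod("inference-b", 0.50, 4096)),
    gpu.try_assign(WorkloadPod("training-c", 0.50, 8192)),  # exceeds remaining compute
]
print(results)  # [True, True, False]
```

In this sketch the third pod is rejected because only 25% of the GPU's compute remains, even though enough memory is free. The real product may place or pack pods differently.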

Fractional GPU Annotations

ScaleOps automatically adds annotations to automated workloads to define their fine-grained fractional GPU allocation:

metadata:
  annotations:
    scaleops.sh/gpu-compute-fraction: "0.25"
    scaleops.sh/gpu-memory: "1024"

| Annotation | Description |
| --- | --- |
| scaleops.sh/gpu-compute-fraction | The fraction of GPU compute allocated to this pod (e.g., 0.25 = 25% of the GPU’s compute capacity) |
| scaleops.sh/gpu-memory | The amount of GPU memory allocated to this pod, in MiB |

These annotations are managed automatically by ScaleOps and are updated as rightsizing adjustments are made.
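For tooling that needs to read these values, a minimal sketch of parsing the two annotations from a pod's annotation map might look like the following. The annotation keys come from the table above. The `read_gpu_allocation` helper and its validation ranges are hypothetical and not part of any ScaleOps API.

```python
from typing import Optional, Tuple

COMPUTE_KEY = "scaleops.sh/gpu-compute-fraction"
MEMORY_KEY = "scaleops.sh/gpu-memory"


def read_gpu_allocation(annotations: dict) -> Optional[Tuple[float, int]]:
    """Return (compute_fraction, memory_mib), or None if the pod is not annotated."""
    if COMPUTE_KEY not in annotations or MEMORY_KEY not in annotations:
        return None
    # Annotation values are always strings in Kubernetes, so convert explicitly.
    fraction = float(annotations[COMPUTE_KEY])
    memory_mib = int(annotations[MEMORY_KEY])
    # Assumed sanity checks: a fraction in (0, 1] and a positive memory size.
    if not 0.0 < fraction <= 1.0:
        raise ValueError(f"unexpected compute fraction: {fraction}")
    if memory_mib <= 0:
        raise ValueError(f"unexpected GPU memory: {memory_mib}")
    return fraction, memory_mib


pod_annotations = {
    "scaleops.sh/gpu-compute-fraction": "0.25",
    "scaleops.sh/gpu-memory": "1024",
}
print(read_gpu_allocation(pod_annotations))  # (0.25, 1024)
```

Because ScaleOps manages and updates these annotations, a reader like this should treat them as read-only and re-read them rather than cache the values.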

Fractional GPU Usage Monitoring

ScaleOps' proprietary DCGM exporter provides pod-level GPU usage monitoring, even when multiple pods share a single physical GPU. This is a key differentiator: standard DCGM reports metrics only at the device level, so it cannot show which pod is consuming what. ScaleOps provides per-pod visibility into:

  • GPU compute utilization — how much of the allocated compute fraction each pod is using
  • GPU memory usage — actual memory consumption per pod

This pod-level visibility is what powers the automated rightsizing — without it, there’s no data to drive accurate resource adjustments.
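To make "how much of the allocated compute fraction each pod is using" concrete, the sketch below relates a pod's usage to its allocation. It assumes usage is reported as a share of the whole device. The function name, the input format, and the sample numbers are illustrative and do not reflect the exporter's actual metrics.

```python
def allocation_utilization(device_share_used: float, compute_fraction: float) -> float:
    """Share of a pod's allocated compute fraction that is in use (1.0 = fully used)."""
    if compute_fraction <= 0:
        raise ValueError("compute_fraction must be positive")
    return device_share_used / compute_fraction


# Hypothetical per-pod samples: (share of the device used, allocated fraction)
pods = {
    "inference-a": (0.10, 0.25),
    "inference-b": (0.45, 0.50),
}

report = {
    name: round(allocation_utilization(used, allocated) * 100)
    for name, (used, allocated) in pods.items()
}
print(report)  # {'inference-a': 40, 'inference-b': 90}
```

Here `inference-a` uses only 40% of its allocation and is a candidate for a smaller fraction, while `inference-b` is close to its limit. A device-level metric alone would show 55% total usage and hide that difference.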

Figure: GPU Pods Table

Continuous Rightsizing

ScaleOps continuously rightsizes GPU compute and memory in real time, based on pod-level GPU usage metrics, with out-of-the-box production-ready policies. Rightsizing applies only to automated workloads: once a workload is automated, ScaleOps continuously manages its GPU compute fraction and memory allocation based on actual usage.
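The source does not describe the rightsizing policies themselves. As a purely illustrative sketch of usage-driven rightsizing, the function below recommends a compute fraction from recent usage samples using a nearest-rank percentile plus headroom. The policy, the function name, and every parameter value are assumptions, not the ScaleOps algorithm.

```python
import math


def recommend_fraction(samples, percentile=0.95, headroom=0.20, floor=0.05):
    """Recommend a GPU compute fraction from per-pod usage samples (device shares)."""
    if not samples:
        raise ValueError("at least one usage sample is required")
    ordered = sorted(samples)
    # Nearest-rank percentile: the smallest sample covering `percentile` of the data.
    rank = max(1, math.ceil(percentile * len(ordered)))
    peak = ordered[rank - 1]
    # Add headroom for bursts, then clamp to [floor, 1.0] (one whole GPU).
    target = peak * (1.0 + headroom)
    return round(min(1.0, max(floor, target)), 2)


usage = [0.08, 0.10, 0.12, 0.09, 0.11, 0.10, 0.15, 0.09, 0.10, 0.11]
print(recommend_fraction(usage))  # 0.18
```

With these samples, a pod currently allocated 0.25 of a GPU would be resized to 0.18 under this hypothetical policy. The same approach could be applied to GPU memory in MiB.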

GPU Workloads

The GPU Workloads page offers a complete view of the GPU workloads within the cluster, including GPU usage, fractional requests, potential cost savings, and more.


Figure: GPU Workload Page

GPU Workload Overview

The GPU Workload Overview shows how a workload's resources change over time.


Figure: GPU Workload Overview