Architecture

Flow Diagram

The following diagram illustrates the ScaleOps GPU automation flow.

[Figure: ScaleOps GPU Optimization flow diagram]

Components

GPU optimization is orchestrated by the following components:

Component	Purpose
DCGM Exporter	Enhanced DCGM exporter that emits pod-level GPU metrics, including GPU compute utilization and memory usage
Capacity Pod	A ScaleOps-managed pod that claims a full physical GPU from Kubernetes and enables multiple workload pods to share it with fine-grained compute and memory allocation
ScaleOps Scheduler	Custom Kubernetes scheduler that packs pods requiring fractional GPU onto nodes with sufficient available GPU capacity

DCGM Exporter

ScaleOps ships an enhanced DCGM exporter that provides pod-level GPU metrics — a capability the standard dcgm-exporter does not offer.

Standard DCGM only reports metrics at the device level. When multiple pods share a single GPU, device-level metrics cannot tell you which pod is consuming what. ScaleOps’ enhanced exporter breaks this down to the individual pod, continuously emitting:

- GPU compute utilization per pod: how much of the GPU’s compute capacity each pod is actually using
- GPU memory usage per pod: actual memory consumption for each pod sharing the device

This pod-level visibility is what powers automated rightsizing. Without it, there is no data to drive accurate per-workload GPU resource adjustments.
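The difference between the two views can be illustrated with a small sketch. The `PodGpuSample` type, the pod names, and the sample values are hypothetical and do not reflect the exporter's actual metric schema; treating device utilization as the sum of per-pod shares is also a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodGpuSample:
    """Hypothetical per-pod sample; not the exporter's real schema."""
    pod: str
    gpu_uuid: str
    compute_util_pct: float  # share of the GPU's compute used by this pod
    memory_used_mib: int     # GPU memory used by this pod


def device_level_view(samples):
    """Collapse per-pod samples into one total per GPU, as a device-level metric would."""
    totals = {}
    for s in samples:
        util, mem = totals.get(s.gpu_uuid, (0.0, 0))
        totals[s.gpu_uuid] = (util + s.compute_util_pct, mem + s.memory_used_mib)
    return totals


# Three pods sharing one physical GPU (illustrative values).
samples = [
    PodGpuSample("inference-a", "GPU-0", 10.0, 2048),
    PodGpuSample("inference-b", "GPU-0", 55.0, 8192),
    PodGpuSample("batch-c", "GPU-0", 5.0, 1024),
]

device = device_level_view(samples)
per_pod = {s.pod: (s.compute_util_pct, s.memory_used_mib) for s in samples}
```

The device-level view only shows that the GPU is busy; the per-pod view shows that `inference-b` accounts for most of that usage, which is the information rightsizing needs.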

There is also a practical consequence specific to automated workloads: standard DCGM only reports metrics for pods that hold a GPU resource request. Because ScaleOps removes that request from automated pods, a standard exporter would stop reporting usage for them entirely after automation, in addition to lacking a per-pod breakdown. The ScaleOps exporter is not subject to this limitation.

The DCGM exporter is deployed automatically on every GPU node as part of the ScaleOps installation.

Capacity Pod

The capacity pod is the core abstraction that makes fractional GPU sharing possible.

Kubernetes treats GPUs as whole, indivisible units — any pod requesting a GPU receives the entire device, even if it only uses a small fraction of it. The capacity pod solves this by acting as the GPU holder on behalf of the cluster.

Each capacity pod claims one full physical GPU and manages the fractional sharing of that GPU across multiple workload pods. From Kubernetes’ perspective, only standard resource requests exist — ScaleOps handles fractional allocation in its own layer underneath.

When ScaleOps automates a workload, it removes the pod’s whole-GPU request and creates a capacity pod in its place. The capacity pod inherits the workload’s scheduling metadata (node selectors, topology constraints, tolerations, and so on) so that it lands on the correct node type. Because the capacity pod, not the workload pod, holds the GPU resource, it is the capacity pod that triggers node autoscaling. It also keeps GPU nodes alive: a node is not scaled down as long as a capacity pod is running on it.
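As a rough illustration of this inheritance, the sketch below builds a capacity-pod manifest from a workload pod. The function name, container name, and image are hypothetical, and the `nvidia.com/gpu` resource name assumes the NVIDIA device plugin. The manifest ScaleOps actually generates is not documented here and may differ.

```python
import copy

# Scheduling fields copied from the workload (standard Kubernetes PodSpec field names).
INHERITED_FIELDS = ("nodeSelector", "affinity", "tolerations", "topologySpreadConstraints")


def build_capacity_pod(workload_pod, name):
    """Return a hypothetical capacity-pod manifest mirroring the workload's scheduling metadata."""
    spec = workload_pod.get("spec", {})
    capacity_spec = {
        key: copy.deepcopy(spec[key]) for key in INHERITED_FIELDS if key in spec
    }
    capacity_spec["containers"] = [{
        "name": "gpu-holder",                   # hypothetical container name
        "image": "example.invalid/gpu-holder",  # placeholder, not a real image
        "resources": {"limits": {"nvidia.com/gpu": 1}},  # claims one whole GPU
    }]
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": capacity_spec,
    }


workload = {
    "metadata": {"name": "inference-a"},
    "spec": {
        "nodeSelector": {"gpu-type": "a100"},  # illustrative label
        "tolerations": [{"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}],
        "containers": [{"name": "app", "image": "example.invalid/app"}],
    },
}

capacity = build_capacity_pod(workload, "capacity-inference-a")
```

Copying the scheduling fields rather than the whole spec keeps the capacity pod free of the workload's containers and volumes while still steering it to the same node type.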

ScaleOps uses utilization data to determine how many capacity pods are needed. If nine workload pods each require only 0.3 of a GPU’s compute capacity, three of them fit on one GPU (3 × 0.3 = 0.9), so all nine can be consolidated onto three physical GPUs. ScaleOps therefore creates three capacity pods, each shared by three workload pods. This is what reduces the number of GPU nodes the cluster actually needs.
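The arithmetic in this example can be reproduced with a simple first-fit-decreasing packing. This is a minimal sketch, not the algorithm ScaleOps uses: it considers only compute, ignores GPU memory, and expresses fractions in thousandths of a GPU (a hypothetical unit chosen to avoid floating-point rounding).

```python
def capacity_pods_needed(requests_milli, gpu_capacity_milli=1000):
    """Count the GPUs (one capacity pod each) needed to hold the given compute requests.

    Uses first-fit decreasing: place each request, largest first, on the first
    GPU that still has room, and open a new GPU only when none does.
    """
    free = []  # remaining compute per GPU, in thousandths of a GPU
    for req in sorted(requests_milli, reverse=True):
        if req > gpu_capacity_milli:
            raise ValueError(f"request {req} exceeds one GPU ({gpu_capacity_milli})")
        for i, remaining in enumerate(free):
            if req <= remaining:
                free[i] = remaining - req
                break
        else:
            # No existing GPU has room, so another capacity pod is needed.
            free.append(gpu_capacity_milli - req)
    return len(free)


# Nine pods at 0.3 GPU each, as in the example above.
needed = capacity_pods_needed([300] * 9)
```

Here `needed` is 3, compared with nine GPUs if each pod kept its own whole-GPU request.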

ScaleOps creates one capacity pod per GPU device that is actually in use — if workloads only need two of a node’s four GPUs, only two capacity pods are created. Each capacity pod independently manages its own available capacity and the set of workload pods sharing it. This also enables GPU-level isolation between automated and non-automated pods: both can safely co-exist on the same node without interfering with each other at the GPU level.

ScaleOps Scheduler

The ScaleOps scheduler is a custom Kubernetes scheduler responsible for placing every automated fractional GPU pod.

It does not replace the default Kubernetes scheduler. Only GPU Optimization automated pods go through the ScaleOps scheduler — all other pods continue to be handled by the standard scheduler with no conflicts. Because automated pods are routed exclusively to the ScaleOps scheduler and their GPU is held by a capacity pod, there is no race condition between the two schedulers.
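In Kubernetes, a pod selects its scheduler through the `spec.schedulerName` field, and pods that omit it are handled by `default-scheduler`. The sketch below assumes ScaleOps routes automated pods this way; the scheduler name `scaleops-scheduler` and the helper function are hypothetical.

```python
import copy

SCALEOPS_SCHEDULER_NAME = "scaleops-scheduler"  # hypothetical name, for illustration only


def route_pod(pod, automated):
    """Return a copy of the pod; only automated pods get a non-default schedulerName."""
    routed = copy.deepcopy(pod)
    if automated:
        routed.setdefault("spec", {})["schedulerName"] = SCALEOPS_SCHEDULER_NAME
    return routed


pod = {"metadata": {"name": "inference-a"}, "spec": {"containers": []}}

automated_pod = route_pod(pod, automated=True)
regular_pod = route_pod(pod, automated=False)
```

Because each pod names exactly one scheduler, the two schedulers never compete for the same pod.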

When a workload is automated, its pod is routed to the ScaleOps scheduler. Because the pod no longer requests a whole GPU, it needs a scheduler that understands fractional capacity — the ScaleOps scheduler is that scheduler. It selects the best GPU for the pod by:

- Tracking per-GPU capacity: the scheduler maintains a real-time view of available compute and memory for every GPU across every node in the cluster. On nodes with multiple GPUs, each GPU is tracked independently.
- Bin-packing: the scheduler prefers GPUs that already host other fractional workloads, consolidating onto fewer active devices to reduce the total number of GPU nodes needed.
- Triggering autoscaling: when no existing GPU has enough remaining capacity, the scheduler triggers provisioning of a new GPU node and places the workload there once it’s ready.
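The three behaviors above can be sketched as a single selection function. This is an illustrative approximation, not the scheduler's actual scoring logic: the `Gpu` type, the thousandths-of-a-GPU unit, and the tie-breaking rule (prefer occupied GPUs, then the tightest compute fit) are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Gpu:
    """Hypothetical per-GPU capacity record; each GPU on a node is tracked separately."""
    node: str
    index: int
    free_compute_milli: int  # unallocated compute, in thousandths of a GPU
    free_memory_mib: int
    pods: List[str] = field(default_factory=list)


def pick_gpu(gpus, compute_milli, memory_mib) -> Optional[Gpu]:
    """Pick a GPU for a fractional pod, or return None if a new node is needed."""
    candidates = [
        g for g in gpus
        if g.free_compute_milli >= compute_milli and g.free_memory_mib >= memory_mib
    ]
    if not candidates:
        return None  # the caller would trigger provisioning of a new GPU node
    # Bin-packing: GPUs that already host pods sort first, then the tightest fit.
    return min(candidates, key=lambda g: (not g.pods, g.free_compute_milli - compute_milli))


gpus = [
    Gpu("node-a", 0, 1000, 16384),
    Gpu("node-a", 1, 400, 8192, ["p1", "p2"]),
    Gpu("node-b", 0, 700, 12288, ["p3"]),
]

chosen = pick_gpu(gpus, compute_milli=300, memory_mib=4096)
too_big = pick_gpu(gpus, compute_milli=500, memory_mib=32768)
```

In this example `chosen` is GPU 1 on `node-a`, the occupied GPU with the least leftover compute, while `too_big` is `None` because no GPU has enough free memory.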

How the Components Work Together

[Figure: GPU architecture diagram]

The GPU optimization flow operates as a continuous loop:

1. Metrics Collection: The DCGM exporter collects pod-level GPU compute and memory metrics on every GPU node, including nodes where multiple pods share the same physical GPU.
2. Rightsizing Analysis: The ScaleOps rightsizing engine analyzes collected metrics against the workload’s configured policy. It determines whether a workload’s GPU compute or memory allocation should be adjusted: scaling up to prevent performance degradation or scaling down to reclaim unused capacity.
3. Fractional Allocation: When a workload is automated, ScaleOps removes its whole-GPU request and creates a capacity pod that holds the physical GPU. The capacity pod inherits the workload’s scheduling metadata, triggers node provisioning if needed, and enables multiple workload pods to share the same device.
4. Scheduling: The ScaleOps scheduler evaluates available capacity across all GPUs in the cluster and places the workload on the best fit, using bin-packing to consolidate onto fewer active GPUs.
5. Continuous Optimization: The loop repeats, with metrics driving ongoing rightsizing adjustments. As workload patterns change, GPU allocations adapt automatically.
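The rightsizing step of this loop can be pictured with a simple rule: take a high percentile of observed usage and add headroom. This page does not describe the actual policy or algorithm, so the percentile, the headroom value, and the function below are purely illustrative assumptions.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def recommend_compute_milli(samples_milli, current_milli,
                            percentile=95, headroom_pct=20, max_milli=1000):
    """Recommend a compute allocation (thousandths of a GPU) from observed usage.

    Hypothetical rule: nearest-rank percentile of the samples plus headroom,
    capped at one full GPU. With no samples, keep the current allocation.
    """
    if not samples_milli:
        return current_milli
    ordered = sorted(samples_milli)
    rank = max(1, ceil_div(percentile * len(ordered), 100))  # nearest-rank method
    target = ceil_div(ordered[rank - 1] * (100 + headroom_pct), 100)
    return min(max_milli, max(1, target))


# Over-provisioned workload: holds a full GPU but peaks at 0.2 of it.
usage = [100, 120, 150, 200, 180, 160, 140, 130, 110, 190]
scaled_down = recommend_compute_milli(usage, current_milli=1000)

# Under-provisioned workload: allocated 0.3 but consistently uses 0.5.
scaled_up = recommend_compute_milli([500, 500, 500], current_milli=300)
```

The same function covers both directions named in step 2: `scaled_down` reclaims unused capacity, and `scaled_up` raises an allocation that usage has outgrown.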

FAQ

Can I use my own DCGM exporter instead of the ScaleOps DCGM exporter?

No. ScaleOps GPU Optimization relies on the ScaleOps enhanced DCGM exporter, which provides pod-level GPU metrics. A standard dcgm-exporter alone does not provide the data ScaleOps needs for rightsizing and fractional GPU sharing. If you already run another DCGM exporter, see the next question for how both can coexist.

What happens if I already have a DCGM exporter running?

Most DCGM metrics can be collected concurrently by more than one exporter on the same node without conflict. A small set of profiling metrics is different: only one DCGM exporter on a node can collect them at a time. When ScaleOps detects another DCGM exporter with profiling enabled, it turns off its own collection of the overlapping profiling metrics that would conflict. That avoids failures and keeps your exporter’s behavior intact while ScaleOps still ingests everything else it can collect safely in parallel.

To prevent duplicate metrics, ScaleOps scrapes from your DCGM exporter only those metrics that it does not already collect itself.
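The coexistence rules above can be sketched with set operations. The metric names are real DCGM field names, but which metrics each exporter collects is assumed for illustration, and the function is a hypothetical reading of the behavior described here, not ScaleOps code.

```python
def plan_collection(own_metrics, own_profiling, external_metrics, external_profiling_enabled):
    """Return (metrics ScaleOps collects itself, metrics it scrapes from the other exporter)."""
    collect = set(own_metrics)
    if external_profiling_enabled:
        # Only one exporter per node can collect profiling metrics, so the
        # overlapping ones are left to the exporter that is already running.
        collect -= set(own_profiling) & set(external_metrics)
    # Scrape only what is not already collected, to avoid duplicates.
    scrape = set(external_metrics) - collect
    return collect, scrape


own = {"DCGM_FI_DEV_GPU_UTIL", "DCGM_FI_DEV_FB_USED", "DCGM_FI_PROF_GR_ENGINE_ACTIVE"}
own_profiling = {"DCGM_FI_PROF_GR_ENGINE_ACTIVE"}
external = {"DCGM_FI_DEV_GPU_UTIL", "DCGM_FI_PROF_GR_ENGINE_ACTIVE", "DCGM_FI_DEV_GPU_TEMP"}

collect, scrape = plan_collection(own, own_profiling, external, external_profiling_enabled=True)
```

In this example the profiling metric moves from the collected set to the scraped set, and `DCGM_FI_DEV_GPU_UTIL` is collected once rather than ingested from both exporters.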

Does the ScaleOps scheduler replace the Kubernetes scheduler?

No. The ScaleOps scheduler does not replace the Kubernetes scheduler in your cluster. Only GPU Optimization automated pods go through the ScaleOps scheduler, and they go exclusively through it. All other pods are handled by the default scheduler, so the two schedulers never conflict.

How does ScaleOps handle nodes with multiple GPUs?

ScaleOps creates one capacity pod per GPU device that is in use, not necessarily one for every GPU on the node. The scheduler tracks available capacity separately for each GPU and treats each as a distinct bin when making placement decisions.

When a new workload needs scheduling, the scheduler evaluates all GPUs across all nodes — not just per-node availability — so it can find the single best GPU for the workload regardless of which node it lives on.

Why does the capacity pod inherit the workload’s scheduling metadata?

The capacity pod is the entity that claims a GPU and triggers node provisioning. For it to land on the correct node type — respecting the workload’s affinity rules, availability zone constraints, or hardware requirements — it needs to carry the same scheduling configuration as the workload. ScaleOps handles this automatically, so no additional configuration is needed.