Policies

ScaleOps ships with several production-ready GPU policies that cover the most common GPU workload patterns.

Policies control how GPU compute and memory recommendations are calculated and applied.

Ready-to-Deploy Policies

ScaleOps learns your workload’s behavior and automatically applies the best-fitting policy.

You can also manually select a policy for your workload. The following policies are created by default:

Real-time
Real-time-mps
Near-real-time
Batch

When ScaleOps automatically detects a training or build workload, it also creates the corresponding policy:

Training
Build

Note: The policies are created in the scaleops-system namespace.

Real-time

A conservative policy designed for latency-sensitive inference workloads. Uses a longer history window and higher headroom than the Near-real-time policy to accommodate variable request patterns.

Recommended for: Model serving, real-time inference, and latency-sensitive GPU workloads.

Real-time-mps

Similar to the Real-time policy, but uses NVIDIA MPS (Multi-Process Service) to optimize GPU sharing performance.

Recommended for: Inference workloads that require low latency and high throughput.

Note: ScaleOps enforces workload isolation when using the Real-time-mps policy: different automated workloads assigned this policy cannot share the same GPU. Each MPS-enabled workload receives its own dedicated GPU device, ensuring no interference between workloads.
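
The isolation rule in the note can be illustrated with a small check. This is a minimal sketch, not ScaleOps code: the `find_mps_conflicts` function and the `(workload, policy, gpu_id)` tuple shape are hypothetical, and the check models only the stated rule that two different workloads assigned the Real-time-mps policy do not share a GPU.

```python
from collections import defaultdict

# Policy name taken from the list of default policies.
MPS_POLICY = "Real-time-mps"


def find_mps_conflicts(placements):
    """Return GPUs hosting more than one distinct Real-time-mps workload.

    `placements` is an iterable of (workload, policy, gpu_id) tuples.
    This shape is a hypothetical one chosen for illustration only.
    """
    mps_workloads_by_gpu = defaultdict(set)
    for workload, policy, gpu_id in placements:
        if policy == MPS_POLICY:
            mps_workloads_by_gpu[gpu_id].add(workload)
    # A GPU violates the rule only if two *different* workloads landed on it.
    return {
        gpu_id: sorted(workloads)
        for gpu_id, workloads in mps_workloads_by_gpu.items()
        if len(workloads) > 1
    }
```

Several entries for the same workload on one GPU are not reported, since the rule concerns different workloads.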

Near-real-time

A policy that balances responsiveness with efficiency for near-real-time inference workloads. Uses a shorter history window and tighter headroom than the Real-time policy, making it a good fit for workloads with more predictable request patterns that don’t require the most conservative resource cushion.

Recommended for: Inference workloads with moderate latency tolerance and relatively stable GPU utilization.

Batch

A high-efficiency policy for non-latency-sensitive workloads. Uses the longest history window and minimal headroom, prioritizing GPU utilization over responsiveness. Suitable for workloads that run to completion without serving live traffic.

Recommended for: Offline scoring, batch inference pipelines, and other throughput-oriented GPU workloads.

Training

An efficiency-focused policy for training workloads that typically have consistent GPU utilization.

Recommended for: Model training, fine-tuning, and other GPU workloads with predictable utilization.

Build

A policy optimized for machine learning development workloads. Provides moderate resource allocation for interactive experimentation and prototyping.

Recommended for: GPU-accelerated build and experimentation workloads.
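
For manual selection, the "Recommended for" guidance across the policies above can be condensed into a small helper. This is a hedged sketch only: the `suggest_policy` function, its parameters, and the decision order are assumptions made for illustration, and it is not the logic ScaleOps uses for automatic policy selection. Only the policy names come from this page.

```python
def suggest_policy(kind, latency_tolerant=False, stable_utilization=False,
                   wants_mps=False):
    """Suggest a policy name from coarse workload traits.

    `kind` is one of "training", "build", "batch", or "inference".
    All parameters are hypothetical and exist only for this example.
    """
    if kind == "training":
        return "Training"
    if kind == "build":
        return "Build"
    if kind == "batch":
        # Offline scoring and other throughput-oriented work.
        return "Batch"
    if kind != "inference":
        raise ValueError(f"unknown workload kind: {kind!r}")
    if wants_mps:
        # Low-latency inference that should use NVIDIA MPS GPU sharing.
        return "Real-time-mps"
    if latency_tolerant and stable_utilization:
        # Moderate latency tolerance and relatively stable GPU utilization.
        return "Near-real-time"
    # Default for latency-sensitive inference: the most conservative cushion.
    return "Real-time"
```

For example, `suggest_policy("inference", latency_tolerant=True, stable_utilization=True)` returns `"Near-real-time"`, while an inference workload with no other traits falls back to `"Real-time"`.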