Policies
ScaleOps comes with several out-of-the-box production-ready GPU policies that cover the most common GPU workload patterns.
Policies control how GPU compute and memory recommendations are calculated and applied.
Ready-to-Deploy Policies
ScaleOps learns your workload’s behavior and automatically applies the policy that best fits it.
You can also manually select a policy for your workload. The following policies are created by default: Real-time, Real-time-mps, Near-real-time, and Batch.
When ScaleOps automatically detects a training or build workload, it also creates the Training or Build policy, respectively.
Note: The policies are created in the scaleops-system namespace.
Real-time
A conservative policy designed for latency-sensitive inference workloads. Uses a longer history window and higher headroom to accommodate variable request patterns.
Recommended for: Model serving, real-time inference, and latency-sensitive GPU workloads.
Real-time-mps
Similar to the Real-time policy, but uses NVIDIA MPS (Multi-Process Service) for more efficient GPU sharing.
Recommended for: Inference workloads that require low latency and high throughput.
Note: ScaleOps enforces workload isolation under the Real-time-mps policy: different automated workloads assigned this policy cannot share the same GPU. Each MPS-enabled workload receives its own dedicated GPU device, so workloads do not interfere with one another.
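The isolation rule in the note above can be illustrated with a small check. This is a minimal sketch, not ScaleOps code: the function name `find_mps_conflicts`, the `(workload, policy, gpu_id)` tuple layout, and the workload and GPU names are all hypothetical. It also assumes that several pods of the same workload may land on one GPU, which the note does not state either way.

```python
from collections import defaultdict

MPS_POLICY = "Real-time-mps"


def find_mps_conflicts(placements):
    """Find GPUs shared by more than one distinct MPS workload.

    placements: iterable of (workload, policy, gpu_id) tuples, one per pod.
    This data model is hypothetical and only illustrates the rule.
    Returns a dict mapping gpu_id to the sorted list of conflicting workloads.
    """
    mps_workloads_by_gpu = defaultdict(set)
    for workload, policy, gpu_id in placements:
        # Only workloads assigned the MPS policy are subject to the rule.
        if policy == MPS_POLICY:
            mps_workloads_by_gpu[gpu_id].add(workload)
    return {
        gpu_id: sorted(workloads)
        for gpu_id, workloads in mps_workloads_by_gpu.items()
        if len(workloads) > 1
    }


# Made-up placements: two different MPS workloads end up on gpu-0.
example_placements = [
    ("chat-api", "Real-time-mps", "gpu-0"),
    ("chat-api", "Real-time-mps", "gpu-0"),    # same workload, second pod
    ("search-api", "Real-time-mps", "gpu-0"),  # different MPS workload
    ("ranker", "Real-time-mps", "gpu-1"),
]
conflicts = find_mps_conflicts(example_placements)
```

An empty result means the placement respects the rule; any entry names a GPU on which two or more different MPS workloads would collide.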
Near-real-time
A policy that balances responsiveness with efficiency for near-real-time inference workloads. Uses a shorter history window and tighter headroom than the Real-time policy, making it a good fit for workloads with more predictable request patterns that don’t require the most conservative resource cushion.
Recommended for: Inference workloads with moderate latency tolerance and relatively stable GPU utilization.
Batch
A high-efficiency policy for non-latency-sensitive workloads. Uses the longest history window and minimal headroom, prioritizing GPU utilization over responsiveness. Suitable for workloads that run to completion without serving live traffic.
Recommended for: Offline scoring, batch inference pipelines, and other throughput-oriented GPU workloads.
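To make the interaction between history window and headroom concrete, here is a rough sketch. It does not reproduce how ScaleOps computes recommendations: the peak-plus-headroom formula, the `PolicySketch` class, the window sizes, the headroom fractions, and the usage samples are all assumptions chosen only to preserve the relative ordering described above (Batch has the longest window and the least headroom, Real-time has a longer window and more headroom than Near-real-time).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicySketch:
    name: str
    history_window: int  # number of most recent samples considered (hypothetical)
    headroom: float      # fractional cushion added on top of the peak (hypothetical)


# Illustrative values only; they are not ScaleOps defaults.
POLICIES = {
    "Real-time": PolicySketch("Real-time", history_window=8, headroom=0.30),
    "Near-real-time": PolicySketch("Near-real-time", history_window=4, headroom=0.15),
    "Batch": PolicySketch("Batch", history_window=12, headroom=0.05),
}


def recommend(usage, policy):
    """Return the peak usage inside the policy's window, plus its headroom."""
    if not usage:
        raise ValueError("usage history is empty")
    window = usage[-policy.history_window:]  # most recent samples only
    return max(window) * (1.0 + policy.headroom)


# Made-up GPU memory samples in GiB, oldest first, with one older spike (14.0).
usage_gib = [9.0, 10.0, 9.5, 10.0, 10.5, 14.0, 10.0, 9.5, 10.0, 11.0, 10.5, 10.0]

for policy in POLICIES.values():
    print(f"{policy.name}: {recommend(usage_gib, policy):.2f} GiB")
```

With these made-up numbers, Real-time still sees the older spike and adds the largest cushion (about 18.2 GiB), Near-real-time has already forgotten the spike (about 12.65 GiB), and Batch sees the spike but adds almost nothing on top of it (about 14.7 GiB).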
Training
An efficiency-focused policy for training workloads that typically have consistent GPU utilization.
Recommended for: Model training, fine-tuning, and other GPU workloads with predictable utilization.
Build
A policy optimized for machine learning development workloads. Provides moderate resource allocation for interactive experimentation and prototyping.
Recommended for: GPU-accelerated build and experimentation workloads.
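When selecting a policy manually, the "Recommended for" guidance above can be condensed into a small lookup. The sketch below is only a reading aid, not the detection logic ScaleOps uses: the function `suggest_policy`, its parameters, and the workload kinds are hypothetical, and the policy names are taken from the headings above.

```python
def suggest_policy(kind, latency_sensitive=False, use_mps=False):
    """Map a coarse workload description to one of the policy names above.

    kind: one of "inference", "batch", "training", "build" (hypothetical labels).
    latency_sensitive: True for real-time serving, False for moderate tolerance.
    use_mps: True if NVIDIA MPS-based GPU sharing is wanted for inference.
    """
    if kind == "training":
        return "Training"
    if kind == "build":
        return "Build"
    if kind == "batch":
        return "Batch"
    if kind == "inference":
        if latency_sensitive:
            return "Real-time-mps" if use_mps else "Real-time"
        # Moderate latency tolerance and relatively stable utilization.
        return "Near-real-time"
    raise ValueError(f"unknown workload kind: {kind!r}")


# Example: a latency-sensitive model server without MPS.
choice = suggest_policy("inference", latency_sensitive=True)
```

The sketch deliberately keeps the inputs coarse; in practice the automatic selection described at the top of this page is the primary mechanism, and a manual choice like this is the fallback.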