Spot Optimization

ScaleOps Spot Optimization helps you reduce costs by intelligently managing Kubernetes workloads across Spot and On-Demand instances. It operates at the workload level, analyzing workload configurations and providing automatic scheduling to safely shift replicas to Spot nodes while ensuring the required number of replicas remain on On-Demand nodes for stability.

ℹ️

This feature is currently available for AWS and Azure clusters with Karpenter installed, and on Azure and GCP with Cluster Autoscaler enabled.

Key Features

Split One Workload to Spot and On-Demand: Optimizes at the pod level, shifting pods of the same workload between On-Demand and Spot to ensure uptime while leveraging Spot’s lower cost
Workload-level Policies: A policy is applied to each workload, controlling the Spot optimization behavior and continuously maintaining it, responding to any disruptions with fallback to On-Demand when needed
Context-aware Auto Policy Detection: ScaleOps automatically detects the best policy for each workload to maximize savings while meeting availability and performance needs
Workload-Based Custom Policies: Define custom policies with tailored logic to fine-tune Spot optimization behavior

Getting Started

ℹ️

Before starting, ensure you have completed the cloud-specific prerequisites for your platform. See the Cloud Provider Support section below for links to detailed setup instructions.

Navigate to the Spot Optimization Workloads page.
Identify workloads that were automatically detected and assigned the Spot Friendly or Spot Friendly High Availability policy.
Enable automation for the selected workloads, or click Automate Now to automate all the workloads.
The Preparation Steps will be executed automatically.
Once completed, the workload’s replicas will be safely evicted and updated to support Spot scheduling, while maintaining workload availability.
ScaleOps will begin scheduling pods on Spot instances based on your configuration.
As replica counts change or pods restart, ScaleOps will continuously maintain the desired Spot replica percentage for each workload.
Review the Spot tab in the workload overview to see the Spot percentage, workload cost over time, and more.

Policies & Automation

How It Works

ScaleOps automatically manages your workloads across Spot and On-Demand instances to maximize cost savings while maintaining availability.

Built-in Policies

ScaleOps comes with several built-in policies to handle different spot optimization scenarios:

spot-friendly-high-availability: Up to 80% Spot, with at least 1 replica kept on On-Demand and fallback to On-Demand
spot-friendly: Up to 100% Spot, with fallback to On-Demand
keep-original: Preserves the workload’s original scheduling behavior
always-spot: Forces up to 100% Spot, without fallback to On-Demand
always-on-demand: Forces 100% On-Demand scheduling
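
As a quick reference, the built-in policies above can be written down as data. The sketch below is illustrative only: the `SpotPolicy` class and its field names are hypothetical and are not the ScaleOps policy schema. The On-Demand minimum of 0 for policies other than spot-friendly-high-availability is an assumption inferred from their "up to 100% Spot" description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpotPolicy:
    """Hypothetical model of a Spot Optimization policy (not the ScaleOps schema)."""
    name: str
    # Upper bound on the share of replicas placed on Spot.
    # None means "leave the workload's scheduling untouched".
    spot_percentage: Optional[int]
    # Replicas that must stay on On-Demand nodes (assumed 0 where not stated).
    on_demand_min_replicas: int
    # Whether Spot replicas may fall back to On-Demand when Spot is unavailable.
    # Not meaningful for keep-original and always-on-demand; set to False there.
    fallback_to_on_demand: bool


BUILT_IN_POLICIES = {
    policy.name: policy
    for policy in (
        SpotPolicy("spot-friendly-high-availability", 80, 1, True),
        SpotPolicy("spot-friendly", 100, 0, True),
        SpotPolicy("keep-original", None, 0, False),
        SpotPolicy("always-spot", 100, 0, False),
        SpotPolicy("always-on-demand", 0, 0, False),
    )
}

if __name__ == "__main__":
    for policy in BUILT_IN_POLICIES.values():
        print(policy)
```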

Workload Detection

ScaleOps automatically detects workloads that are suitable for Spot instances and applies appropriate policies based on workload characteristics:

Spot-Friendly High Availability Policy

For workloads that require high availability, ScaleOps applies the spot-friendly-high-availability policy which:

Maintains up to 80% of replicas on Spot nodes (with a minimum of 1 replica on On-Demand nodes)
Applied to workloads that have multiple replicas or autoscaling capabilities
Includes automatic fallback to On-Demand nodes if Spot instances are not available, ensuring workload availability

Spot-Friendly Policy

For workloads that can run entirely on Spot instances, ScaleOps applies the spot-friendly policy which:

Maintains up to 100% of replicas on Spot instances
Applied to workloads that are already configured for spot instances or don’t require high availability guarantees
Includes automatic fallback to On-Demand nodes if Spot instances are not available, ensuring workload continuity

ℹ️

Both spot-friendly policies include automatic fallback to On-Demand nodes when Spot instances are not available. This ensures your workloads continue running even during Spot capacity constraints.

Keep-Original Policy

For workloads that don’t match the spot-friendly detection criteria, ScaleOps automatically assigns the keep-original policy:

Maintains the original behavior of the workload when automated
Applied to workloads that don’t meet the criteria for spot-friendly policies
Serves as a placeholder that can be changed to any other policy based on your requirements
No changes to the workload’s scheduling behavior when automated

Always-Spot Policy

The always-spot policy forces workloads to run exclusively on Spot instances without any fallback to On-Demand nodes. This policy is suitable for:

Non-critical workloads that can tolerate interruptions
Batch processing jobs that can be restarted
Development and testing environments where cost savings are prioritized over availability

Always-On-Demand Policy

The always-on-demand policy ensures workloads run exclusively on On-Demand instances, providing maximum stability and availability. This policy is suitable for:

Critical production workloads that cannot tolerate interruptions
Workloads with strict SLA requirements
Applications that require guaranteed resource availability

Policy Configuration

The Spot Optimization policy includes the following configurable fields:

Spot Replica Percentage

Defines the target percentage of pods that should run on Spot instances
Ensures the required number of On-Demand replicas is maintained
Dynamically adjusts the placement of new and existing pods while maintaining high availability

On-demand minimum replicas

Defines the minimum number of replicas that should run on On-Demand nodes
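
To make the interaction between the Spot replica percentage and the On-Demand minimum concrete, here is a small worked sketch. The function name and the choice to round the Spot share down are assumptions made for illustration; the source does not state how ScaleOps rounds.

```python
from typing import Tuple


def split_replicas(replicas: int, spot_percentage: int, on_demand_min: int = 0) -> Tuple[int, int]:
    """Return (spot, on_demand) replica counts for a workload.

    Assumption: the Spot share is rounded down, and the On-Demand minimum
    always wins when the two settings conflict.
    """
    if replicas < 0 or on_demand_min < 0 or not 0 <= spot_percentage <= 100:
        raise ValueError("replicas and on_demand_min must be >= 0, spot_percentage in 0..100")

    # The On-Demand minimum cannot exceed the total number of replicas.
    on_demand_floor = min(on_demand_min, replicas)
    # Spot share implied by the percentage alone, rounded down.
    spot_by_percentage = replicas * spot_percentage // 100
    # Never take more Spot replicas than the On-Demand minimum leaves room for.
    spot = min(spot_by_percentage, replicas - on_demand_floor)
    return spot, replicas - spot


if __name__ == "__main__":
    print(split_replicas(10, 80, 1))  # (8, 2)
    print(split_replicas(3, 80, 1))   # (2, 1)
    print(split_replicas(1, 80, 1))   # (0, 1)
```

With a single replica and an On-Demand minimum of 1, nothing moves to Spot under this reading, regardless of the configured percentage.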

Fallback to On-Demand Toggle

Controls whether workloads should fall back to On-Demand instances
Helps maintain availability during Spot instance shortages
If disabled, the Spot replicas will run only if Spot instances are available
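
The effect of the fallback toggle can be sketched as follows. This is a simplified model, not ScaleOps code: `place_replicas` and its parameters are hypothetical, and Spot availability is reduced to a single number of schedulable Spot replicas.

```python
from typing import Dict


def place_replicas(target_spot: int, target_on_demand: int,
                   spot_capacity: int, fallback_to_on_demand: bool) -> Dict[str, int]:
    """Model where replicas end up when Spot capacity is limited.

    spot_capacity is the (hypothetical) number of replicas that Spot nodes
    can currently host.
    """
    on_spot = min(target_spot, max(spot_capacity, 0))
    shortfall = target_spot - on_spot
    if fallback_to_on_demand:
        # Replicas that could not get Spot capacity run on On-Demand instead.
        return {"spot": on_spot, "on_demand": target_on_demand + shortfall, "pending": 0}
    # Without fallback, those replicas wait until Spot capacity is available.
    return {"spot": on_spot, "on_demand": target_on_demand, "pending": shortfall}


if __name__ == "__main__":
    print(place_replicas(8, 2, spot_capacity=5, fallback_to_on_demand=True))
    print(place_replicas(8, 2, spot_capacity=5, fallback_to_on_demand=False))
```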

Optimize workloads with existing PDB

Allows Spot automation for workloads that already define a Pod Disruption Budget (PDB)
When enabled, ScaleOps respects the existing PDB and proceeds with Spot optimization
When disabled, workloads with an existing customer PDB will not be optimized, and an optimization gap will be displayed

Policy Detection Rules

Policy Rules are the rules that ScaleOps uses to detect workloads and attach the relevant policy to them. The following rules are currently available:

Minimum replica count requirement, as defined by the Deployment’s spec.replicas field
Termination period requirements, as defined by the Deployment’s spec.template.spec.terminationGracePeriodSeconds field
Unevictable workloads (i.e. workloads marked with a Karpenter annotation that prevents eviction, or that have a PDB that blocks eviction)
Has HPA, to automatically detect workloads that have a horizontal pod autoscaler
Node lifecycle requirements, to detect workloads that are already configured to run on Spot or On-Demand nodes

The rules are visible and configurable in the Rules tab.
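
One way to picture how such rules could combine into a policy decision is the sketch below. It is a guess at the general shape, not the ScaleOps detection logic: the `WorkloadFacts` type, the thresholds, the rule order, and the treatment of workloads pinned to On-Demand are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds; the real values are configurable in the Rules tab.
MIN_REPLICAS_FOR_HA = 2
MAX_TERMINATION_GRACE_SECONDS = 120


@dataclass
class WorkloadFacts:
    """Inputs that mirror the detection rules listed above."""
    replicas: int
    termination_grace_period_seconds: int
    unevictable: bool
    has_hpa: bool
    node_lifecycle: Optional[str] = None  # "spot", "on-demand", or None if unconstrained


def detect_policy(workload: WorkloadFacts) -> str:
    """Return the name of the built-in policy this sketch would assign."""
    # Workloads that cannot be evicted, or that take too long to stop, are left alone.
    if workload.unevictable:
        return "keep-original"
    if workload.termination_grace_period_seconds > MAX_TERMINATION_GRACE_SECONDS:
        return "keep-original"
    # Workloads already configured for Spot can run fully on Spot.
    if workload.node_lifecycle == "spot":
        return "spot-friendly"
    # Assumption: workloads explicitly pinned to On-Demand are not changed.
    if workload.node_lifecycle == "on-demand":
        return "keep-original"
    # Multiple replicas or an HPA suggest a workload that can be split safely.
    if workload.replicas >= MIN_REPLICAS_FOR_HA or workload.has_hpa:
        return "spot-friendly-high-availability"
    return "keep-original"


if __name__ == "__main__":
    print(detect_policy(WorkloadFacts(replicas=3, termination_grace_period_seconds=30,
                                      unevictable=False, has_hpa=False)))
```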

Automation Process

Preparation Steps

When Spot Optimization is enabled on the first workload, ScaleOps will automatically perform the following preparations:

Node Pool Preparation

Creates mirrored node pools with the same configuration as the original node pools, but configured for the desired lifecycle (Spot or On-Demand).

DaemonSet Handling

Updates DaemonSets to ensure they can run on the mirrored node pools. This results in a single rollout, and newly created DaemonSets will be updated automatically.

ℹ️

If needed, ScaleOps automatically integrates with ArgoCD for seamless GitOps workflows. For more details, see Integrating ArgoCD and the ScaleOps Application.

Pod Disruption Budget (PDB) creation

Adds a Pod Disruption Budget for workloads if not already present, configured to maintain the required number of replicas running on On-Demand nodes.
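
For readers unfamiliar with PDBs, the sketch below builds a minimal `policy/v1` PodDisruptionBudget manifest as a Python dictionary. The manifest fields are standard Kubernetes, but everything else is an assumption: the source does not describe the PDB that ScaleOps generates, how its selector is scoped to On-Demand replicas, or how it is named. The function name, the `-on-demand` suffix, and the example labels are placeholders.

```python
import json
from typing import Dict


def build_pdb(workload_name: str, namespace: str,
              match_labels: Dict[str, str], min_available: int) -> dict:
    """Build a policy/v1 PodDisruptionBudget manifest (illustrative only)."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {
            # Hypothetical naming scheme, chosen for this example.
            "name": f"{workload_name}-on-demand",
            "namespace": namespace,
        },
        "spec": {
            # Voluntary evictions are refused if they would drop the number of
            # healthy matching pods below this value.
            "minAvailable": min_available,
            "selector": {"matchLabels": dict(match_labels)},
        },
    }


if __name__ == "__main__":
    pdb = build_pdb("checkout", "shop", {"app": "checkout"}, min_available=1)
    print(json.dumps(pdb, indent=2))
```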

After the preparations are made, ScaleOps will maintain the configured Spot percentage through:

Active pod eviction management that preserves both the required number of replicas on On-Demand nodes and the overall availability of the workloads
Pod scheduling control to ensure workloads are scheduled on the correct node lifecycle

Continuous Optimization

Actively maintains the target Spot percentage
Handles pod evictions gracefully
Manages pod scheduling to maintain desired distribution
Always keeps the required number of replicas running on On-Demand nodes and preserves the overall availability of the workloads

Optimization Gaps

ScaleOps identifies and reports optimization gaps when workloads cannot be optimized due to specific constraints or limitations. Common optimization gaps include:

Pending cloning node pools: As described in the Preparation Steps, new node pools are created upon initial automation. This optimization gap is shown while that process is in progress.
Insufficient Spot Quota (AKS & GKE): Your cloud provider subscription doesn’t have sufficient Spot instance quota to provision the required Spot nodes. Increase the Spot quota in your cloud provider account.
Workload Optimization Postponed to Next Karpenter Consolidation: The optimization is waiting for the next Karpenter consolidation. Learn more in the Karpenter docs.
Workload cannot run on any Karpenter node pool: The workload cannot run on nodes managed by Karpenter, and therefore optimization cannot be applied. Review your workload and Karpenter configurations to ensure they are compatible.

Automate a High Number of Workloads for Maximum Value

To achieve optimal cost savings with Spot Optimization, it’s important to automate a high number of workloads. This follows from how Spot Optimization behaves on the different cloud providers:

AWS, Azure with Karpenter: ScaleOps uses preferred affinity for spot-friendly workloads. With only a small number of workloads automated, Karpenter may choose to schedule pods on existing on-demand nodes instead of provisioning new spot nodes, limiting your cost savings potential.

Azure, GCP with Default Cluster Autoscaler: ScaleOps uses required affinity, which forces the cluster autoscaler to provision Spot nodes when needed. However, with limited workloads, you may not see a reduction in your total node count; instead, Spot nodes are added alongside existing On-Demand nodes, increasing your total infrastructure cost.
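
The difference between preferred and required affinity can be seen in the Kubernetes node affinity structure itself. The sketch below builds both variants as Python dictionaries. The affinity field names are standard Kubernetes. The default label `karpenter.sh/capacity-type=spot` is the capacity-type label Karpenter sets on its nodes, but whether ScaleOps uses this exact label, weight, or structure is an assumption. Clusters using Cluster Autoscaler would need the provider's own Spot node label, which can be passed in through the parameters.

```python
def spot_node_affinity(required: bool,
                       label_key: str = "karpenter.sh/capacity-type",
                       label_value: str = "spot") -> dict:
    """Return a pod-spec `affinity` block that targets Spot nodes."""
    expression = {"key": label_key, "operator": "In", "values": [label_value]}
    if required:
        # Hard constraint: the pod stays Pending until a matching node exists,
        # which pushes the autoscaler to provision Spot capacity.
        return {
            "nodeAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [{"matchExpressions": [expression]}]
                }
            }
        }
    # Soft constraint: the scheduler favors Spot nodes but may still place the
    # pod on an existing On-Demand node that has room.
    return {
        "nodeAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {"weight": 100, "preference": {"matchExpressions": [expression]}}
            ]
        }
    }


if __name__ == "__main__":
    print(spot_node_affinity(required=False))
    print(spot_node_affinity(required=True))
```

Because the preferred form is only a scoring hint, a pod can land on an On-Demand node with spare room, which is why a small number of automated workloads may yield limited savings on Karpenter clusters.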

Frequently Asked Questions

Q: When the workload is automated, the actual number of replicas on Spot is lower than recommended. Why?

ScaleOps maintains up to the configured Spot percentage, but when consolidation actually spins up Spot nodes depends on several factors. The consolidation process may be delayed if moving pods to Spot nodes won’t scale down any On-Demand nodes, or if consolidation is configured to run only during specific time windows.

Another scenario is when Spot nodes are simply not available in your region or availability zone. In this case, if fallback to On-Demand is enabled, ScaleOps will schedule pods on On-Demand nodes to maintain workload availability, resulting in fewer spot replicas than the target percentage.

For more details on specific constraints that might affect optimization, see the Configuration Constraints section in the cloud-specific documentation.

Q: How does ScaleOps manage high availability of the workload?

ScaleOps ensures workload availability through multiple mechanisms:

Replica Management: When applying scheduling rules to enable Spot or On-Demand pod placement, ScaleOps requires at least 90% of the workload replicas to be ready before proceeding with any changes. This ensures continuous service availability during the optimization process.

PDB Protection: ScaleOps automatically creates and manages Pod Disruption Budgets to safeguard On-Demand replicas. This prevents infrastructure components from violating the minimum required replicas that must run on On-Demand nodes. When a PDB already exists on a workload, ScaleOps respects the existing PDB instead of creating a new one.
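
The Replica Management check can be expressed as a simple gate. This is a sketch under assumptions: it reads the 90% figure as a minimum share of desired replicas that must be ready, and the function name and the integer comparison are illustrative choices, not ScaleOps internals.

```python
def can_proceed(ready_replicas: int, desired_replicas: int, min_ready_percent: int = 90) -> bool:
    """Return True if enough replicas are ready to continue with changes.

    Integer arithmetic avoids floating-point rounding at the threshold.
    """
    if ready_replicas < 0 or desired_replicas < 0:
        raise ValueError("replica counts must be non-negative")
    return ready_replicas * 100 >= desired_replicas * min_ready_percent


if __name__ == "__main__":
    print(can_proceed(9, 10))  # True: exactly 90% ready
    print(can_proceed(8, 10))  # False: only 80% ready
```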

Cloud Provider Support

Currently, Spot Optimization is available for the following cloud providers:

AWS with Karpenter - Full support with Karpenter integration (available in v1.18.0+)
Azure with Karpenter - Full support with Karpenter integration (available in v1.22.2+)
Azure with Cluster Autoscaler - Full support with AKS node pool management (available in v1.18.8+)
GCP with Cluster Autoscaler - Full support with GCP node pool management (available in v1.19.3+)

Each cloud provider may have specific requirements and configurations. Please refer to the cloud-specific documentation for detailed setup instructions.