Spot Optimization

ScaleOps Spot Optimization helps you reduce costs by intelligently managing Kubernetes workloads across Spot and On-Demand instances. It operates at the workload level: it analyzes workload configurations and automatically schedules replicas so that they shift safely to Spot nodes, while the required number of replicas remains on On-Demand nodes for stability.

ℹ️

This feature is currently available for AWS and Azure clusters with Karpenter installed, and on Azure and GCP with Cluster Autoscaler enabled.

Key Features

  • Split One Workload to Spot and On-Demand: Optimizes at the pod level, shifting pods of the same workload between On-Demand and Spot to ensure uptime while leveraging Spot’s lower cost
  • Workload-level Policies: A policy is applied to each workload, controlling the Spot optimization behavior and continuously maintaining it, responding to any disruptions with fallback to On-Demand when needed
  • Context-aware Auto Policy Detection: ScaleOps automatically detects the best policy for each workload to maximize savings while meeting availability and performance needs
  • Workload-Based Custom Policies: Define custom policies with tailored logic to fine-tune Spot optimization behavior

Getting Started

ℹ️

Before starting, ensure you have completed the cloud-specific prerequisites for your platform. See the Cloud Provider Support section below for links to detailed setup instructions.

  1. Navigate to the Spot Optimization Workloads page.
  2. Identify workloads that were automatically detected and assigned the Spot Friendly or Spot Friendly High Availability policy.
  3. Enable automation for the selected workloads, or click Automate Now to automate all the workloads.
  4. The Preparation Steps will be executed automatically.
  5. Once completed, the workload’s replicas will be safely evicted and updated to support Spot scheduling, while maintaining workload availability.
  6. ScaleOps will begin scheduling pods on Spot instances based on your configuration.
  7. As replica counts change or pods restart, ScaleOps will continuously maintain the desired Spot replica percentage for each workload.
  8. Review the Spot tab in the workload overview to see the Spot percentage, workload cost over time, and more.

Policies & Automation

How It Works

ScaleOps automatically manages your workloads across Spot and On-Demand instances to maximize cost savings while maintaining availability.

Built-in Policies

ScaleOps comes with several built-in policies to handle different spot optimization scenarios:

  • spot-friendly-high-availability: up to 80% Spot (minimum of 1 replica on On-Demand), with fallback to On-Demand
  • spot-friendly: up to 100% Spot, with fallback to On-Demand
  • keep-original: Preserves the workload’s original scheduling behavior
  • always-spot: Forces up to 100% Spot without fallback to On-Demand
  • always-on-demand: Forces 100% On-Demand scheduling

Workload Detection

ScaleOps automatically detects workloads that are suitable for Spot instances and applies appropriate policies based on workload characteristics:

Spot-Friendly High Availability Policy

For workloads that require high availability, ScaleOps applies the spot-friendly-high-availability policy which:

  • Maintains up to 80% of replicas on Spot nodes (with a minimum of 1 replica on On-Demand nodes)
  • Applied to workloads that have multiple replicas or autoscaling capabilities
  • Includes automatic fallback to On-Demand nodes if Spot instances are not available, ensuring workload availability

Spot-Friendly Policy

For workloads that can run entirely on spot instances, ScaleOps applies the spot-friendly policy which:

  • Maintains up to 100% of replicas on Spot instances
  • Applied to workloads that are already configured for spot instances or don’t require high availability guarantees
  • Includes automatic fallback to On-Demand nodes if Spot instances are not available, ensuring workload continuity
ℹ️

Both spot-friendly policies include automatic fallback to On-Demand nodes when Spot instances are not available. This ensures your workloads continue running even during Spot capacity constraints.

Keep-Original Policy

For workloads that don’t match the spot-friendly detection criteria, ScaleOps automatically assigns the keep-original policy:

  • Maintains the original behavior of the workload when automated
  • Applied to workloads that don’t meet the criteria for spot-friendly policies
  • Serves as a placeholder that can be changed to any other policy based on your requirements
  • No changes to the workload’s scheduling behavior when automated

Always-Spot Policy

The always-spot policy forces workloads to run exclusively on Spot instances without any fallback to On-Demand nodes. This policy is suitable for:

  • Non-critical workloads that can tolerate interruptions
  • Batch processing jobs that can be restarted
  • Development and testing environments where cost savings are prioritized over availability

Always-On-Demand Policy

The always-on-demand policy ensures workloads run exclusively on On-Demand instances, providing maximum stability and availability. This policy is suitable for:

  • Critical production workloads that cannot tolerate interruptions
  • Workloads with strict SLA requirements
  • Applications that require guaranteed resource availability

Policy Configuration

The Spot Optimization policy includes the following configurable fields:


Spot Replica Percentage

  • Defines the target percentage of pods that should run on Spot instances
  • Ensures the required number of On-Demand replicas is maintained
  • Applies dynamically to both new and existing pods while maintaining high availability

On-demand minimum replicas

  • Defines the minimum number of replicas that should run on On-Demand nodes
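As a rough illustration of how the Spot Replica Percentage and the On-Demand minimum replicas could interact, the sketch below computes a target split for one workload. The function name `split_replicas` and the choice to round the Spot share down are assumptions made for this example; the exact rounding ScaleOps applies is not documented here.

```python
from typing import Tuple


def split_replicas(total: int, spot_percent: int, on_demand_min: int) -> Tuple[int, int]:
    """Return (spot, on_demand) target replica counts for one workload.

    Hypothetical helper: rounding the Spot share down is an assumption.
    """
    if total < 0 or on_demand_min < 0 or not 0 <= spot_percent <= 100:
        raise ValueError("invalid input")
    # Upper bound implied by the Spot replica percentage (rounded down).
    by_percent = (total * spot_percent) // 100
    # Upper bound implied by the On-Demand minimum replicas.
    by_floor = max(total - on_demand_min, 0)
    spot = min(by_percent, by_floor)
    return spot, total - spot


# 80% Spot with at least 1 replica on On-Demand.
print(split_replicas(10, 80, 1))
print(split_replicas(1, 80, 1))
```

Under these assumptions, a single-replica workload with an On-Demand minimum of 1 stays entirely on On-Demand, because the On-Demand floor takes precedence over the percentage.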

Fallback to On-Demand Toggle

  • Controls whether workloads should fall back to On-Demand instances
  • Helps maintain availability during Spot instance shortages
  • If disabled, the Spot replicas will run only if Spot instances are available
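The effect of the fallback toggle can be sketched as a small placement function. This is a simplified model, not the ScaleOps scheduler: `place_spot_replicas` and its parameters are hypothetical names, and `spot_capacity` stands in for however many of the Spot-targeted replicas the cluster can currently place on Spot nodes.

```python
from typing import Dict


def place_spot_replicas(spot_target: int, spot_capacity: int, fallback: bool) -> Dict[str, int]:
    """Place replicas that are targeted at Spot, given limited Spot capacity.

    Hypothetical model: with fallback enabled, the shortfall runs on On-Demand;
    with fallback disabled, the shortfall waits until Spot capacity appears.
    """
    on_spot = min(spot_target, spot_capacity)
    shortfall = spot_target - on_spot
    if fallback:
        return {"spot": on_spot, "on_demand": shortfall, "pending": 0}
    return {"spot": on_spot, "on_demand": 0, "pending": shortfall}


# 8 replicas targeted at Spot, but capacity for only 5 of them.
print(place_spot_replicas(8, 5, fallback=True))
print(place_spot_replicas(8, 5, fallback=False))
```

In this model, enabling fallback trades a lower actual Spot percentage for availability, while disabling it keeps the Spot-targeted replicas off On-Demand at the cost of leaving them unscheduled during a shortage.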

Optimize workloads with existing PDB

  • Allows Spot automation for workloads that already define a Pod Disruption Budget (PDB)
  • When enabled, ScaleOps respects the existing PDB and proceeds with Spot optimization
  • When disabled, workloads with an existing customer PDB will not be optimized, and an optimization gap will be displayed

Policy Detection Rules

Policy Rules are the rules that ScaleOps uses to detect workloads and attach the relevant policy to them. The following rules are currently available:

  • Minimum replica count requirement, as defined by the Deployment’s spec.replicas field
  • Termination period requirements, as defined by the pod template’s terminationGracePeriodSeconds field (spec.template.spec.terminationGracePeriodSeconds on a Deployment)
  • Unevictable workloads (i.e. marked with a Karpenter annotation to avoid eviction, or having a PDB that blocks eviction)
  • Has HPA, to automatically detect workloads that have a horizontal pod autoscaler
  • Node lifecycle requirements, to detect workloads that are already configured to run on Spot or On-Demand nodes

The rules are visible and configurable in the Rules tab.
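To make the rule types above concrete, here is a minimal sketch of how such rules might combine into a policy decision. Everything in it is an assumption for illustration: the `Workload` fields, the thresholds (`min_replicas=2`, `max_grace_seconds=120`), and the rule ordering are hypothetical, and the actual rules are whatever is configured in the Rules tab.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Workload:
    replicas: int                      # spec.replicas
    grace_seconds: int = 30            # terminationGracePeriodSeconds
    has_hpa: bool = False              # a HorizontalPodAutoscaler targets it
    unevictable: bool = False          # eviction blocked by annotation or PDB
    lifecycle: Optional[str] = None    # "spot", "on-demand", or None


def detect_policy(w: Workload, min_replicas: int = 2, max_grace_seconds: int = 120) -> str:
    """Hypothetical rule evaluation; thresholds and ordering are assumptions."""
    # Workloads that cannot be evicted or need a long shutdown are left alone.
    if w.unevictable or w.grace_seconds > max_grace_seconds:
        return "keep-original"
    # Workloads pinned to On-Demand are not moved automatically.
    if w.lifecycle == "on-demand":
        return "keep-original"
    # Workloads already configured for Spot can run fully on Spot.
    if w.lifecycle == "spot":
        return "spot-friendly"
    # Multiple replicas or an HPA: keep part of the workload on On-Demand.
    if w.replicas >= min_replicas or w.has_hpa:
        return "spot-friendly-high-availability"
    return "keep-original"


print(detect_policy(Workload(replicas=3)))
print(detect_policy(Workload(replicas=1)))
```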

Automation Process

Preparation Steps

When Spot Optimization is enabled on the first workload, ScaleOps will automatically perform the following preparations:

Node Pool Preparation

Creates mirrored node pools with the same configuration as the original node pools, but configured for the desired lifecycle (Spot or On-Demand).

DaemonSet Handling

Updates DaemonSets to ensure they can run on the mirrored node pools. Existing DaemonSets go through a single rollout, and newly created DaemonSets are updated automatically.

ℹ️

If needed, ScaleOps automatically integrates with ArgoCD for seamless GitOps workflows. For more details, see Integrating ArgoCD and the ScaleOps Application.

Pod Disruption Budget (PDB) creation

Adds a Pod Disruption Budget to each workload that does not already have one, configured to maintain the required number of replicas running on On-Demand nodes.

After the preparations are made, ScaleOps will maintain the configured Spot percentage through:

  • Active pod eviction management, while maintaining the required number of replicas running on On-Demand nodes and the overall availability of the workloads
  • Pod scheduling control to ensure workloads are scheduled on the correct node lifecycle

Continuous Optimization

  • Actively maintains the target Spot percentage
  • Handles pod evictions gracefully
  • Manages pod scheduling to maintain desired distribution
  • Always ensures that the required number of replicas are running on On-Demand nodes and preserves the overall availability of the workloads

Optimization Gaps

ScaleOps identifies and reports optimization gaps when workloads cannot be optimized due to specific constraints or limitations. Common optimization gaps include:

  • Pending cloning node pools: As described in the Preparation Steps, new node pools are created upon initial automation. This optimization gap is shown while that process is in progress.
  • Insufficient Spot Quota (AKS & GKE): Your cloud provider subscription doesn’t have sufficient Spot instance quota to provision the required Spot nodes. Increase the Spot quota with your cloud provider.
  • Workload Optimization Postponed to Next Karpenter Consolidation: Optimization is waiting for the next Karpenter consolidation. Learn more in the Karpenter docs.
  • Workload cannot run on any Karpenter node pool: The workload cannot run on nodes managed by Karpenter, and therefore optimization cannot be applied. Review your workload and Karpenter configurations to ensure they are compatible.

Automate a High Number of Workloads for Maximum Value

To achieve optimal cost savings with Spot Optimization, it’s important to automate a high number of workloads. This follows from how Spot Optimization behaves on the different cloud providers:

AWS, Azure with Karpenter: ScaleOps uses preferred affinity for spot-friendly workloads. With only a small number of workloads automated, Karpenter may choose to schedule pods on existing on-demand nodes instead of provisioning new spot nodes, limiting your cost savings potential.

Azure, GCP with Default Cluster Autoscaler: ScaleOps uses required affinity, which forces the cluster autoscaler to provision spot nodes when needed. However, with limited workloads, you may not see a reduction in your total node count - instead, spot nodes are added alongside existing on-demand nodes, increasing your total infrastructure cost.
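The difference between preferred and required affinity can be seen in the shape of the Kubernetes nodeAffinity stanza. The sketch below builds both variants as plain Python dictionaries. The helper name `spot_node_affinity` is hypothetical, and the labels shown (`karpenter.sh/capacity-type` for Karpenter, `cloud.google.com/gke-spot` for GKE) are common Spot node labels used here as assumptions; which labels ScaleOps actually targets is not stated in this document.

```python
from typing import Any, Dict


def spot_node_affinity(label_key: str, label_value: str, required: bool) -> Dict[str, Any]:
    """Build a pod-spec `affinity` fragment that targets Spot nodes.

    Illustrative only: label choice and weight are assumptions.
    """
    term = {
        "matchExpressions": [
            {"key": label_key, "operator": "In", "values": [label_value]}
        ]
    }
    if required:
        # Hard constraint: the pod schedules only on matching nodes, so the
        # autoscaler has to provision a Spot node when none has room.
        return {
            "nodeAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [term]
                }
            }
        }
    # Soft constraint: matching nodes are favored, but an existing
    # On-Demand node with free capacity is still a valid target.
    return {
        "nodeAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {"weight": 100, "preference": term}
            ]
        }
    }


preferred = spot_node_affinity("karpenter.sh/capacity-type", "spot", required=False)
required = spot_node_affinity("cloud.google.com/gke-spot", "true", required=True)
print(preferred)
print(required)
```

The soft variant explains why a handful of automated workloads may stay on existing On-Demand nodes, and the hard variant explains why Spot nodes can be added without any On-Demand node being removed.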

Frequently Asked Questions

Q: Why is the actual Spot percentage of my workload lower than the configured percentage?

ScaleOps maintains up to the configured Spot percentage, but the actual consolidation that spins up Spot nodes depends on several factors. The consolidation process may be delayed if moving pods to Spot nodes won’t scale down any On-Demand nodes, or if consolidation is configured to run only during specific time windows.

Another scenario is when Spot nodes are simply not available in your region or availability zone. In this case, if fallback to On-Demand is enabled, ScaleOps will schedule pods on On-Demand nodes to maintain workload availability, resulting in fewer spot replicas than the target percentage.

For more details on specific constraints that might affect optimization, see the Configuration Constraints section in the cloud-specific documentation.

Q: How does ScaleOps manage high availability of the workload?

ScaleOps ensures workload availability through multiple mechanisms:

Replica Management: When applying scheduling rules to enable Spot or On-Demand pod placement, ScaleOps maintains 90% ready replicas before proceeding with any changes. This ensures continuous service availability during the optimization process.
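The 90% readiness gate described above can be expressed as a simple check. This is a sketch under assumptions: `can_proceed` is a hypothetical name, the percentage is taken relative to the desired replica count, and a workload with no desired replicas is treated as not eligible for changes.

```python
def can_proceed(ready: int, desired: int, min_ready_percent: int = 90) -> bool:
    """Return True when enough replicas are ready to allow the next change.

    Integer arithmetic avoids floating-point rounding at the threshold.
    """
    if desired <= 0:
        return False  # assumption: nothing to optimize without desired replicas
    return ready * 100 >= desired * min_ready_percent


print(can_proceed(ready=9, desired=10))
print(can_proceed(ready=8, desired=10))
```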

PDB Protection: ScaleOps automatically creates and manages Pod Disruption Budgets to safeguard On-Demand replicas. This prevents infrastructure components from violating the minimum required replicas that must run on On-Demand nodes. When a PDB already exists on a workload, ScaleOps respects the existing PDB instead of creating a new one.

Cloud Provider Support

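Currently, Spot Optimization is available for the following cloud providers:

  • AWS and Azure clusters with Karpenter installed
  • Azure and GCP clusters with Cluster Autoscaler enabled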

Each cloud provider may have specific requirements and configurations. Please refer to the cloud-specific documentation for detailed setup instructions.