Spark on Spot
Available in v1.26.27+

ScaleOps enables Apache Spark workloads to run on Spot instances. It automatically detects Spot-friendly workloads, so jobs stay resilient to Spot evictions while maximizing cost savings.

Overview

When running Spark workloads on Spot instances, interruptions can cause job failures and data loss. ScaleOps addresses this by automatically configuring Spark’s graceful decommissioning feature when applying the spark-spot-friendly-high-availability policy to your Spark executors.

The graceful decommissioning feature allows executors to:

  • Complete their current tasks before shutdown
  • Migrate shuffle and RDD data to other executors
  • Ensure job continuity during Spot instance interruptions

Note: Auto detection does not support Spark driver workloads; they are assigned the keep-original policy.

Automating Spark Workloads

To enable Spark on Spot optimization, verify that the spark-spot-friendly-high-availability policy is assigned to the workload, then automate the workload.

Once automated, ScaleOps will:

  • Automatically apply the decommission configuration to your Spark driver
  • Schedule executor pods on Spot instances while maintaining availability
  • Handle Spot interruptions gracefully by migrating data before eviction

How It Works

ScaleOps automatically applies the policy to Spark executors that meet the following requirements:

  • Dynamic allocation enabled.
  • Average stage time for the whole job is less than two minutes.
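The two requirements above can be sketched as a simple check. This is an illustration only, not ScaleOps' actual detection logic: the function name, the input shapes, and the use of a plain arithmetic mean over stage durations are assumptions. The only name taken from Spark itself is the spark.dynamicAllocation.enabled setting.

```python
from statistics import mean

# Threshold from the requirement above: average stage time under two minutes.
TWO_MINUTES_SECONDS = 120.0


def meets_executor_requirements(spark_conf, stage_durations_seconds):
    """Return True if a job looks eligible under the two stated requirements.

    spark_conf: dict of Spark settings as strings (hypothetical input shape).
    stage_durations_seconds: durations of the job's stages, in seconds
    (hypothetical input; how these are measured is not specified here).
    """
    dynamic_allocation = (
        spark_conf.get("spark.dynamicAllocation.enabled", "false").strip().lower()
        == "true"
    )
    if not dynamic_allocation or not stage_durations_seconds:
        # Without dynamic allocation, or without any stage data, treat the
        # job as not eligible.
        return False
    return mean(stage_durations_seconds) < TWO_MINUTES_SECONDS
```

For example, a job with dynamic allocation enabled and stages of 30, 90, and 45 seconds averages 55 seconds and would pass this check, while stages of 100 and 200 seconds average 150 seconds and would not.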

Once the spark-spot-friendly-high-availability policy is applied to Spark executors, ScaleOps automatically configures the Spark driver with the following decommission flags:

Spark Configuration Parameter                       Description
spark.decommission.enabled                          Activate graceful decommissioning
spark.storage.decommission.enabled                  Enable graceful shutdown for storage/shuffle data
spark.storage.decommission.shuffleBlocks.enabled    Ensures shuffle blocks are gracefully handled during decommissioning
spark.storage.decommission.rddBlocks.enabled        Ensures RDD cache blocks are gracefully handled during decommissioning
Note: These configuration parameters are managed automatically by ScaleOps when the spark-spot-friendly-high-availability policy is active on the workload. You do not need to manually configure these flags.
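For reference only, the sketch below shows what these four flags look like when written out as spark-submit arguments. It assumes each flag is set to "true", which the descriptions imply but this page does not state explicitly. The helper name to_spark_submit_args is hypothetical, and nothing here reflects how ScaleOps injects the settings internally.

```python
# The four decommission flags listed above.
# Assumption: each is enabled by setting it to "true".
DECOMMISSION_FLAGS = {
    "spark.decommission.enabled": "true",
    "spark.storage.decommission.enabled": "true",
    "spark.storage.decommission.shuffleBlocks.enabled": "true",
    "spark.storage.decommission.rddBlocks.enabled": "true",
}


def to_spark_submit_args(flags):
    """Render a settings dict as a flat list of `--conf key=value` arguments."""
    args = []
    for key, value in flags.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print the arguments on one line, as they would appear after `spark-submit`.
    print(" ".join(to_spark_submit_args(DECOMMISSION_FLAGS)))
```

Seeing the flags in this form can help when inspecting a running driver's configuration to confirm that the policy has taken effect.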

Prerequisites

  • Spark Version: 3.1.0 or later (required for graceful decommissioning support)
  • Cloud Provider: AWS, Azure, or GCP with Spot Optimization enabled (see Cloud Provider Support)
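The version prerequisite can be checked with a small helper before assigning the policy. This is a minimal sketch: the function name is hypothetical, and it assumes version strings of the form MAJOR.MINOR.PATCH, optionally followed by a vendor suffix after a hyphen.

```python
# Minimum Spark version stated in the prerequisites above.
MIN_SPARK_VERSION = (3, 1, 0)


def supports_graceful_decommissioning(spark_version):
    """Return True if the Spark version string is 3.1.0 or later.

    Assumes a dotted numeric version such as "3.4.1"; any suffix after a
    hyphen (for example a vendor build tag) is ignored.
    """
    numeric_part = spark_version.split("-")[0]
    parts = tuple(int(p) for p in numeric_part.split(".")[:3])
    # Pad short versions such as "3.1" to three components.
    parts += (0,) * (3 - len(parts))
    return parts >= MIN_SPARK_VERSION
```

Under these assumptions, "3.1.0" and "3.5.1" pass the check, while "3.0.3" does not.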