Spark on Spot
Available in v1.26.27+

ScaleOps enables running Apache Spark workloads on Spot instances by automatically detecting Spot-friendly workloads, keeping jobs resilient to Spot evictions while maximizing cost savings.

Overview

When running Spark workloads on Spot instances, interruptions can cause job failures and data loss. ScaleOps addresses this by automatically configuring Spark’s graceful decommissioning feature when applying the spark-spot-friendly-high-availability policy to your Spark executors.

The graceful decommissioning feature allows executors to:

Complete their current tasks before shutdown
Migrate shuffle and RDD data to other executors
Ensure job continuity during Spot instance interruptions

Note: Auto detection does not support Spark driver workloads. They are assigned the keep-original policy.

Automating Spark Workloads

To enable Spark on Spot optimization, verify that the spark-spot-friendly-high-availability policy is assigned to the workload, then automate the workload.

Once automated, ScaleOps will:

Automatically apply the decommission configuration to your Spark driver
Schedule executor pods on Spot instances while maintaining availability
Handle Spot interruptions gracefully by migrating data before eviction

How It Works

ScaleOps automatically applies the policy to Spark executors that meet the following requirements:

Dynamic allocation is enabled.
The average stage time for the whole job is less than two minutes.
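The two requirements above can be expressed as a simple check. The following is a minimal sketch, not ScaleOps' actual detection logic: the function name `is_spot_friendly` and its inputs are hypothetical, and it assumes that "dynamic allocation enabled" corresponds to Spark's `spark.dynamicAllocation.enabled` setting and that stage durations are available in seconds.

```python
from statistics import mean

# "Less than two minutes", taken from the requirements above.
MAX_AVG_STAGE_SECONDS = 120


def is_spot_friendly(spark_conf, stage_durations_seconds):
    """Illustrative check of the two documented requirements (hypothetical helper)."""
    # Requirement 1: dynamic allocation is enabled in the Spark configuration.
    # Assumption: this maps to the spark.dynamicAllocation.enabled property.
    value = spark_conf.get("spark.dynamicAllocation.enabled", "false")
    dynamic_allocation = str(value).lower() == "true"

    # Requirement 2: the average stage time for the whole job is under two minutes.
    if not stage_durations_seconds:
        return False  # no stage data, so the average cannot be evaluated
    short_stages = mean(stage_durations_seconds) < MAX_AVG_STAGE_SECONDS

    return dynamic_allocation and short_stages
```

For example, a job with dynamic allocation enabled and stage durations of 30, 90, and 150 seconds averages 90 seconds and would pass this check, while the same job with stages of 100 and 200 seconds would not.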

Once the spark-spot-friendly-high-availability policy is applied to Spark executors, ScaleOps automatically configures the Spark driver with the following decommission flags:

Spark Configuration Parameter	Description
`spark.decommission.enabled`	Activates graceful decommissioning of executors
`spark.storage.decommission.enabled`	Enables graceful shutdown for storage and shuffle data
`spark.storage.decommission.shuffleBlocks.enabled`	Ensures shuffle blocks are handled gracefully during decommissioning
`spark.storage.decommission.rddBlocks.enabled`	Ensures RDD cache blocks are handled gracefully during decommissioning

Note: ScaleOps manages these configuration parameters automatically when the spark-spot-friendly-high-availability policy is active on the workload. You do not need to configure these flags manually.
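For reference only, the sketch below shows what the four flags would look like if passed by hand as `spark-submit` `--conf` arguments. The value `true` for each flag is an assumption, since the page names the flags but not the values ScaleOps sets, and the helper `to_spark_submit_args` is hypothetical.

```python
# The four flags listed in the table above.
# Assumption: each is set to "true"; the source does not state the values.
DECOMMISSION_FLAGS = {
    "spark.decommission.enabled": "true",
    "spark.storage.decommission.enabled": "true",
    "spark.storage.decommission.shuffleBlocks.enabled": "true",
    "spark.storage.decommission.rddBlocks.enabled": "true",
}


def to_spark_submit_args(flags):
    """Render a flag mapping as a list of spark-submit --conf arguments."""
    args = []
    for key, value in flags.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # Print the arguments as they would appear on a spark-submit command line.
    print(" ".join(to_spark_submit_args(DECOMMISSION_FLAGS)))
```

With the policy active, none of this is required, because ScaleOps applies the configuration to the driver itself.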

Prerequisites

Spark Version: 3.1.0 or later (required for graceful decommissioning support)
Cloud Provider: AWS, Azure, or GCP with Spot Optimization enabled (see Cloud Provider Support)
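The version prerequisite can be checked before enabling the policy. This is a small sketch under stated assumptions: the helper name `supports_graceful_decommissioning` is hypothetical, and it assumes the Spark version is available as a dotted string, optionally with a suffix such as `-SNAPSHOT`.

```python
# Minimum Spark version from the prerequisites above.
MIN_SPARK_VERSION = (3, 1, 0)


def supports_graceful_decommissioning(version):
    """Return True if a dotted version string is 3.1.0 or later (hypothetical helper)."""
    # Drop any suffix after a hyphen, e.g. "3.1.0-SNAPSHOT" becomes "3.1.0".
    numeric = version.split("-")[0]
    # Keep at most major, minor, and patch components.
    parts = [int(piece) for piece in numeric.split(".")[:3]]
    # Pad short versions such as "3.2" to three components.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts) >= MIN_SPARK_VERSION
```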
