Spark on Spot (available in v1.26.27+)
ScaleOps enables running Apache Spark workloads on Spot instances by automatically detecting Spot-friendly workloads, keeping jobs resilient to Spot evictions while maximizing cost savings.
Overview
When Spark workloads run on Spot instances, interruptions can cause job failures and data loss. ScaleOps addresses this by automatically configuring Spark's graceful decommissioning feature when the spark-spot-friendly-high-availability policy is applied to your Spark executors.
The graceful decommissioning feature allows executors to:
- Complete their current tasks before shutdown
- Migrate shuffle and RDD data to other executors
- Ensure job continuity during Spot instance interruptions
Note: Spark driver workloads are not supported for auto-detection and are assigned the keep-original policy.
Automating Spark Workloads
To enable Spark on Spot optimization, verify that the spark-spot-friendly-high-availability policy is assigned to the workload, then automate the workload.
Once automated, ScaleOps will:
- Automatically apply the decommission configuration to your Spark driver
- Schedule executor pods on Spot instances while maintaining availability
- Handle Spot interruptions gracefully by migrating data before eviction
How It Works
ScaleOps automatically applies the spark-spot-friendly-high-availability policy to Spark executors that meet the following requirements:
- Dynamic allocation enabled.
- Average stage time for the whole job is less than two minutes.
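The two requirements above can be expressed as a simple check. The following is a minimal sketch, not ScaleOps' actual detection logic: the function name is_spot_friendly, the input shapes, and the handling of jobs with no recorded stages are assumptions made for illustration. spark.dynamicAllocation.enabled is the standard Spark setting for dynamic allocation.

```python
from statistics import mean

# Threshold taken from the requirement above: average stage time under two minutes.
MAX_AVG_STAGE_SECONDS = 120.0


def is_spot_friendly(spark_conf: dict[str, str], stage_durations_s: list[float]) -> bool:
    """Hypothetical eligibility check mirroring the two documented requirements."""
    dynamic_allocation = (
        spark_conf.get("spark.dynamicAllocation.enabled", "false").strip().lower() == "true"
    )
    # Assumption: a job with no recorded stages is treated as not eligible.
    if not dynamic_allocation or not stage_durations_s:
        return False
    return mean(stage_durations_s) < MAX_AVG_STAGE_SECONDS


# Example: dynamic allocation is on and the mean stage time is about 88 seconds.
conf = {"spark.dynamicAllocation.enabled": "true"}
print(is_spot_friendly(conf, [30.0, 95.0, 140.0]))  # True
```

Note that the check uses the average across all stages, so a single long stage does not disqualify a job as long as the overall mean stays under the threshold.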
Once the spark-spot-friendly-high-availability policy is applied to Spark executors, ScaleOps automatically configures the Spark driver with the following decommission flags:
| Spark Configuration Parameter | Description |
|---|---|
| spark.decommission.enabled | Activates graceful decommissioning |
| spark.storage.decommission.enabled | Enables graceful shutdown for storage/shuffle data |
| spark.storage.decommission.shuffleBlocks.enabled | Ensures shuffle blocks are gracefully handled during decommissioning |
| spark.storage.decommission.rddBlocks.enabled | Ensures RDD cache blocks are gracefully handled during decommissioning |
These configuration parameters are managed automatically by ScaleOps when the spark-spot-friendly-high-availability policy is active on the workload. You do not need to manually configure these flags.
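For reference only, the sketch below shows what the four flags look like when expressed as spark-submit style `--conf key=value` arguments, and how a driver configuration could be checked for them. It is an illustration, not how ScaleOps injects the settings: the value "true" for each flag and the helper names to_submit_args and missing_flags are assumptions.

```python
# The four decommission flags from the table above.
# Assumption: each flag is enabled by setting it to "true".
DECOMMISSION_FLAGS = {
    "spark.decommission.enabled": "true",
    "spark.storage.decommission.enabled": "true",
    "spark.storage.decommission.shuffleBlocks.enabled": "true",
    "spark.storage.decommission.rddBlocks.enabled": "true",
}


def to_submit_args(flags: dict[str, str]) -> list[str]:
    """Render config entries as spark-submit style `--conf key=value` pairs."""
    args: list[str] = []
    for key, value in flags.items():
        args += ["--conf", f"{key}={value}"]
    return args


def missing_flags(driver_conf: dict[str, str]) -> list[str]:
    """Return the decommission flags that are absent or not set to the expected value."""
    return [
        key
        for key, expected in DECOMMISSION_FLAGS.items()
        if driver_conf.get(key, "").strip().lower() != expected
    ]


print(" ".join(to_submit_args(DECOMMISSION_FLAGS)))
```

A check like missing_flags can be useful when inspecting a running driver's configuration to confirm that the policy has taken effect.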
Prerequisites
- Spark Version: 3.1.0 or later (required for graceful decommissioning support)
- Cloud Provider: AWS, Azure, or GCP with Spot Optimization enabled (see Cloud Provider Support)
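The Spark version prerequisite can be verified before automating a workload. This is a minimal sketch under the assumption that the version is available as a string such as "3.4.1"; the helper names parse_version and supports_graceful_decommissioning are hypothetical.

```python
import re

# Minimum version from the prerequisite above.
MIN_SPARK_VERSION = (3, 1, 0)


def parse_version(version: str) -> tuple[int, int, int]:
    """Parse the leading major.minor[.patch] numbers, ignoring suffixes like -SNAPSHOT."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version.strip())
    if match is None:
        raise ValueError(f"unrecognized Spark version: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def supports_graceful_decommissioning(version: str) -> bool:
    """True when the version meets the documented 3.1.0 minimum."""
    return parse_version(version) >= MIN_SPARK_VERSION


print(supports_graceful_decommissioning("3.0.3"))  # False
print(supports_graceful_decommissioning("3.5.1"))  # True
```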