Auto Healing and Burst Reaction
ScaleOps provides two real-time features that help workloads adapt to resource utilization anomalies and sudden load changes:
- Auto Healing — Responds to CPU stress, OOM events, and disk evictions with proactive resource adjustments
- Burst Reaction — Detects usage spikes and temporarily increases recommendations within minutes
Auto Healing
Auto Healing responds to CPU stress and OOM events with proactive resource adjustments to restore workload stability.
CPU Stress Auto Healing
ScaleOps continuously monitors cluster nodes for CPU stress conditions and implements proactive measures to alleviate stress and restore optimal workload performance.
What It Does
ScaleOps automatically identifies and resolves CPU stress conditions:
- Resource Optimization: Automatically increases CPU requests for eligible pods to facilitate recovery and prevent recurring stress conditions
- Permanent Recommendations: Provides recommendations for permanent CPU resource increases for workloads experiencing repeated CPU stress
- Scheduling Optimization: Optimizes pod scheduling to distribute workload across nodes more effectively
Out-of-Memory (OOM) Auto Healing
ScaleOps monitors and responds to various OOM-related events to protect workload stability and prevent memory-related failures.
What It Does
ScaleOps automatically detects and responds to memory-related issues:
- Memory Allocation Optimization: Automatically recommends increased memory allocation for affected workloads
- Policy Compliance: Respects workload policies when determining healing actions
- Usage-Based Recommendations: Calculates memory request increases based on recent usage patterns with additional margins for stability
Note: ScaleOps will not trigger OOM healing for limit-based OOM events unless the workload’s policy explicitly permits ScaleOps to manage resource limits.
OOM Event Types and Triggers
ScaleOps responds to multiple OOM-related event types that indicate memory pressure or failures:
- Container OOMKilled (Exit Code 137) — Triggered when a container is terminated due to exceeding its memory limit or by the kernel OOM killer.
- Kubelet Eviction: Node Memory Pressure — Occurs when the kubelet evicts pods due to node-wide memory pressure; ScaleOps applies increased memory recommendations to the restarted pod.
- System OOM (OOM-System Event) Available in v1.16.6+ — Triggered when the node lacks sufficient memory for system processes, causing user container restarts (status 137).
Disk Eviction Auto-Healing
When a pod is evicted as a result of node disk stress, ScaleOps detects the eviction and increases the ephemeral storage recommendation for that workload so future pods avoid the same issue.
What It Does
- Detects evictions caused by ephemeral storage limit violations
- Increases ephemeral storage recommendations for affected workloads
- Applies updated recommendations to new pods to prevent repeated evictions
Configure disk eviction auto-healing in the policy’s ephemeral storage settings. For GitOps, use spec.autoHealing.enabledByResource["ephemeral-storage"]: true.
Burst Reaction
Burst Reaction detects sudden spikes in CPU or memory usage and temporarily increases resource recommendations to handle the load. It reacts within minutes when usage consistently exceeds current recommendations.
What It Does
- Rapid Response: Reacts within minutes when usage consistently exceeds current recommendations
- Temporary Override: Increases resource recommendations temporarily to handle the burst
- Automatic Expiry: The override automatically expires after a set duration, returning to normal recommendations
- Burst Handling: Helps workloads handle traffic spikes, batch jobs, and other bursts without manual intervention
This mechanism reduces the risk of performance issues or outages during bursts while maintaining cost efficiency during normal operation.
Feature Configuration
Auto healing and Burst Reaction are enabled by default and can be configured through the rightsizing policy interface.

Example: CPU Stress Auto Healing Workflow
The following example demonstrates the CPU Stress Auto Healing process as observed in the Workload Overview:
1. Stress Detection
ScaleOps detects a node experiencing CPU stress conditions.

2. Resource Optimization
ScaleOps applies increased CPU recommendations to resolve the node stress condition and restore optimal performance.

3. Event Tracking
Healing events are recorded in the workload overview timeline and Events tab for comprehensive monitoring and analysis.


Summary
ScaleOps provides two real-time features—Auto Healing and Burst Reaction—that respond to resource anomalies and usage spikes without manual intervention. Together they maintain workload health and cluster stability while keeping workloads optimally sized for both normal operation and sudden load changes.