Auto Healing and Burst Reaction

ScaleOps provides two real-time features that help workloads adapt to resource utilization anomalies and sudden load changes:

Auto Healing — Responds to CPU stress, OOM events, and disk evictions with proactive resource adjustments
Burst Reaction — Detects usage spikes and temporarily increases recommendations within minutes

Auto Healing

Auto Healing responds to CPU stress and OOM events with proactive resource adjustments to restore workload stability.

CPU Stress Auto Healing

ScaleOps continuously monitors cluster nodes for CPU stress conditions and implements proactive measures to alleviate stress and restore optimal workload performance.

What It Does

ScaleOps automatically identifies and resolves CPU stress conditions:

Resource Optimization: Automatically increases CPU requests for eligible pods to facilitate recovery and prevent recurring stress conditions
Permanent Recommendations: Provides recommendations for permanent CPU resource increases for workloads experiencing repeated CPU stress
Scheduling Optimization: Optimizes pod scheduling to distribute workload across nodes more effectively

Out-of-Memory (OOM) Auto Healing

ScaleOps monitors and responds to various OOM-related events to protect workload stability and prevent memory-related failures.

What It Does

ScaleOps automatically detects and responds to memory-related issues:

Memory Allocation Optimization: Automatically recommends increased memory allocation for affected workloads
Policy Compliance: Respects workload policies when determining healing actions
Usage-Based Recommendations: Calculates memory request increases based on recent usage patterns with additional margins for stability

Note: ScaleOps will not trigger OOM healing for limit-based OOM events unless the workload’s policy explicitly permits ScaleOps to manage resource limits.

OOM Event Types and Triggers

ScaleOps responds to multiple OOM-related event types that indicate memory pressure or failures:

Container OOMKilled (Exit Code 137) — Triggered when a container is terminated due to exceeding its memory limit or by the kernel OOM killer.
Kubelet Eviction: Node Memory Pressure — Occurs when the kubelet evicts pods due to node-wide memory pressure; ScaleOps applies increased memory recommendations to the restarted pod.
System OOM (OOM-System Event) Available in v1.16.6+ — Triggered when the node lacks sufficient memory for system processes, causing user container restarts (status 137).

Disk Eviction Auto-Healing

When a pod is evicted as a result of node disk stress, ScaleOps detects the eviction and increases the ephemeral storage recommendation for that workload so future pods avoid the same issue.

What It Does

Detects evictions caused by ephemeral storage limit violations
Increases ephemeral storage recommendations for affected workloads
Applies updated recommendations to new pods to prevent repeated evictions

Configure disk eviction auto-healing in the policy’s ephemeral storage settings. For GitOps, use spec.autoHealing.enabledByResource["ephemeral-storage"]: true.

Burst Reaction

Burst Reaction detects sudden spikes in CPU or memory usage and temporarily increases resource recommendations to handle the load. It reacts within minutes when usage consistently exceeds current recommendations.

What It Does

Rapid Response: Reacts within minutes when usage consistently exceeds current recommendations
Temporary Override: Increases resource recommendations temporarily to handle the burst
Automatic Expiry: The override automatically expires after a set duration, returning to normal recommendations
Burst Handling: Helps workloads handle traffic spikes, batch jobs, and other bursts without manual intervention

This mechanism reduces the risk of performance issues or outages during bursts while maintaining cost efficiency during normal operation.

Feature Configuration

Auto healing and Burst Reaction are enabled by default and can be configured through the rightsizing policy interface.

Auto Healing and Burst Reaction Policy

Example: CPU Stress Auto Healing Workflow

The following example demonstrates the CPU Stress Auto Healing process as observed in the Workload Overview:

1. Stress Detection

ScaleOps detects a node experiencing CPU stress conditions.

Auto Healing Policy

2. Resource Optimization

ScaleOps applies increased CPU recommendations to resolve the node stress condition and restore optimal performance.

Auto Healing Policy

3. Event Tracking

Healing events are recorded in the workload overview timeline and Events tab for comprehensive monitoring and analysis.

Auto Healing Policy

Summary

ScaleOps provides two real-time features—Auto Healing and Burst Reaction—that respond to resource anomalies and usage spikes without manual intervention. Together they maintain workload health and cluster stability while keeping workloads optimally sized for both normal operation and sudden load changes.