Understanding the Impact of ScaleOps Component Downtime

Overview

At ScaleOps, we understand that uptime and reliability are crucial for your operations. This document describes what happens if any key ScaleOps component experiences downtime. The key takeaway is that running workloads are not impacted: they continue to operate with the optimized resource requests previously set by ScaleOps. The potential impact is limited to delayed recommendation updates and to newly created pods, which may receive stale recommendations or fall back to their original requests.

Component Failure Scenarios and Impact

Component	Failure Scenario	Side Effect
Recommender	The Recommender is down	Recommendation updates might be delayed or blocked.
Updater	The Updater is down	Optimization for running pods might be delayed or blocked.
Prometheus	Prometheus is down for more than 3 hours	Recommendation updates might be delayed.
Admission Controller	The Admission Controller is down	Newly created pods will be created with their original resource requests.
Essential Metrics (kube-state-metrics)	Missing essential metrics	Recommendation updates might be blocked.
Dashboard	The ScaleOps UI (dashboard) is down	The ScaleOps UI will be inaccessible, but running workloads remain unaffected.

Detailed Explanation

Recommender Down

The recommender is responsible for analyzing metrics and generating updated resource recommendations for your workloads. If the recommender is down, updates to recommendations may be delayed or blocked. However, your running workloads will continue to operate with the last applied optimized settings. Newly created pods may receive stale recommendations until the recommender is back online.

Updater Down

The updater is responsible for identifying pods whose resource allocation needs to change and scheduling them for an update. If the updater is down, the optimization of running pods may be delayed or blocked. Despite this, your existing workloads will not be impacted and will continue running with their current resource allocations.

Prometheus Down for More Than 3 Hours

Prometheus is essential for collecting the resource-based metrics that are used to calculate recommendations. If Prometheus is down for more than 3 hours, recommendation updates might be delayed due to a lack of fresh data. However, this will not affect the performance of your running workloads, as they will continue to operate with their last-applied resource requests.
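The 3-hour threshold above can be expressed as a simple staleness check. The following is a minimal sketch for illustration only, not ScaleOps code: the function name, the constant name, and the idea of tracking a "last sample" timestamp are assumptions; only the 3-hour value comes from this document.

```python
from datetime import datetime, timedelta, timezone

# The 3-hour value is taken from this document; the names are hypothetical.
PROMETHEUS_OUTAGE_THRESHOLD = timedelta(hours=3)


def recommendation_updates_at_risk(last_sample_at: datetime, now: datetime) -> bool:
    """Return True once the gap in metrics exceeds the 3-hour threshold."""
    return now - last_sample_at > PROMETHEUS_OUTAGE_THRESHOLD


# Example with fixed timestamps so the result is reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(recommendation_updates_at_risk(now - timedelta(hours=1), now))  # False
print(recommendation_updates_at_risk(now - timedelta(hours=4), now))  # True
```

A check like this could feed an alert so that a prolonged Prometheus outage is noticed before recommendations fall noticeably behind.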

Admission Controller Down

The ScaleOps admission controller is responsible for modifying a workload's resource requests according to its allocation recommendation when pods are created. If the admission controller is down, newly created pods will default to their original resource requests instead of the optimized ones. Existing pods and their performance will not be affected.
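The fallback described above can be sketched as a small decision function. This is an illustrative model under stated assumptions, not the actual admission controller: the function name, its parameters, and the example request values are all hypothetical.

```python
from typing import Dict, Optional

# Resource requests keyed by resource name, e.g. {"cpu": "500m"}.
Requests = Dict[str, str]


def effective_requests(
    original: Requests,
    recommended: Optional[Requests],
    admission_controller_up: bool,
) -> Requests:
    """Model which requests a newly created pod ends up with.

    Hypothetical helper: the optimized requests apply only when the
    admission controller is available and a recommendation exists;
    otherwise the pod keeps its original requests.
    """
    if admission_controller_up and recommended is not None:
        return dict(recommended)
    return dict(original)


original = {"cpu": "500m", "memory": "512Mi"}
recommended = {"cpu": "200m", "memory": "300Mi"}

# Controller healthy: the new pod gets the optimized requests.
print(effective_requests(original, recommended, admission_controller_up=True))
# Controller down: the new pod falls back to its original requests.
print(effective_requests(original, recommended, admission_controller_up=False))
```

In both cases pods that are already running are untouched; the function only models what happens at pod creation time.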

Missing Essential Metrics / kube-state-metrics

The availability of essential metrics, such as those provided by kube-state-metrics, is critical for generating accurate recommendations. If these metrics are missing, recommendation updates might be blocked. However, this does not impact running workloads, which will continue with their last applied settings.

Dashboard Down

If the ScaleOps UI (dashboard) is down, you will not be able to use it to view data or perform manual actions. However, you will still be able to control everything via GitOps, and your workloads will continue to run unaffected. Ongoing optimization is not affected either.

Conclusion

In the event of a component failure within ScaleOps, the most critical thing to remember is that your running workloads will remain unaffected. They will continue to run with the optimized resource allocations already in place. Any potential side effects will be limited to newly created pods or the ability to view or update recommendations, but these are temporary and will resolve once the affected component’s health is restored.