Understanding the Impact of ScaleOps Component Downtime
Overview
At ScaleOps, we understand that uptime and reliability are crucial for your operations. This document provides a clear overview of what happens if any of the key components of ScaleOps experiences downtime. The key takeaway is that running workloads will not be impacted, and they will continue to operate with the optimized resource requests previously set by ScaleOps. The only potential impact is on newly created pods, which may receive stale recommendations or default to their original requests.
Component Failure Scenarios and Impact
| Component | Failure Scenario | Side Effect |
|---|---|---|
| Recommender | The Recommender is down | Recommendation updates might be delayed or blocked. |
| Updater | The Updater is down | Optimization for running pods might be delayed or blocked. |
| Prometheus | Prometheus is down for more than 3 hours | Recommendation updates might be delayed. |
| Admission Controller | The Admission Controller is down | Newly created pods will be created with their original resource requests. |
| Essential Metrics (kube-state-metrics) | Missing essential metrics | Recommendation updates might be blocked. |
| Dashboard | The ScaleOps UI (dashboard) is down | The ScaleOps UI will be inaccessible, but running workloads remain unaffected. |
Detailed Explanation
Recommender Down
The Recommender is responsible for analyzing metrics and generating updated resource recommendations for your workloads. If the Recommender is down, updates to recommendations may be delayed or blocked. However, your running workloads will continue to operate with the last applied optimized settings. Newly created pods may receive stale recommendations until the Recommender is back online.
Updater Down
The Updater is responsible for identifying pods whose resource allocation needs to change and scheduling them for an update. If the Updater is down, the optimization of running pods may be delayed or blocked. Despite this, your existing workloads will not be impacted and will continue running with their current resource allocations.
Prometheus Down for More Than 3 Hours
Prometheus is essential for collecting the resource-based metrics that are used to calculate recommendations. If Prometheus is down for more than 3 hours, recommendation updates might be delayed due to a lack of fresh data. However, this will not affect the performance of your running workloads, as they will continue to operate with their last applied resource requests.
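The 3-hour threshold above can be pictured as a simple duration check. The sketch below is only an illustrative model and not ScaleOps code: the function name, the constant, and the way the outage start is obtained are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Threshold taken from the text above; the constant name is hypothetical.
PROMETHEUS_OUTAGE_TOLERANCE = timedelta(hours=3)


def updates_may_be_delayed(
    outage_start: datetime,
    now: datetime,
    tolerance: timedelta = PROMETHEUS_OUTAGE_TOLERANCE,
) -> bool:
    """Return True once the outage has lasted longer than the tolerance.

    Running workloads are not part of this check: they keep their
    last applied resource requests regardless of the result.
    """
    return (now - outage_start) > tolerance


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 0, 0)
    print(updates_may_be_delayed(start, start + timedelta(hours=2)))
    print(updates_may_be_delayed(start, start + timedelta(hours=4)))
```

In this model, an outage shorter than the tolerance has no effect on recommendation updates, and a longer one only flags that updates might be delayed.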
Admission Controller Down
The ScaleOps Admission Controller is responsible for modifying a workload’s resource requests according to its allocation recommendation when pods are created. If the Admission Controller is down, newly created pods will default to their original resource requests instead of the optimized ones. Existing pods and their performance will not be affected.
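The fallback behavior for newly created pods can be summarized in a few lines. This is a minimal conceptual sketch, not the ScaleOps implementation: the function name, its parameters, and the dictionary shape used for resource requests are hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional

ResourceRequests = Dict[str, str]  # e.g. {"cpu": "500m", "memory": "512Mi"}


def requests_for_new_pod(
    original: ResourceRequests,
    recommended: Optional[ResourceRequests],
    admission_controller_up: bool,
) -> ResourceRequests:
    """Model which resource requests a newly created pod ends up with."""
    if admission_controller_up and recommended is not None:
        # The recommendation (possibly stale) overrides the original values.
        return {**original, **recommended}
    # Controller down or no recommendation: the pod keeps its original requests.
    return dict(original)


if __name__ == "__main__":
    original = {"cpu": "1000m", "memory": "1Gi"}
    recommended = {"cpu": "250m", "memory": "512Mi"}
    print(requests_for_new_pod(original, recommended, admission_controller_up=True))
    print(requests_for_new_pod(original, recommended, admission_controller_up=False))
```

The point of the model is that the failure mode is safe by default: when the Admission Controller cannot act, the pod is still created, just without the optimized values.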
Missing Essential Metrics / kube-state-metrics
The availability of essential metrics, such as those provided by kube-state-metrics, is critical for generating accurate recommendations. If these metrics are missing, recommendation updates might be blocked. However, this does not impact running workloads, which will continue with their last applied settings.
Dashboard Down
If the ScaleOps UI (dashboard) is down, you will not be able to use it to view data or perform manual actions. However, you’ll still be able to control everything via GitOps, and both your running workloads and the ongoing optimization will continue unaffected.
Conclusion
In the event of a component failure within ScaleOps, the most critical thing to remember is that your running workloads will remain unaffected. They will continue to run with the optimized resource allocations already in place. Any potential side effects will be limited to newly created pods or the ability to view or update recommendations, but these are temporary and will resolve once the affected component’s health is restored.