Troubleshoot

Overview

The ScaleOps Troubleshooting page is a dashboard for identifying, diagnosing, and resolving issues in your Kubernetes clusters. It provides real-time visibility into performance bottlenecks, resource waste, automation events, and system health metrics across those clusters.

The troubleshooting page provides the following features:

Interactive Dashboard: Customizable layout with draggable and resizable graphs
Time Range Selection: Analyze data over different time periods
Multi-Cluster Support: View metrics across multiple clusters simultaneously
Filtering Options: Filter by namespaces, labels, and annotations
Export Capabilities: Copy graph data and create custom dashboards
Real-time Updates: Live data refresh for current cluster status

Pre-Built Dashboards

The troubleshooting page provides the following pre-built dashboards:

Performance: Identifies performance issues and resource constraints affecting your workloads.
Overall Costs: Identifies resource waste and costs across the cluster.
ScaleOps Health: Monitors the health and performance of ScaleOps components.
Advanced Performance: Provides detailed node-level performance metrics including CPU/memory stalling, network throughput, disk I/O and PSI graphs.

Dashboard

Chart Categories

The charts are organized into the following categories:

Optimization: Focuses on ScaleOps automation activities and optimization events.
Performance: Identifies performance issues and resource constraints affecting your workloads.
Replicas: Tracks HPA scaling events and pod count changes.
Cost: Identifies resource waste and cost optimization opportunities.
Nodes: Provides insights into node-level performance and lifecycle management.
ScaleOps Workloads: Monitors the health and performance of ScaleOps components themselves.
Resource Quotas: Monitors namespace-level resource constraints and limitations.
Pressure Stall (PSI): Monitors resource contention and wait times at the node level.
Node I/O: Tracks network and disk I/O metrics across nodes.

Optimization

This category focuses on ScaleOps automation activities and optimization events.

Automation Events

Purpose: Tracks all automation events performed by ScaleOps
What it shows: Number and frequency of automated actions taken by the system

Optimized Pods

Purpose: Shows pods that have been optimized by ScaleOps
What it shows: Count of pods with automation and total number of pods

Automated Workloads

Purpose: Displays workloads currently under ScaleOps automation
What it shows: Number of workloads with automation and total number of workloads

Downscaled Workloads

Purpose: Tracks workloads that have been downscaled
What it shows: Workloads where resources have been reduced based on actual usage

Performance

This category identifies performance issues and resource constraints affecting your workloads.

CPU Under Provisioned on Stressed Nodes

Purpose: Identifies workloads with insufficient CPU resources on high-utilization nodes
What it shows: Workloads experiencing CPU pressure due to resource constraints

Memory Under Provisioned on Stressed Nodes

Purpose: Identifies workloads with insufficient memory resources on high-utilization nodes
What it shows: Workloads experiencing memory pressure due to resource constraints

CPU Under Provisioned

Purpose: Shows workloads with insufficient CPU requests across all nodes
What it shows: Workloads that consistently need more CPU than allocated

Memory Under Provisioned

Purpose: Shows workloads with insufficient memory requests across all nodes
What it shows: Workloads that consistently need more memory than allocated

Workload Disruptions

Purpose: Tracks various types of workload disruptions and evictions
What it shows: Disruptions caused by Karpenter, Cluster Autoscaler, VPA, and other tools

Most Disruptive Workloads

Purpose: Identifies workloads with the most disruptions
What it shows: Top workloads by disruption count

Node Disruptions

Purpose: Tracks node-level disruption events
What it shows: Node scale-downs, taint evictions, node removals, and spot interruptions

Downtime Events

Purpose: Monitors periods when workloads are unavailable
What it shows: Frequency and duration of downtime across workloads

Out of Memory Events

Purpose: Tracks memory-related failures and OOM kills
What it shows: Workloads that have been terminated due to memory exhaustion

CPU Throttling Events

Purpose: Monitors CPU throttling due to resource limits
What it shows: Workloads experiencing CPU throttling from resource constraints
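
To cross-check a throttling spike on a node, you can read the container's cgroup v2 cpu.stat file, which reports how many CPU enforcement periods elapsed (nr_periods) and in how many of them the container was throttled (nr_throttled). The sketch below parses that format and computes the throttled share of periods. The sample text and the function names are illustrative, and this is not necessarily how ScaleOps derives the chart.

```python
def parse_cpu_stat(text: str) -> dict[str, int]:
    """Parse the 'key value' lines of a cgroup v2 cpu.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_ratio(stats: dict[str, int]) -> float:
    """Share of enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0  # no CPU limit set, or no periods recorded yet
    return stats.get("nr_throttled", 0) / periods


# Illustrative sample, not captured from a real node.
SAMPLE_CPU_STAT = """usage_usec 1200000
user_usec 900000
system_usec 300000
nr_periods 200
nr_throttled 50
throttled_usec 750000
"""

ratio = throttled_ratio(parse_cpu_stat(SAMPLE_CPU_STAT))
print(f"throttled in {ratio:.0%} of periods")
```

A persistently high ratio usually means the CPU limit is too low for the workload's bursts, even when average usage looks modest.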

Liveness Probe Failures

Purpose: Tracks health check failures across workloads
What it shows: Workloads with failing liveness probes indicating health issues

ScaleOps Healing Statuses

Purpose: Monitors ScaleOps auto-healing activities
What it shows: Auto-healing and burst-reaction events over time

CPU Requests by Workload Type

Purpose: Shows CPU request distribution across workload types
What it shows: CPU requests breakdown by Deployment, StatefulSet, DaemonSet, etc.

Memory Requests by Workload Type

Purpose: Shows memory request distribution across workload types
What it shows: Memory requests breakdown by Deployment, StatefulSet, DaemonSet, etc.

Workload Pod Time to Ready

Purpose: Measures total time for pods to become ready
What it shows: End-to-end pod startup duration from creation to ready state

Workload Pod Scheduling Time

Purpose: Tracks time spent in pod scheduling phase
What it shows: Duration from pod creation to being scheduled on a node

Workload Pod Image Pulling Time

Purpose: Monitors container image pull duration
What it shows: Time spent pulling container images during pod startup

Workload Pod Init Container Runtime

Purpose: Tracks init container execution time
What it shows: Duration of init container runs before main containers start

Workload Pod Running to Ready

Purpose: Measures time from container running to pod ready
What it shows: Application startup time after containers begin running
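
The pod timing charts above break startup into phases. As a rough way to reproduce two of them outside the dashboard, the sketch below derives scheduling time and total time to ready from a pod object's metadata.creationTimestamp and the lastTransitionTime of its PodScheduled and Ready conditions. The example pod and the function names are hypothetical, and ScaleOps may measure these intervals differently.

```python
from datetime import datetime, timezone


def parse_ts(value: str) -> datetime:
    """Parse a Kubernetes timestamp such as '2024-05-01T10:00:00Z'."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def condition_time(pod: dict, cond_type: str):
    """Return when the given condition last became True, or None."""
    for cond in pod["status"]["conditions"]:
        if cond["type"] == cond_type and cond["status"] == "True":
            return parse_ts(cond["lastTransitionTime"])
    return None


def startup_durations(pod: dict) -> dict:
    """Seconds from pod creation to scheduling and to readiness."""
    created = parse_ts(pod["metadata"]["creationTimestamp"])
    scheduled = condition_time(pod, "PodScheduled")
    ready = condition_time(pod, "Ready")
    return {
        "scheduling_seconds": (scheduled - created).total_seconds() if scheduled else None,
        "time_to_ready_seconds": (ready - created).total_seconds() if ready else None,
    }


# Hypothetical pod, trimmed to the fields used above.
example_pod = {
    "metadata": {"creationTimestamp": "2024-05-01T10:00:00Z"},
    "status": {
        "conditions": [
            {"type": "PodScheduled", "status": "True", "lastTransitionTime": "2024-05-01T10:00:02Z"},
            {"type": "Ready", "status": "True", "lastTransitionTime": "2024-05-01T10:00:45Z"},
        ]
    },
}

print(startup_durations(example_pod))
```

Note that lastTransitionTime reflects the most recent transition, so for a pod that has lost and regained readiness the second value no longer describes the initial startup.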

Replicas

This category tracks HPA scaling events and pod count changes.

HPA Resource Trigger Change Events

Purpose: Monitors changes to HPA resource triggers
What it shows: Events when HPA scaling thresholds or targets are modified

HPA Scale Events

Purpose: Tracks HPA-initiated scaling operations
What it shows: Scale up and scale down events triggered by HPA

Pod Count

Purpose: Shows total pod count over time
What it shows: Number of running pods over time

Cost

This category helps identify resource waste and cost optimization opportunities.

CPU Allocatable

Purpose: Shows total allocatable CPU capacity across the cluster
What it shows: The allocatable CPU over time

Memory Allocatable

Purpose: Shows total allocatable memory capacity across the cluster
What it shows: The allocatable memory over time

CPU Request Increase

Purpose: Tracks increases in CPU requests over time
What it shows: Workloads with growing CPU resource requirements

Memory Request Increase

Purpose: Tracks increases in memory requests over time
What it shows: Workloads with growing memory resource requirements

Replicas Increase

Purpose: Monitors replica count increases
What it shows: Workloads scaling up their replica counts

Wasteful

Purpose: Identifies workloads with excessive resource allocation
What it shows: Monthly cost of over-provisioned resources by workload or namespace

Expensive

Purpose: Highlights the most costly workloads in your cluster
What it shows: Top spenders by workload or namespace

Smart Policy Waste

Purpose: Shows waste from workloads not following ScaleOps recommendations
What it shows: Resources wasted due to ignoring optimization suggestions

Custom Workloads Unidentified

Purpose: Identifies workloads that ScaleOps cannot categorize
What it shows: Workloads without proper labeling or identification

Wasted CPU

Purpose: Quantifies CPU resources that are allocated but unused
What it shows: CPU cores wasted due to over-provisioning

Wasted Memory

Purpose: Quantifies memory resources that are allocated but unused
What it shows: Memory wasted due to over-provisioning
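
One simple way to think about the Wasted CPU and Wasted Memory charts is as the gap between what a workload requests and what it actually uses. The sketch below applies that idea to CPU in millicores. The WorkloadSample type, the sample values, and the choice of a single usage figure per workload are assumptions for illustration; the usage statistic ScaleOps compares against is not specified here.

```python
from dataclasses import dataclass


@dataclass
class WorkloadSample:
    """Hypothetical per-workload CPU figures, in millicores."""
    name: str
    cpu_request_millicores: int
    cpu_usage_millicores: int


def wasted_millicores(sample: WorkloadSample) -> int:
    """Requested CPU that is not used; never negative."""
    return max(sample.cpu_request_millicores - sample.cpu_usage_millicores, 0)


samples = [
    WorkloadSample("api", 1000, 250),      # over-provisioned
    WorkloadSample("worker", 500, 650),    # under-provisioned, so no waste
    WorkloadSample("cache", 2000, 1500),   # over-provisioned
]

total_waste = sum(wasted_millicores(s) for s in samples)
print(f"total wasted CPU: {total_waste / 1000:.2f} cores")
```

Clamping at zero keeps under-provisioned workloads from offsetting waste elsewhere; those workloads belong in the CPU Under Provisioned chart instead.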

CPU Wasted due to Unsuggested Policies

Purpose: Shows CPU waste from not following ScaleOps recommendations
What it shows: Additional CPU costs from ignoring optimization advice

Memory Wasted due to Unsuggested Policies

Purpose: Shows memory waste from not following ScaleOps recommendations
What it shows: Additional memory costs from ignoring optimization advice

Init Container CPU Request Overhead

Purpose: Shows CPU overhead from init containers
What it shows: Additional CPU requested by init containers vs main containers

Init Container Memory Request Overhead

Purpose: Shows memory overhead from init containers
What it shows: Additional memory requested by init containers vs main containers
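
Init containers add overhead because Kubernetes sizes a pod by its effective request: the larger of the sum of the main containers' requests and the biggest single init container request (regular init containers run one at a time). The sketch below computes that overhead in millicores. The function names and figures are illustrative, sidecar-style init containers are ignored, and the chart's exact calculation may differ.

```python
def effective_request(main: list[int], init: list[int]) -> int:
    """Pod-level request the scheduler reserves, per the Kubernetes rule
    max(sum of main containers, largest init container)."""
    return max(sum(main), max(init, default=0))


def init_overhead(main: list[int], init: list[int]) -> int:
    """Extra request caused solely by init containers."""
    return effective_request(main, init) - sum(main)


# Main containers request 200m + 300m; one init container requests 1000m.
print(init_overhead(main=[200, 300], init=[1000, 100]))

# Init containers smaller than the main containers add no overhead.
print(init_overhead(main=[200, 300], init=[100]))
```

An overhead above zero means the pod holds capacity for its whole lifetime that only the init phase needed.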

Nodes

This category provides insights into node-level performance and lifecycle management.

Node CPU Utilization

Purpose: Shows CPU usage across all nodes in the cluster
What it shows: Percentage of CPU utilization and allocation per node

Node CPU Utilization (Aggregated)

Purpose: Shows aggregated CPU utilization statistics across nodes
What it shows: Average, p90, p99, and max CPU utilization across all nodes
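
To make the aggregated statistics concrete, the sketch below computes the average, p90, p99, and maximum over a list of per-node utilization percentages. It uses the nearest-rank percentile definition, which is an assumption; ScaleOps may use an interpolating method, and the sample values are invented.

```python
def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least
    pct percent of the data at or below it."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, clamped to at least rank 1.
    rank = max((pct * len(ordered) + 99) // 100, 1)
    return ordered[rank - 1]


def aggregate(values: list[float]) -> dict:
    """Summary statistics matching the aggregated chart's series."""
    return {
        "avg": sum(values) / len(values),
        "p90": percentile(values, 90),
        "p99": percentile(values, 99),
        "max": max(values),
    }


# Invented CPU utilization percentages for ten nodes.
node_cpu_utilization = [35, 42, 48, 51, 55, 60, 63, 70, 82, 94]
print(aggregate(node_cpu_utilization))
```

A wide gap between the average and p99 points to a few hot nodes rather than cluster-wide load, which is easy to miss in the per-node chart when there are many nodes.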

Node Memory Utilization

Purpose: Shows memory usage across all nodes in the cluster
What it shows: Percentage of memory utilization and allocation per node

Node Memory Utilization (Aggregated)

Purpose: Shows aggregated memory utilization statistics across nodes
What it shows: Average, p90, p99, and max memory utilization across all nodes

Node Ephemeral Storage Utilization

Purpose: Shows ephemeral storage usage across nodes
What it shows: Percentage of ephemeral storage utilized per node

Node CPU Allocation

Purpose: Shows how CPU resources are allocated across nodes
What it shows: CPU requests and limits distribution per node

Node Memory Allocation

Purpose: Shows how memory resources are allocated across nodes
What it shows: Memory requests and limits distribution per node

Node Ephemeral Storage Allocation

Purpose: Shows ephemeral storage allocation across nodes
What it shows: Ephemeral storage requests distribution per node

Node not Scaling Down Reason

Purpose: Identifies why nodes are not being scaled down
What it shows: Reasons preventing node removal (e.g., pod disruption budgets, taints)

Node Allocatable CPU Blocked by Reason

Purpose: Shows CPU capacity blocked by various reasons
What it shows: CPU resources unavailable due to taints, affinity rules, or other constraints

Node Allocatable Memory Blocked by Reason

Purpose: Shows memory capacity blocked by various reasons
What it shows: Memory resources unavailable due to taints, affinity rules, or other constraints

Node Life Cycle

Purpose: Tracks node creation, termination, and lifecycle events
What it shows: Node lifecycle events and their frequency

Node Instance Type

Purpose: Shows the distribution of node instance types
What it shows: Different instance types and their usage patterns

Node Conditions

Purpose: Monitors node health conditions
What it shows: Node conditions like Ready, MemoryPressure, DiskPressure, PIDPressure

Pods per Node

Purpose: Shows pod distribution across nodes
What it shows: Number of pods running on each node

Pods per Node (Aggregated)

Purpose: Shows aggregated pod count statistics across nodes
What it shows: Average, p90, p99, and max pod count across all nodes

Pods per Node (CPU Size)

Purpose: Shows pod count relative to node CPU capacity
What it shows: Pod density based on CPU allocation per node

Pods per Node (Memory Size)

Purpose: Shows pod count relative to node memory capacity
What it shows: Pod density based on memory allocation per node

Pods per Node (Ephemeral Storage Size)

Purpose: Shows pod count relative to ephemeral storage capacity
What it shows: Pod density based on ephemeral storage per node

Nodes per Node Pool

Purpose: Shows node distribution across node pools
What it shows: Number of nodes in each node pool or node group

Karpenter Node Consolidation Reasons

Purpose: Tracks Karpenter consolidation activities
What it shows: Reasons for Karpenter-initiated node consolidation events

Karpenter Node Pool CPU Allocation

Purpose: Shows CPU allocation across Karpenter node pools
What it shows: CPU distribution and capacity per Karpenter NodePool

Karpenter Node Pool Memory Allocation

Purpose: Shows memory allocation across Karpenter node pools
What it shows: Memory distribution and capacity per Karpenter NodePool

ScaleOps Workloads

This category monitors the health and performance of ScaleOps components themselves.

Version

Purpose: Shows the version distribution of ScaleOps components
What it shows: Different versions of ScaleOps workloads running in the cluster

ScaleOps CPU Usage

Purpose: Monitors CPU consumption by ScaleOps components
What it shows: CPU usage patterns of ScaleOps workloads over time

ScaleOps Memory Usage

Purpose: Monitors memory consumption by ScaleOps components
What it shows: Memory usage patterns of ScaleOps workloads over time

ScaleOps CPU Requests

Purpose: Shows CPU resource requests for ScaleOps components
What it shows: CPU allocation for ScaleOps workloads

ScaleOps Memory Requests

Purpose: Shows memory resource requests for ScaleOps components
What it shows: Memory allocation for ScaleOps workloads

ScaleOps Workloads Issues

Purpose: Identifies problems with ScaleOps components
What it shows: Issues affecting ScaleOps workload health and performance

ScaleOps Prometheus Volume

Purpose: Monitors Prometheus storage usage
What it shows: Current storage consumption by ScaleOps Prometheus instance

ScaleOps Prometheus Retention

Purpose: Shows Prometheus data retention configuration
What it shows: Volume size and retention settings for metrics storage

Resource Quotas

This category monitors namespace-level resource constraints and limitations.

Namespace Limitation by CPU Requests

Purpose: Shows CPU request quota usage across namespaces
What it shows: Percentage of CPU request quota consumed per namespace

Namespace Limitation by Memory Requests

Purpose: Shows memory request quota usage across namespaces
What it shows: Percentage of memory request quota consumed per namespace

Namespace Limitation by CPU Limits

Purpose: Shows CPU limit quota usage across namespaces
What it shows: Percentage of CPU limit quota consumed per namespace

Namespace Limitation by Memory Limits

Purpose: Shows memory limit quota usage across namespaces
What it shows: Percentage of memory limit quota consumed per namespace

Namespace Limitation by Pods

Purpose: Shows pod count limitations across namespaces
What it shows: Number of pods relative to namespace pod limits

Namespace Limitation by Replica Sets

Purpose: Shows replica set limitations across namespaces
What it shows: Number of replica sets relative to namespace limits
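
The percentages in this category correspond to comparing the used and hard values that a Kubernetes ResourceQuota reports in its status. The sketch below does this for requests.cpu. The quota_status dictionary is a hypothetical example, and the quantity parser handles only plain cores and the m (millicore) suffix.

```python
def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '500m' or '2' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def quota_usage_percent(used: str, hard: str) -> float:
    """Percentage of a CPU quota that is currently consumed."""
    hard_m = cpu_to_millicores(hard)
    if hard_m == 0:
        return 0.0
    return 100 * cpu_to_millicores(used) / hard_m


# Hypothetical ResourceQuota status for one namespace.
quota_status = {
    "hard": {"requests.cpu": "8"},
    "used": {"requests.cpu": "6500m"},
}

pct = quota_usage_percent(
    quota_status["used"]["requests.cpu"],
    quota_status["hard"]["requests.cpu"],
)
print(f"requests.cpu quota used: {pct:.2f}%")
```

Namespaces close to 100% will reject new pods whose requests do not fit the remaining quota, so these charts are a useful first check when deployments fail to scale.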

Pressure Stall (PSI)

This category monitors resource contention and wait times at the node level using Linux Pressure Stall Information.

Node CPU Wait Time (%)

Purpose: Monitors CPU pressure stall information
What it shows: Percentage of time processes are waiting for CPU resources

Node Memory Wait Time (%)

Purpose: Monitors memory pressure stall information
What it shows: Percentage of time processes are waiting for memory resources

Node Disk I/O Wait Time (%)

Purpose: Monitors I/O pressure stall information
What it shows: Percentage of time processes are waiting for disk I/O operations
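
Linux exposes PSI through the files /proc/pressure/cpu, /proc/pressure/memory, and /proc/pressure/io. Each line starts with some (at least one task stalled) or full (all non-idle tasks stalled), followed by stall percentages averaged over 10, 60, and 300 seconds and a cumulative total in microseconds. The sketch below parses that format so you can compare a node's raw values with the charts. The sample text is invented, and which window ScaleOps plots is not stated here.

```python
def parse_psi(text: str) -> dict[str, dict[str, float]]:
    """Parse the contents of a /proc/pressure/<resource> file."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        # Each field looks like 'avg10=1.53'.
        result[kind] = {key: float(value) for key, value in (f.split("=") for f in fields)}
    return result


# Invented sample in the format of /proc/pressure/memory.
SAMPLE_PSI = """some avg10=1.53 avg60=0.87 avg300=0.32 total=123456789
full avg10=0.40 avg60=0.21 avg300=0.05 total=2345678
"""

psi = parse_psi(SAMPLE_PSI)
print(f"some tasks stalled {psi['some']['avg10']}% of the last 10s")
print(f"all tasks stalled {psi['full']['avg10']}% of the last 10s")
```

On a real node, pass the file's contents instead of the sample, for example by reading /proc/pressure/memory, which requires a kernel built with PSI support.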

Node I/O

This category tracks network and disk I/O metrics across nodes.

Node Network Throughput

Purpose: Shows network bandwidth usage per node
What it shows: Incoming and outgoing network traffic rates

Node Network Throughput (Aggregated)

Purpose: Shows aggregated network throughput statistics
What it shows: Average, p90, p99, and max network throughput across all nodes

Total Network Dropped Packets per Node

Purpose: Monitors network packet drops
What it shows: Number of dropped packets per node indicating network issues

Node Disk Throughput

Purpose: Shows disk read/write bandwidth per node
What it shows: Disk throughput rates for read and write operations

Node Disk Throughput (Aggregated)

Purpose: Shows aggregated disk throughput statistics
What it shows: Average, p90, p99, and max disk throughput across all nodes

Node Disk IOPS

Purpose: Shows disk I/O operations per second per node
What it shows: Read and write IOPS for each node

Node Disk IOPS (Aggregated)

Purpose: Shows aggregated disk IOPS statistics
What it shows: Average, p90, p99, and max IOPS across all nodes

How to Use the Troubleshooting Page

Getting Started

Select Time Range: Choose an appropriate time period for analysis
Choose Dashboard: Select from predefined dashboards or create custom ones
Filter Data: Use namespace, label, and annotation filters to focus on specific workloads
Select Charts: Choose relevant graphs from the chart selector based on your investigation needs
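
The filtering step can be pictured as a predicate applied to each workload. The sketch below keeps workloads that are in one of the selected namespaces and carry every selected label. The workload dictionaries, the function name, and the rule that multiple labels combine with AND are assumptions for illustration, not a description of how the page evaluates filters.

```python
def matches(workload: dict, namespaces=None, labels=None) -> bool:
    """True if the workload passes the namespace and label filters.
    An empty or missing filter matches everything."""
    if namespaces and workload["namespace"] not in namespaces:
        return False
    workload_labels = workload.get("labels", {})
    return all(workload_labels.get(k) == v for k, v in (labels or {}).items())


# Hypothetical workloads.
workloads = [
    {"name": "checkout", "namespace": "shop", "labels": {"team": "payments", "tier": "backend"}},
    {"name": "catalog", "namespace": "shop", "labels": {"team": "search"}},
    {"name": "billing", "namespace": "finance", "labels": {"team": "payments"}},
]

selected = [
    w["name"]
    for w in workloads
    if matches(w, namespaces={"shop"}, labels={"team": "payments"})
]
print(selected)
```

Narrowing by namespace first and then by label tends to keep charts readable when investigating a single team's workloads.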

Filters

Copy Data: Click on graph elements to copy specific data points
Save Dashboards: Create and save custom dashboard configurations
Duplicate Dashboards: Create variations of dashboards for different use cases

Support

If you encounter issues with the troubleshooting page or need help interpreting the data, please contact your ScaleOps support team. The troubleshooting page is designed to provide comprehensive visibility into your cluster health and help you make informed decisions about optimization and resource management.