Troubleshoot
Overview
The ScaleOps Troubleshooting page is a dashboard for identifying, diagnosing, and resolving issues in your Kubernetes clusters. It provides real-time visibility into performance bottlenecks, resource waste, automation events, and system health metrics across your clusters.
The troubleshooting page provides the following features:
- Interactive Dashboard: Customizable layout with draggable and resizable graphs
- Time Range Selection: Analyze data over different time periods
- Multi-Cluster Support: View metrics across multiple clusters simultaneously
- Filtering Options: Filter by namespaces, labels, and annotations
- Export Capabilities: Copy graph data and create custom dashboards
- Real-time Updates: Live data refresh for current cluster status
Pre-Built Dashboards
The troubleshooting page provides the following pre-built dashboards:
- Performance: Identifies performance issues and resource constraints affecting your workloads.
- Overall Costs: Identifies resource waste and costs across the cluster.
- ScaleOps Health: Monitors the health and performance of ScaleOps components.
- Advanced Performance: Provides detailed node-level performance metrics, including CPU and memory stalling, network throughput, disk I/O, and PSI graphs.

Chart Categories
The charts are organized into the following categories:
- Optimization: Focuses on ScaleOps automation activities and optimization events.
- Performance: Identifies performance issues and resource constraints affecting your workloads.
- Replicas: Tracks HPA scaling events and pod count changes.
- Cost: Identifies resource waste and cost optimization opportunities.
- Nodes: Provides insights into node-level performance and lifecycle management.
- ScaleOps Workloads: Monitors the health and performance of ScaleOps components themselves.
- Resource Quotas: Monitors namespace-level resource constraints and limitations.
- Pressure Stall (PSI): Monitors resource contention and wait times at the node level.
- Node I/O: Tracks network and disk I/O metrics across nodes.
Optimization
This category focuses on ScaleOps automation activities and optimization events.
- Purpose: Tracks all automation events performed by ScaleOps
- What it shows: Number and frequency of automated actions taken by the system
- Purpose: Shows pods that have been optimized by ScaleOps
- What it shows: Count of pods with automation and total number of pods
- Purpose: Displays workloads currently under ScaleOps automation
- What it shows: Number of workloads with automation and total number of workloads
- Purpose: Tracks workloads that have been downscaled
- What it shows: Workloads where resources have been reduced based on actual usage
Performance
This category identifies performance issues and resource constraints affecting your workloads.
- Purpose: Identifies workloads with insufficient CPU resources on high-utilization nodes
- What it shows: Workloads experiencing CPU pressure due to resource constraints
- Purpose: Identifies workloads with insufficient memory resources on high-utilization nodes
- What it shows: Workloads experiencing memory pressure due to resource constraints
- Purpose: Shows workloads with insufficient CPU requests across all nodes
- What it shows: Workloads that consistently need more CPU than allocated
- Purpose: Shows workloads with insufficient memory requests across all nodes
- What it shows: Workloads that consistently need more memory than allocated
- Purpose: Tracks various types of workload disruptions and evictions
- What it shows: Disruptions caused by Karpenter, Cluster Autoscaler, VPA, and other tools
- Purpose: Identifies workloads with the most disruptions
- What it shows: Top workloads by disruption count
- Purpose: Tracks node-level disruption events
- What it shows: Node disruption events by type: node scaled down, taint eviction, node removed, and spot interruption
- Purpose: Monitors periods when workloads are unavailable
- What it shows: Frequency and duration of downtime across workloads
- Purpose: Tracks memory-related failures and OOM kills
- What it shows: Workloads that have been terminated due to memory exhaustion
- Purpose: Monitors CPU throttling due to resource limits
- What it shows: Workloads experiencing CPU throttling from resource constraints
- Purpose: Tracks health check failures across workloads
- What it shows: Workloads with failing liveness probes indicating health issues
- Purpose: Monitors ScaleOps auto-healing activities
- What it shows: Auto-healing and burst reaction events over time
- Purpose: Shows CPU request distribution across workload types
- What it shows: CPU requests breakdown by Deployment, StatefulSet, DaemonSet, etc.
- Purpose: Shows memory request distribution across workload types
- What it shows: Memory requests breakdown by Deployment, StatefulSet, DaemonSet, etc.
- Purpose: Measures total time for pods to become ready
- What it shows: End-to-end pod startup duration from creation to ready state
- Purpose: Tracks time spent in pod scheduling phase
- What it shows: Duration from pod creation to being scheduled on a node
- Purpose: Monitors container image pull duration
- What it shows: Time spent pulling container images during pod startup
- Purpose: Tracks init container execution time
- What it shows: Duration of init container runs before main containers start
- Purpose: Measures time from container running to pod ready
- What it shows: Application startup time after containers begin running
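The pod startup charts in this category split the time from pod creation to readiness into phases. As a rough illustration of how such a breakdown can be derived, the sketch below uses the standard Kubernetes pod condition types (`PodScheduled`, `Initialized`, `Ready`) and their `lastTransitionTime` values. How ScaleOps measures these phases is not described here, so the function name, the sample pod, and the phase boundaries are assumptions. Image pull time cannot be separated out from conditions alone, so it is not shown.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an RFC 3339 timestamp such as '2024-05-01T10:00:00Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def startup_breakdown(pod: dict) -> dict:
    """Approximate startup phases (in seconds) from pod conditions.

    Hypothetical helper: assumes each condition's lastTransitionTime is
    the moment the condition became True.
    """
    created = parse_ts(pod["metadata"]["creationTimestamp"])
    at = {
        c["type"]: parse_ts(c["lastTransitionTime"])
        for c in pod["status"]["conditions"]
        if c["status"] == "True"
    }
    return {
        # Creation until the scheduler bound the pod to a node.
        "scheduling_s": (at["PodScheduled"] - created).total_seconds(),
        # Scheduling until all init containers completed.
        "init_s": (at["Initialized"] - at["PodScheduled"]).total_seconds(),
        # Init completion until the pod passed its readiness checks.
        "initialized_to_ready_s": (at["Ready"] - at["Initialized"]).total_seconds(),
        # End-to-end: creation until ready.
        "total_s": (at["Ready"] - created).total_seconds(),
    }


# Illustrative pod object (made-up timestamps).
sample_pod = {
    "metadata": {"creationTimestamp": "2024-05-01T10:00:00Z"},
    "status": {
        "conditions": [
            {"type": "PodScheduled", "status": "True", "lastTransitionTime": "2024-05-01T10:00:02Z"},
            {"type": "Initialized", "status": "True", "lastTransitionTime": "2024-05-01T10:00:12Z"},
            {"type": "Ready", "status": "True", "lastTransitionTime": "2024-05-01T10:00:30Z"},
        ]
    },
}

breakdown = startup_breakdown(sample_pod)
```

A long `scheduling_s` usually points at capacity or constraint problems, while a long `initialized_to_ready_s` points at image pulls or application startup.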
Replicas
This category tracks HPA scaling events and pod count changes.
- Purpose: Monitors changes to HPA resource triggers
- What it shows: Events when HPA scaling thresholds or targets are modified
- Purpose: Tracks HPA-initiated scaling operations
- What it shows: Scale up and scale down events triggered by HPA
- Purpose: Shows total pod count over time
- What it shows: Number of running pods over time
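When reading HPA scaling events, it can help to recall how the Horizontal Pod Autoscaler picks a replica count. The Kubernetes documentation gives the core rule as `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, with scaling skipped when the ratio is within a tolerance (0.1 by default). The sketch below implements only that rule. It ignores `minReplicas`/`maxReplicas`, stabilization windows, and multiple metrics, and the function name is illustrative.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    tolerance: float = 0.1,
) -> int:
    """Core HPA scaling rule, simplified (no min/max or stabilization)."""
    ratio = current_metric / target_metric
    # Within tolerance of the target: keep the current replica count.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


scale_up = desired_replicas(4, current_metric=90, target_metric=60)
no_change = desired_replicas(10, current_metric=52, target_metric=50)
scale_down = desired_replicas(3, current_metric=30, target_metric=60)
```

With these inputs, four replicas at 90% utilization against a 60% target scale up, while a 4% deviation from the target stays within tolerance and triggers no event.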
Cost
This category helps identify resource waste and cost optimization opportunities.
- Purpose: Shows total allocatable CPU capacity across the cluster
- What it shows: The allocatable CPU over time
- Purpose: Shows total allocatable memory capacity across the cluster
- What it shows: The allocatable memory over time
- Purpose: Tracks increases in CPU requests over time
- What it shows: Workloads with growing CPU resource requirements
- Purpose: Tracks increases in memory requests over time
- What it shows: Workloads with growing memory resource requirements
- Purpose: Monitors replica count increases
- What it shows: Workloads scaling up their replica counts
- Purpose: Identifies workloads with excessive resource allocation
- What it shows: Monthly cost of over-provisioned resources by workload or namespace
- Purpose: Highlights the most costly workloads in your cluster
- What it shows: Top spenders by workload or namespace
- Purpose: Shows waste from workloads not following ScaleOps recommendations
- What it shows: Resources wasted due to ignoring optimization suggestions
- Purpose: Identifies workloads that ScaleOps cannot categorize
- What it shows: Workloads without proper labeling or identification
- Purpose: Quantifies CPU resources that are allocated but unused
- What it shows: CPU cores wasted due to over-provisioning
- Purpose: Quantifies memory resources that are allocated but unused
- What it shows: Memory wasted due to over-provisioning
- Purpose: Shows CPU waste from not following ScaleOps recommendations
- What it shows: Additional CPU costs from ignoring optimization advice
- Purpose: Shows memory waste from not following ScaleOps recommendations
- What it shows: Additional memory costs from ignoring optimization advice
- Purpose: Shows CPU overhead from init containers
- What it shows: Additional CPU requested by init containers vs main containers
- Purpose: Shows memory overhead from init containers
- What it shows: Additional memory requested by init containers vs main containers
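Several charts in this category express waste as resources that are requested but not used. The sketch below shows one simple way to reason about that figure for CPU: the gap between requests and observed usage, converted to a monthly cost. The exact formula and pricing ScaleOps uses are not stated in this page, so the class, the function names, the price per core-hour, and the 730-hour month are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class WorkloadSample:
    """Hypothetical per-workload figures, in CPU cores."""
    name: str
    cpu_request_cores: float
    cpu_usage_cores: float


def wasted_cpu_cores(w: WorkloadSample) -> float:
    """Requested-but-unused CPU; never negative for under-provisioned workloads."""
    return max(0.0, w.cpu_request_cores - w.cpu_usage_cores)


def monthly_waste_cost(
    w: WorkloadSample,
    price_per_core_hour: float,   # assumed price, not a ScaleOps value
    hours_per_month: float = 730.0,
) -> float:
    return wasted_cpu_cores(w) * price_per_core_hour * hours_per_month


workloads = [
    WorkloadSample("api", cpu_request_cores=2.0, cpu_usage_cores=0.5),
    WorkloadSample("worker", cpu_request_cores=0.5, cpu_usage_cores=1.0),
]

# Rank workloads by wasted cores, largest first.
ranked = sorted(workloads, key=wasted_cpu_cores, reverse=True)
api_cost = monthly_waste_cost(workloads[0], price_per_core_hour=0.04)
```

The `worker` workload uses more than it requests, so it contributes no waste here; it would instead show up in the under-provisioned charts of the Performance category.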
Nodes
This category provides insights into node-level performance and lifecycle management.
- Purpose: Shows CPU usage across all nodes in the cluster
- What it shows: Percentage of CPU utilization and allocation per node
- Purpose: Shows aggregated CPU utilization statistics across nodes
- What it shows: Average, p90, p99, and max CPU utilization across all nodes
- Purpose: Shows memory usage across all nodes in the cluster
- What it shows: Percentage of memory utilization and allocation per node
- Purpose: Shows aggregated memory utilization statistics across nodes
- What it shows: Average, p90, p99, and max memory utilization across all nodes
- Purpose: Shows ephemeral storage usage across nodes
- What it shows: Percentage of ephemeral storage utilized per node
- Purpose: Shows how CPU resources are allocated across nodes
- What it shows: CPU requests and limits distribution per node
- Purpose: Shows how memory resources are allocated across nodes
- What it shows: Memory requests and limits distribution per node
- Purpose: Shows ephemeral storage allocation across nodes
- What it shows: Ephemeral storage requests distribution per node
- Purpose: Identifies why nodes are not being scaled down
- What it shows: Reasons preventing node removal (e.g., pod disruption budgets, taints)
- Purpose: Shows CPU capacity blocked by various reasons
- What it shows: CPU resources unavailable due to taints, affinity rules, or other constraints
- Purpose: Shows memory capacity blocked by various reasons
- What it shows: Memory resources unavailable due to taints, affinity rules, or other constraints
- Purpose: Tracks node creation, termination, and lifecycle events
- What it shows: Node lifecycle events and their frequency
- Purpose: Shows the distribution of node instance types
- What it shows: Different instance types and their usage patterns
- Purpose: Monitors node health conditions
- What it shows: Node conditions like Ready, MemoryPressure, DiskPressure, PIDPressure
- Purpose: Shows pod distribution across nodes
- What it shows: Number of pods running on each node
- Purpose: Shows aggregated pod count statistics across nodes
- What it shows: Average, p90, p99, and max pod count across all nodes
- Purpose: Shows pod count relative to node CPU capacity
- What it shows: Pod density based on CPU allocation per node
- Purpose: Shows pod count relative to node memory capacity
- What it shows: Pod density based on memory allocation per node
- Purpose: Shows pod count relative to ephemeral storage capacity
- What it shows: Pod density based on ephemeral storage per node
- Purpose: Shows node distribution across node pools
- What it shows: Number of nodes in each node pool or node group
- Purpose: Tracks Karpenter consolidation activities
- What it shows: Reasons for Karpenter-initiated node consolidation events
- Purpose: Shows CPU allocation across Karpenter node pools
- What it shows: CPU distribution and capacity per Karpenter NodePool
- Purpose: Shows memory allocation across Karpenter node pools
- What it shows: Memory distribution and capacity per Karpenter NodePool
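The aggregated charts in this category summarize a per-node metric as average, p90, p99, and max. The sketch below shows what those summary values mean for a small set of per-node readings. It uses the nearest-rank percentile definition as an assumption; the percentile method ScaleOps applies is not specified here, and other methods (for example, linear interpolation) give slightly different results on small samples.

```python
def percentile_nearest_rank(values: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(values)
    # Integer form of ceil(p * n / 100), avoiding floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


def summarize(values: list[float]) -> dict:
    """Average, p90, p99, and max across nodes."""
    return {
        "avg": sum(values) / len(values),
        "p90": percentile_nearest_rank(values, 90),
        "p99": percentile_nearest_rank(values, 99),
        "max": max(values),
    }


# Illustrative CPU utilization (%) for ten nodes.
node_cpu_pct = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
summary = summarize(node_cpu_pct)
```

A large gap between the average and p99 suggests a few hot nodes rather than uniformly high load.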
ScaleOps Workloads
This category monitors the health and performance of ScaleOps components themselves.
- Purpose: Shows the version distribution of ScaleOps components
- What it shows: Different versions of ScaleOps workloads running in the cluster
- Purpose: Monitors CPU consumption by ScaleOps components
- What it shows: CPU usage patterns of ScaleOps workloads over time
- Purpose: Monitors memory consumption by ScaleOps components
- What it shows: Memory usage patterns of ScaleOps workloads over time
- Purpose: Shows CPU resource requests for ScaleOps components
- What it shows: CPU allocation for ScaleOps workloads
- Purpose: Shows memory resource requests for ScaleOps components
- What it shows: Memory allocation for ScaleOps workloads
- Purpose: Identifies problems with ScaleOps components
- What it shows: Issues affecting ScaleOps workload health and performance
- Purpose: Monitors Prometheus storage usage
- What it shows: Current storage consumption by the ScaleOps Prometheus instance
- Purpose: Shows Prometheus data retention configuration
- What it shows: Volume size and retention settings for metrics storage
Resource Quotas
This category monitors namespace-level resource constraints and limitations.
- Purpose: Shows CPU request quota usage across namespaces
- What it shows: Percentage of CPU request quota consumed per namespace
- Purpose: Shows memory request quota usage across namespaces
- What it shows: Percentage of memory request quota consumed per namespace
- Purpose: Shows CPU limit quota usage across namespaces
- What it shows: Percentage of CPU limit quota consumed per namespace
- Purpose: Shows memory limit quota usage across namespaces
- What it shows: Percentage of memory limit quota consumed per namespace
- Purpose: Shows pod count limitations across namespaces
- What it shows: Number of pods relative to namespace pod limits
- Purpose: Shows replica set limitations across namespaces
- What it shows: Number of replica sets relative to namespace limits
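The quota charts report consumption as a percentage of the namespace limit. A Kubernetes `ResourceQuota` object exposes both numbers in its status, under `hard` (the limit) and `used` (current consumption). The sketch below computes the percentage for CPU from a status dictionary. The helper names and the sample values are illustrative, and the parser handles only plain and millicore CPU quantities.

```python
def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity ('500m', '2', '1.5') to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def quota_usage_pct(quota_status: dict, resource: str) -> float:
    """Percentage of a CPU quota consumed, from ResourceQuota status."""
    hard = parse_cpu(quota_status["hard"][resource])
    used = parse_cpu(quota_status["used"][resource])
    if hard == 0:
        raise ValueError(f"quota for {resource} is zero")
    return 100.0 * used / hard


# Illustrative status block from a ResourceQuota in one namespace.
status = {
    "hard": {"requests.cpu": "4", "limits.cpu": "8"},
    "used": {"requests.cpu": "1500m", "limits.cpu": "6"},
}

requests_pct = quota_usage_pct(status, "requests.cpu")
limits_pct = quota_usage_pct(status, "limits.cpu")
```

Namespaces that approach 100% will start rejecting new pods, so these percentages are worth checking when workloads fail to schedule.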
Pressure Stall (PSI)
This category monitors resource contention and wait times at the node level using Linux Pressure Stall Information.
- Purpose: Monitors CPU pressure stall information
- What it shows: Percentage of time processes are waiting for CPU resources
- Purpose: Monitors memory pressure stall information
- What it shows: Percentage of time processes are waiting for memory resources
- Purpose: Monitors I/O pressure stall information
- What it shows: Percentage of time processes are waiting for disk I/O operations
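For background, the Linux kernel exposes PSI in `/proc/pressure/cpu`, `/proc/pressure/memory`, and `/proc/pressure/io`. Each file contains a `some` line (at least one task was stalled) and usually a `full` line (all non-idle tasks were stalled at once), with averages over 10, 60, and 300 seconds as percentages and a cumulative stall time in microseconds. The sketch below parses that text format from a sample string. How ScaleOps collects these values is not described here, so treat this only as an aid for reading the raw data on a node.

```python
def parse_psi(text: str) -> dict:
    """Parse the contents of a /proc/pressure/* file."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()          # kind is 'some' or 'full'
        values = dict(f.split("=", 1) for f in fields)
        result[kind] = {
            "avg10": float(values["avg10"]),    # % of last 10 s stalled
            "avg60": float(values["avg60"]),    # % of last 60 s stalled
            "avg300": float(values["avg300"]),  # % of last 300 s stalled
            "total_us": int(values["total"]),   # cumulative stall time, microseconds
        }
    return result


# Sample content in the kernel's format (made-up numbers).
SAMPLE = """\
some avg10=1.50 avg60=0.75 avg300=0.20 total=123456
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
"""

psi = parse_psi(SAMPLE)
```

On a node, the same function can be applied to the result of reading `/proc/pressure/memory` or the other PSI files, provided the kernel was built with PSI enabled.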
Node I/O
This category tracks network and disk I/O metrics across nodes.
- Purpose: Shows network bandwidth usage per node
- What it shows: Incoming and outgoing network traffic rates
- Purpose: Shows aggregated network throughput statistics
- What it shows: Average, p90, p99, and max network throughput across all nodes
- Purpose: Monitors network packet drops
- What it shows: Number of dropped packets per node indicating network issues
- Purpose: Shows disk read/write bandwidth per node
- What it shows: Disk throughput rates for read and write operations
- Purpose: Shows aggregated disk throughput statistics
- What it shows: Average, p90, p99, and max disk throughput across all nodes
- Purpose: Shows disk I/O operations per second per node
- What it shows: Read and write IOPS for each node
- Purpose: Shows aggregated disk IOPS statistics
- What it shows: Average, p90, p99, and max IOPS across all nodes
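Throughput and IOPS are rates, and on Linux the underlying network and disk figures are exposed as cumulative counters (for example, total bytes received). A rate is then the counter difference between two samples divided by the elapsed time. The sketch below shows that calculation, including a simple guard for counter resets after a reboot. It is a general illustration; the function name is hypothetical and the way ScaleOps derives these charts is an assumption.

```python
def counter_rate(earlier: int, later: int, seconds: float) -> float:
    """Per-second rate between two samples of a cumulative counter."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    delta = later - earlier
    if delta < 0:
        # Counter went backwards: assume it reset to zero between samples,
        # so only the value accumulated since the reset is counted.
        delta = later
    return delta / seconds


# Bytes received on an interface, sampled 30 seconds apart.
rx_bytes_per_s = counter_rate(earlier=1_000, later=4_000, seconds=30)

# Counter reset between samples (for example, after a node reboot).
after_reset = counter_rate(earlier=5_000, later=600, seconds=60)
```

Shorter sampling intervals show bursts more clearly, while longer intervals smooth them out, which is worth keeping in mind when changing the selected time range.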
How to Use the Troubleshooting Page
Getting Started
- Select Time Range: Choose an appropriate time period for analysis
- Choose Dashboard: Select from predefined dashboards or create custom ones
- Filter Data: Use namespace, label, and annotation filters to focus on specific workloads
- Select Charts: Choose relevant graphs from the chart selector based on your investigation needs

Exporting and Sharing
- Copy Data: Click on graph elements to copy specific data points
- Save Dashboards: Create and save custom dashboard configurations
- Duplicate Dashboards: Create variations of dashboards for different use cases
Support
If you encounter issues with the Troubleshooting page or need help interpreting the data, contact your ScaleOps support team. The page is designed to give you visibility into cluster health so that you can make informed decisions about optimization and resource management.