Troubleshoot

Overview

The ScaleOps Troubleshooting page is a dashboard for identifying, diagnosing, and resolving issues in your Kubernetes clusters. It provides real-time visibility into performance bottlenecks, resource waste, automation events, and system health metrics across your clusters.

The troubleshooting page provides the following features:

  • Interactive Dashboard: Customizable layout with draggable and resizable graphs
  • Time Range Selection: Analyze data over different time periods
  • Multi-Cluster Support: View metrics across multiple clusters simultaneously
  • Filtering Options: Filter by namespaces, labels, and annotations
  • Export Capabilities: Copy graph data and create custom dashboards
  • Real-time Updates: Live data refresh for current cluster status

Pre-Built Dashboards

The troubleshooting page provides the following pre-built dashboards:

  • Performance: Identifies performance issues and resource constraints affecting your workloads.
  • Overall Costs: Identifies resource waste and costs across the cluster.
  • ScaleOps Health: Monitors the health and performance of ScaleOps components.
  • Advanced Performance: Provides detailed node-level performance metrics, including CPU/memory stalling, network throughput, disk I/O, and PSI graphs.

Chart Categories

The charts are organized into the following categories:

  • Optimization: Focuses on ScaleOps automation activities and optimization events.
  • Performance: Identifies performance issues and resource constraints affecting your workloads.
  • Replicas: Tracks HPA scaling events and pod count changes.
  • Cost: Identifies resource waste and cost optimization opportunities.
  • Nodes: Provides insights into node-level performance and lifecycle management.
  • ScaleOps Workloads: Monitors the health and performance of ScaleOps components themselves.
  • Resource Quotas: Monitors namespace-level resource constraints and limitations.
  • Pressure Stall (PSI): Monitors resource contention and wait times at the node level.
  • Node I/O: Tracks network and disk I/O metrics across nodes.

Optimization

This category focuses on ScaleOps automation activities and optimization events.

Automation Events
  • Purpose: Tracks all automation events performed by ScaleOps
  • What it shows: Number and frequency of automated actions taken by the system
Optimized Pods
  • Purpose: Shows pods that have been optimized by ScaleOps
  • What it shows: Count of pods with automation and total number of pods
Automated Workloads
  • Purpose: Displays workloads currently under ScaleOps automation
  • What it shows: Number of workloads with automation and total number of workloads
Downscaled Workloads
  • Purpose: Tracks workloads that have been downscaled
  • What it shows: Workloads where resources have been reduced based on actual usage

Performance

This category identifies performance issues and resource constraints affecting your workloads.

CPU Under Provisioned on Stressed Nodes
  • Purpose: Identifies workloads with insufficient CPU resources on high-utilization nodes
  • What it shows: Workloads experiencing CPU pressure due to resource constraints
Memory Under Provisioned on Stressed Nodes
  • Purpose: Identifies workloads with insufficient memory resources on high-utilization nodes
  • What it shows: Workloads experiencing memory pressure due to resource constraints
CPU Under Provisioned
  • Purpose: Shows workloads with insufficient CPU requests across all nodes
  • What it shows: Workloads that consistently need more CPU than allocated
Memory Under Provisioned
  • Purpose: Shows workloads with insufficient memory requests across all nodes
  • What it shows: Workloads that consistently need more memory than allocated
Workload Disruptions
  • Purpose: Tracks various types of workload disruptions and evictions
  • What it shows: Disruptions caused by Karpenter, Cluster Autoscaler, VPA, and other tools
Most Disruptive Workloads
  • Purpose: Identifies workloads with the most disruptions
  • What it shows: Top workloads by disruption count
Node Disruptions
  • Purpose: Tracks node-level disruption events
  • What it shows: Node disruption events by type: node scaled down, taint eviction, node removed, and spot interruption
Downtime Events
  • Purpose: Monitors periods when workloads are unavailable
  • What it shows: Frequency and duration of downtime across workloads
Out of Memory Events
  • Purpose: Tracks memory-related failures and OOM kills
  • What it shows: Workloads that have been terminated due to memory exhaustion
CPU Throttling Events
  • Purpose: Monitors CPU throttling due to resource limits
  • What it shows: Workloads experiencing CPU throttling from resource constraints
Liveness Probe Failures
  • Purpose: Tracks health check failures across workloads
  • What it shows: Workloads with failing liveness probes indicating health issues
ScaleOps Healing Statuses
  • Purpose: Monitors ScaleOps auto-healing activities
  • What it shows: Auto-healing and burst reaction events over time
CPU Requests by Workload Type
  • Purpose: Shows CPU request distribution across workload types
  • What it shows: CPU requests breakdown by Deployment, StatefulSet, DaemonSet, etc.
Memory Requests by Workload Type
  • Purpose: Shows memory request distribution across workload types
  • What it shows: Memory requests breakdown by Deployment, StatefulSet, DaemonSet, etc.
Workload Pod Time to Ready
  • Purpose: Measures total time for pods to become ready
  • What it shows: End-to-end pod startup duration from creation to ready state
Workload Pod Scheduling Time
  • Purpose: Tracks time spent in pod scheduling phase
  • What it shows: Duration from pod creation to being scheduled on a node
Workload Pod Image Pulling Time
  • Purpose: Monitors container image pull duration
  • What it shows: Time spent pulling container images during pod startup
Workload Pod Init Container Runtime
  • Purpose: Tracks init container execution time
  • What it shows: Duration of init container runs before main containers start
Workload Pod Running to Ready
  • Purpose: Measures time from container running to pod ready
  • What it shows: Application startup time after containers begin running
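
The pod startup charts above split one timeline into phases. A minimal sketch of that breakdown, assuming you already have the four timestamps for a pod (the `PodTimeline` type and its field names are hypothetical and not part of ScaleOps):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PodTimeline:
    """Hypothetical container for one pod's startup timestamps."""
    created: datetime    # pod object created
    scheduled: datetime  # pod assigned to a node
    running: datetime    # main containers started
    ready: datetime      # pod reported ready


def startup_breakdown(t: PodTimeline) -> dict[str, timedelta]:
    """Split the startup timeline into the phases named by the charts."""
    return {
        "scheduling_time": t.scheduled - t.created,
        "running_to_ready": t.ready - t.running,
        "time_to_ready": t.ready - t.created,
    }


base = datetime(2024, 1, 1, 12, 0, 0)
timeline = PodTimeline(
    created=base,
    scheduled=base + timedelta(seconds=2),
    running=base + timedelta(seconds=10),
    ready=base + timedelta(seconds=25),
)
for phase, duration in startup_breakdown(timeline).items():
    print(f"{phase}: {duration.total_seconds():.0f}s")
```

A long scheduling phase usually points at cluster capacity or placement constraints, while a long running-to-ready phase points at the application itself.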

Replicas

This category tracks HPA scaling events and pod count changes.

HPA Resource Trigger Change Events
  • Purpose: Monitors changes to HPA resource triggers
  • What it shows: Events when HPA scaling thresholds or targets are modified
HPA Scale Events
  • Purpose: Tracks HPA-initiated scaling operations
  • What it shows: Scale up and scale down events triggered by HPA
Pod Count
  • Purpose: Shows total pod count over time
  • What it shows: Number of running pods over time
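
When reading HPA scale events, it can help to know how the Kubernetes Horizontal Pod Autoscaler derives a replica count. The sketch below follows the formula in the Kubernetes documentation, desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with the default 10% tolerance. It deliberately ignores min/max replica bounds and stabilization windows, and it is not a description of how ScaleOps renders these charts.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA calculation: scale in proportion to metric/target."""
    ratio = current_metric / target_metric
    # Within the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# 4 replicas at 90% average CPU against a 60% target
print(desired_replicas(4, 90, 60))
```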

Cost

This category helps identify resource waste and cost optimization opportunities.

CPU Allocatable
  • Purpose: Shows total allocatable CPU capacity across the cluster
  • What it shows: The allocatable CPU over time
Memory Allocatable
  • Purpose: Shows total allocatable memory capacity across the cluster
  • What it shows: The allocatable memory over time
CPU Request Increase
  • Purpose: Tracks increases in CPU requests over time
  • What it shows: Workloads with growing CPU resource requirements
Memory Request Increase
  • Purpose: Tracks increases in memory requests over time
  • What it shows: Workloads with growing memory resource requirements
Replicas Increase
  • Purpose: Monitors replica count increases
  • What it shows: Workloads scaling up their replica counts
Wasteful
  • Purpose: Identifies workloads with excessive resource allocation
  • What it shows: Monthly cost of over-provisioned resources by workload or namespace
Expensive
  • Purpose: Highlights the most costly workloads in your cluster
  • What it shows: Top spenders by workload or namespace
Smart Policy Waste
  • Purpose: Shows waste from workloads not following ScaleOps recommendations
  • What it shows: Resources wasted due to ignoring optimization suggestions
Custom Workloads Unidentified
  • Purpose: Identifies workloads that ScaleOps cannot categorize
  • What it shows: Workloads without proper labeling or identification
Wasted CPU
  • Purpose: Quantifies CPU resources that are allocated but unused
  • What it shows: CPU cores wasted due to over-provisioning
Wasted Memory
  • Purpose: Quantifies memory resources that are allocated but unused
  • What it shows: Memory wasted due to over-provisioning
CPU Wasted due to Unsuggested Policies
  • Purpose: Shows CPU waste from not following ScaleOps recommendations
  • What it shows: Additional CPU costs from ignoring optimization advice
Memory Wasted due to Unsuggested Policies
  • Purpose: Shows memory waste from not following ScaleOps recommendations
  • What it shows: Additional memory costs from ignoring optimization advice
Init Container CPU Request Overhead
  • Purpose: Shows CPU overhead from init containers
  • What it shows: Additional CPU requested by init containers vs main containers
Init Container Memory Request Overhead
  • Purpose: Shows memory overhead from init containers
  • What it shows: Additional memory requested by init containers vs main containers
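
The Wasted CPU and Wasted Memory charts quantify resources that are allocated but unused. A rough sketch of that idea, assuming waste is the gap between a workload's request and its observed usage (the exact formula ScaleOps applies is not documented here, and the sample workloads are made up for illustration):

```python
def wasted(requested: float, used: float) -> float:
    """Requested-but-unused amount; never negative when usage exceeds the request."""
    return max(0.0, requested - used)


# Hypothetical workloads: (namespace, CPU cores requested, CPU cores used)
workloads = [
    ("payments", 2.0, 0.5),
    ("payments", 1.0, 1.25),
    ("search", 4.0, 1.0),
]

# Sum wasted cores per namespace.
waste_by_namespace: dict[str, float] = {}
for namespace, requested, used in workloads:
    waste_by_namespace[namespace] = (
        waste_by_namespace.get(namespace, 0.0) + wasted(requested, used)
    )

print(waste_by_namespace)
```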

Nodes

This category provides insights into node-level performance and lifecycle management.

Node CPU Utilization
  • Purpose: Shows CPU usage across all nodes in the cluster
  • What it shows: Percentage of CPU utilization and allocation per node
Node CPU Utilization (Aggregated)
  • Purpose: Shows aggregated CPU utilization statistics across nodes
  • What it shows: Average, p90, p99, and max CPU utilization across all nodes
Node Memory Utilization
  • Purpose: Shows memory usage across all nodes in the cluster
  • What it shows: Percentage of memory utilization and allocation per node
Node Memory Utilization (Aggregated)
  • Purpose: Shows aggregated memory utilization statistics across nodes
  • What it shows: Average, p90, p99, and max memory utilization across all nodes
Node Ephemeral Storage Utilization
  • Purpose: Shows ephemeral storage usage across nodes
  • What it shows: Percentage of ephemeral storage utilized per node
Node CPU Allocation
  • Purpose: Shows how CPU resources are allocated across nodes
  • What it shows: CPU requests and limits distribution per node
Node Memory Allocation
  • Purpose: Shows how memory resources are allocated across nodes
  • What it shows: Memory requests and limits distribution per node
Node Ephemeral Storage Allocation
  • Purpose: Shows ephemeral storage allocation across nodes
  • What it shows: Ephemeral storage requests distribution per node
Node not Scaling Down Reason
  • Purpose: Identifies why nodes are not being scaled down
  • What it shows: Reasons preventing node removal (e.g., pod disruption budgets, taints)
Node Allocatable CPU Blocked by Reason
  • Purpose: Shows CPU capacity blocked by various reasons
  • What it shows: CPU resources unavailable due to taints, affinity rules, or other constraints
Node Allocatable Memory Blocked by Reason
  • Purpose: Shows memory capacity blocked by various reasons
  • What it shows: Memory resources unavailable due to taints, affinity rules, or other constraints
Node Life Cycle
  • Purpose: Tracks node creation, termination, and lifecycle events
  • What it shows: Node lifecycle events and their frequency
Node Instance Type
  • Purpose: Shows the distribution of node instance types
  • What it shows: Different instance types and their usage patterns
Node Conditions
  • Purpose: Monitors node health conditions
  • What it shows: Node conditions like Ready, MemoryPressure, DiskPressure, PIDPressure
Pods per Node
  • Purpose: Shows pod distribution across nodes
  • What it shows: Number of pods running on each node
Pods per Node (Aggregated)
  • Purpose: Shows aggregated pod count statistics across nodes
  • What it shows: Average, p90, p99, and max pod count across all nodes
Pods per Node (CPU Size)
  • Purpose: Shows pod count relative to node CPU capacity
  • What it shows: Pod density based on CPU allocation per node
Pods per Node (Memory Size)
  • Purpose: Shows pod count relative to node memory capacity
  • What it shows: Pod density based on memory allocation per node
Pods per Node (Ephemeral Storage Size)
  • Purpose: Shows pod count relative to ephemeral storage capacity
  • What it shows: Pod density based on ephemeral storage per node
Nodes per Node Pool
  • Purpose: Shows node distribution across node pools
  • What it shows: Number of nodes in each node pool or node group
Karpenter Node Consolidation Reasons
  • Purpose: Tracks Karpenter consolidation activities
  • What it shows: Reasons for Karpenter-initiated node consolidation events
Karpenter Node Pool CPU Allocation
  • Purpose: Shows CPU allocation across Karpenter node pools
  • What it shows: CPU distribution and capacity per Karpenter NodePool
Karpenter Node Pool Memory Allocation
  • Purpose: Shows memory allocation across Karpenter node pools
  • What it shows: Memory distribution and capacity per Karpenter NodePool

ScaleOps Workloads

This category monitors the health and performance of ScaleOps components themselves.

Version
  • Purpose: Shows the version distribution of ScaleOps components
  • What it shows: Different versions of ScaleOps workloads running in the cluster
ScaleOps CPU Usage
  • Purpose: Monitors CPU consumption by ScaleOps components
  • What it shows: CPU usage patterns of ScaleOps workloads over time
ScaleOps Memory Usage
  • Purpose: Monitors memory consumption by ScaleOps components
  • What it shows: Memory usage patterns of ScaleOps workloads over time
ScaleOps CPU Requests
  • Purpose: Shows CPU resource requests for ScaleOps components
  • What it shows: CPU allocation for ScaleOps workloads
ScaleOps Memory Requests
  • Purpose: Shows memory resource requests for ScaleOps components
  • What it shows: Memory allocation for ScaleOps workloads
ScaleOps Workloads Issues
  • Purpose: Identifies problems with ScaleOps components
  • What it shows: Issues affecting ScaleOps workload health and performance
ScaleOps Prometheus Volume
  • Purpose: Monitors Prometheus storage usage
  • What it shows: Current storage consumption by ScaleOps Prometheus instance
ScaleOps Prometheus Retention
  • Purpose: Shows Prometheus data retention configuration
  • What it shows: Volume size and retention settings for metrics storage

Resource Quotas

This category monitors namespace-level resource constraints and limitations.

Namespace Limitation by CPU Requests
  • Purpose: Shows CPU request quota usage across namespaces
  • What it shows: Percentage of CPU request quota consumed per namespace
Namespace Limitation by Memory Requests
  • Purpose: Shows memory request quota usage across namespaces
  • What it shows: Percentage of memory request quota consumed per namespace
Namespace Limitation by CPU Limits
  • Purpose: Shows CPU limit quota usage across namespaces
  • What it shows: Percentage of CPU limit quota consumed per namespace
Namespace Limitation by Memory Limits
  • Purpose: Shows memory limit quota usage across namespaces
  • What it shows: Percentage of memory limit quota consumed per namespace
Namespace Limitation by Pods
  • Purpose: Shows pod count limitations across namespaces
  • What it shows: Number of pods relative to namespace pod limits
Namespace Limitation by Replica Sets
  • Purpose: Shows replica set limitations across namespaces
  • What it shows: Number of replica sets relative to namespace limits
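
The percentage shown by the quota charts can be reproduced by hand from a namespace's ResourceQuota, which reports both `used` and `hard` values. The sketch below assumes CPU quantities in standard Kubernetes notation (`"1500m"` or `"4"`); the helper names are illustrative only.

```python
def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity such as '1500m' or '4' to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def quota_usage_percent(used: float, hard: float) -> float:
    """Share of the quota consumed, as a percentage."""
    if hard <= 0:
        raise ValueError("quota limit must be positive")
    return 100.0 * used / hard


# Example values as they would appear under status.used and status.hard
used_cpu = parse_cpu("1500m")
hard_cpu = parse_cpu("4")
print(f"{quota_usage_percent(used_cpu, hard_cpu):.1f}% of CPU request quota used")
```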

Pressure Stall (PSI)

This category monitors resource contention and wait times at the node level using Linux Pressure Stall Information.

Node CPU Wait Time (%)
  • Purpose: Monitors CPU pressure stall information
  • What it shows: Percentage of time processes are waiting for CPU resources
Node Memory Wait Time (%)
  • Purpose: Monitors memory pressure stall information
  • What it shows: Percentage of time processes are waiting for memory resources
Node Disk I/O Wait Time (%)
  • Purpose: Monitors I/O pressure stall information
  • What it shows: Percentage of time processes are waiting for disk I/O operations
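
On a Linux node, the kernel exposes PSI data in `/proc/pressure/cpu`, `/proc/pressure/memory`, and `/proc/pressure/io`. Each line carries running averages over 10, 60, and 300 seconds (as percentages) and a cumulative stall time in microseconds. The sketch below parses one such line. Which of these fields ScaleOps plots is an assumption left open here.

```python
def parse_psi_line(line: str) -> dict:
    """Parse one line of a /proc/pressure/* file.

    Example input: 'some avg10=1.25 avg60=0.80 avg300=0.40 total=123456'
    """
    kind, *fields = line.split()
    values = dict(field.split("=", 1) for field in fields)
    return {
        "kind": kind,                        # 'some' or 'full'
        "avg10": float(values["avg10"]),     # % of time stalled, last 10 s
        "avg60": float(values["avg60"]),     # % of time stalled, last 60 s
        "avg300": float(values["avg300"]),   # % of time stalled, last 300 s
        "total_us": int(values["total"]),    # cumulative stall time in microseconds
    }


sample = "some avg10=1.25 avg60=0.80 avg300=0.40 total=123456"
psi = parse_psi_line(sample)
print(f"{psi['kind']}: {psi['avg10']}% stalled over the last 10 seconds")
```

The `some` line counts time in which at least one task was stalled, and the `full` line counts time in which all non-idle tasks were stalled at once.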

Node I/O

This category tracks network and disk I/O metrics across nodes.

Node Network Throughput
  • Purpose: Shows network bandwidth usage per node
  • What it shows: Incoming and outgoing network traffic rates
Node Network Throughput (Aggregated)
  • Purpose: Shows aggregated network throughput statistics
  • What it shows: Average, p90, p99, and max network throughput across all nodes
Total Network Dropped Packets per Node
  • Purpose: Monitors network packet drops
  • What it shows: Number of dropped packets per node indicating network issues
Node Disk Throughput
  • Purpose: Shows disk read/write bandwidth per node
  • What it shows: Disk throughput rates for read and write operations
Node Disk Throughput (Aggregated)
  • Purpose: Shows aggregated disk throughput statistics
  • What it shows: Average, p90, p99, and max disk throughput across all nodes
Node Disk IOPS
  • Purpose: Shows disk I/O operations per second per node
  • What it shows: Read and write IOPS for each node
Node Disk IOPS (Aggregated)
  • Purpose: Shows aggregated disk IOPS statistics
  • What it shows: Average, p90, p99, and max IOPS across all nodes
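
The aggregated charts in this and the Nodes category summarize per-node values as average, p90, p99, and max. A minimal sketch of that summary using the nearest-rank percentile definition (the percentile method the dashboard uses is not stated, so treat this choice as an assumption):

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def aggregate(values: list[float]) -> dict[str, float]:
    """Summarize per-node samples the way the aggregated charts label them."""
    return {
        "avg": sum(values) / len(values),
        "p90": percentile(values, 90),
        "p99": percentile(values, 99),
        "max": max(values),
    }


# Hypothetical IOPS readings, one per node
per_node_iops = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(aggregate(per_node_iops))
```

With few nodes, p99 and max often coincide, so the percentile lines become more informative as the node count grows.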

How to Use the Troubleshooting Page

Getting Started

  1. Select Time Range: Choose an appropriate time period for analysis
  2. Choose Dashboard: Select from predefined dashboards or create custom ones
  3. Filter Data: Use namespace, label, and annotation filters to focus on specific workloads
  4. Select Charts: Choose relevant graphs from the chart selector based on your investigation needs

Exporting and Sharing

  • Copy Data: Click on graph elements to copy specific data points
  • Save Dashboards: Create and save custom dashboard configurations
  • Duplicate Dashboards: Create variations of dashboards for different use cases

Support

If you encounter issues with the troubleshooting page or need help interpreting the data, contact your ScaleOps support team.