GPU Troubleshooting

On a GPU workload, open the Troubleshoot tab. Add charts from the GPU categories and related dashboards. They use the same time range, filters, and layouts as general troubleshooting.

GPU dashboard and GPU chart category

The GPU dashboard, together with additional graphs from the GPU chart category, shows GPU metrics tied to pods, workloads, and nodes:

GPU compute utilization
GPU memory utilization
Memory bandwidth utilization - HBM/GDDR load; helps tell memory-bound work from compute-bound work
SM active (average and per-device) - streaming-multiprocessor activity; spots under/over-use and imbalance across GPUs
Temperature - GPU (and optionally GPU memory) thermals; throttling and cooling
Power usage - electrical draw; ties to cost, caps, and performance
Encoder / decoder utilization - NVENC/NVDEC for inference/video workloads
BAR1 memory - PCIe BAR1 mapping; driver / MIG / mapping edge cases

Why it helps - Tells you what kind of GPU issue you have - heat, power, memory pressure, uneven GPUs, or simply “not busy enough” - so you’re not guessing when you dig in.
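
As a rough illustration of how these charts map to the issue types above, the sketch below labels a set of per-GPU readings. The field names and every threshold (85 C, 95% of the power cap, 90% memory, a 40-point SM spread, 20% average SM activity) are placeholder assumptions for the example, not product defaults or vendor limits.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class GpuSample:
    """One reading per GPU. Field names are illustrative, not a product schema."""
    sm_active_pct: float   # streaming-multiprocessor activity, 0-100
    mem_used_pct: float    # GPU memory utilization, 0-100
    temperature_c: float   # GPU temperature in Celsius
    power_w: float         # current power draw in watts
    power_cap_w: float     # configured power limit in watts


def classify(samples: list[GpuSample]) -> list[str]:
    """Return coarse symptom labels. All thresholds are assumed placeholders."""
    if not samples:
        return []
    findings = []
    sm = [s.sm_active_pct for s in samples]
    if max(s.temperature_c for s in samples) >= 85:
        findings.append("heat")
    if any(s.power_w >= 0.95 * s.power_cap_w for s in samples):
        findings.append("power")
    if max(s.mem_used_pct for s in samples) >= 90:
        findings.append("memory pressure")
    if max(sm) - min(sm) >= 40:
        findings.append("uneven GPUs")
    if mean(sm) < 20:
        findings.append("not busy enough")
    return findings


# Two GPUs on the same workload, one busy and one nearly idle.
print(classify([GpuSample(95, 50, 70, 200, 400), GpuSample(10, 50, 70, 200, 400)]))
```

Tune the thresholds to your hardware and workload; the point is only that each label corresponds to one chart in the list above.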

Inference Server dashboard and GPU - Inference Server chart category

Application-level serving metrics (often HTTP/gRPC or framework metrics), aggregated and per pod where available:

Request latency - average end-to-end latency; sometimes a breakdown (queue vs model vs other)
Throughput - requests per second (average, total, per pod)
Request counts - total and per-pod volume

Why it helps - Shows what callers actually experience: slowness, spikes, and traffic. Pair it with GPU charts - if latency is bad but the GPU is idle, look outside the GPU; if the GPU is maxed out, you probably need more capacity or bigger hardware.
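
The pairing described above can be written down as a small decision rule. This is a minimal sketch: the function name, the latency target, and the 30% / 90% utilization cutoffs are assumptions chosen for the example, and real workloads will need their own values.

```python
def triage(avg_latency_ms: float, latency_target_ms: float, gpu_util_pct: float,
           idle_below: float = 30.0, saturated_above: float = 90.0) -> str:
    """Combine serving latency with GPU utilization. Cutoffs are assumed defaults."""
    if avg_latency_ms <= latency_target_ms:
        return "ok"
    if gpu_util_pct < idle_below:
        # Slow responses with an idle GPU point away from the GPU itself.
        return "look outside the GPU"
    if gpu_util_pct >= saturated_above:
        # Slow responses with a saturated GPU suggest a capacity limit.
        return "add capacity or larger hardware"
    return "inconclusive"


print(triage(avg_latency_ms=850, latency_target_ms=300, gpu_util_pct=12))
```

An "inconclusive" result is a prompt to look at the more specific vLLM or Triton charts rather than a diagnosis.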

GPU - vLLM chart category

vLLM-specific metrics, often shown as averages, per-pod values, or sums:

Prefix cache hit rate - reuse of cached prefixes
Time to first token (TTFT) - perceived startup latency
Prefill / decode time - prompt processing vs token generation
Prompt vs generation tokens - input/output token volumes
Running vs waiting requests - scheduler pressure and queuing

Why it helps - Shows where time goes (first token vs generation) and whether requests are piling up - so you can tune vLLM with a clear picture instead of trial and error.
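
To make "where time goes" concrete, the sketch below splits a request's time into a first-token share and a generation share, and flags a backlog when more requests are waiting than running. The function name, the input names, and the waiting-versus-running rule are assumptions for illustration; they are not vLLM metric names or a vLLM recommendation.

```python
def summarize_vllm(ttft_s: float, decode_s: float, running: int, waiting: int) -> dict:
    """Summarize averaged chart values read over the same time range."""
    total = ttft_s + decode_s
    first_token_share = ttft_s / total if total > 0 else 0.0
    return {
        # Fraction of the measured time spent before the first token.
        "first_token_share": first_token_share,
        "dominant_phase": "first token" if first_token_share > 0.5 else "generation",
        # Assumed heuristic: more waiting than running means requests are piling up.
        "requests_piling_up": waiting > running,
    }


print(summarize_vllm(ttft_s=3.0, decode_s=1.0, running=4, waiting=10))
```

A high first-token share points at prompt processing and queuing, while a high generation share points at decode throughput.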

GPU - Triton chart category

Triton Inference Server metrics:

Queue size - backlog at the model/server
Batch size - dynamic batching behavior
Failures - inference errors
Cache hit rate - server-side caching effectiveness

Why it helps - Queues and errors show up here even when the GPU graph looks fine - the server can be stuck while the chip looks healthy.
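
A simple way to act on this is to check the queue trend and the failure rate independently of GPU utilization. The sketch below assumes you have a short series of queue-size readings plus the change in failure and request counts over the same window; the function name and the 1% failure-rate limit are hypothetical.

```python
def triton_findings(queue_sizes: list[int], failures_delta: int, requests_delta: int,
                    max_failure_rate: float = 0.01) -> list[str]:
    """Flag server-side trouble from queue and failure readings. Limit is assumed."""
    findings = []
    # "Growing" here means the queue never shrinks across the window and ends higher.
    growing = (
        len(queue_sizes) >= 2
        and all(b >= a for a, b in zip(queue_sizes, queue_sizes[1:]))
        and queue_sizes[-1] > queue_sizes[0]
    )
    if growing:
        findings.append("queue growing")
    if requests_delta > 0 and failures_delta / requests_delta > max_failure_rate:
        findings.append("failures elevated")
    return findings


print(triton_findings([2, 5, 9, 14], failures_delta=0, requests_delta=1000))
```

Neither check looks at the GPU, so both can fire while the GPU charts look normal.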