GPU Troubleshooting
On a GPU workload, open the Troubleshoot tab. Add charts from the GPU categories and related dashboards; they use the same time range, filters, and layouts as general troubleshooting.
GPU dashboard and GPU chart category
The GPU dashboard, together with additional graphs from the GPU chart category, shows GPU metrics tied to pods, workloads, and nodes:
- GPU compute utilization
- GPU memory utilization
- Memory bandwidth utilization - HBM/DDR load (memory-bound vs compute-bound)
- SM active (average and per-device) - streaming-multiprocessor activity; highlights under- or over-use and imbalance across GPUs
- Temperature - GPU (and optionally GPU memory) thermals; throttling and cooling
- Power usage - electrical draw; ties to cost, caps, and performance
- Encoder / decoder utilization - NVENC/NVDEC for inference/video workloads
- BAR1 memory - PCIe BAR1 mapping; driver / MIG / mapping edge cases
Why it helps - Tells you what kind of GPU issue you have - heat, power, memory pressure, uneven GPUs, or simply “not busy enough” - so you’re not guessing when you dig in.
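The triage described above can be sketched as a small set of threshold checks. This is only an illustration: the `GpuSnapshot` shape, the function names, and every threshold (85 °C, 90%, 20%, 95% of the power cap, a 30-point SM spread) are assumptions chosen for the example, not values taken from the dashboard or from any vendor guidance.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class GpuSnapshot:
    """One point-in-time reading for a single GPU (hypothetical shape)."""
    compute_util_pct: float   # GPU compute utilization, 0-100
    memory_util_pct: float    # GPU memory utilization, 0-100
    temperature_c: float      # GPU temperature in Celsius
    power_w: float            # current power draw in watts
    power_cap_w: float        # configured power cap in watts


def classify(s: GpuSnapshot, *, hot_c: float = 85.0, high_pct: float = 90.0,
             idle_pct: float = 20.0, cap_fraction: float = 0.95) -> List[str]:
    """Return the kinds of GPU issue suggested by one snapshot.

    All thresholds are illustrative assumptions; tune them per GPU model.
    """
    findings: List[str] = []
    if s.temperature_c >= hot_c:
        findings.append("heat")
    if s.power_w >= cap_fraction * s.power_cap_w:
        findings.append("power")
    if s.memory_util_pct >= high_pct:
        findings.append("memory pressure")
    if s.compute_util_pct <= idle_pct:
        findings.append("not busy enough")
    return findings


def is_imbalanced(sm_active_pct: Sequence[float], *,
                  max_spread_pct: float = 30.0) -> bool:
    """Flag uneven GPUs: large gap between the busiest and idlest device."""
    return max(sm_active_pct) - min(sm_active_pct) > max_spread_pct
```

For example, a snapshot with 10% compute utilization and 95% memory utilization would be reported as both memory pressure and not busy enough, which points at memory rather than compute as the first thing to inspect.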
Inference Server dashboard and GPU - Inference Server chart category
Application-level serving metrics (often HTTP/gRPC or framework metrics), aggregated and per pod where available:
- Request latency - average end-to-end latency; sometimes a breakdown (queue vs model vs other)
- Throughput - requests per second (average, total, per pod)
- Request counts - total and per-pod volume
Why it helps - Shows what callers actually experience: slowness, spikes, and traffic. Pair it with GPU charts - if latency is bad but the GPU is idle, look outside the GPU; if the GPU is maxed out, you probably need more capacity or bigger hardware.
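The pairing rule above (latency read alongside GPU utilization) can be written as a short decision function. This is a minimal sketch: the function name, the 500 ms latency target, and the 30% / 90% utilization cut-offs are hypothetical and should be replaced with values that fit your service.

```python
def triage(latency_ms: float, gpu_util_pct: float, *,
           latency_target_ms: float = 500.0,
           idle_pct: float = 30.0,
           saturated_pct: float = 90.0) -> str:
    """Combine request latency with GPU utilization into a next step.

    Thresholds are illustrative assumptions, not product defaults.
    """
    if latency_ms <= latency_target_ms:
        return "ok"
    if gpu_util_pct >= saturated_pct:
        # Slow and the GPU is maxed out: capacity is the likely limit.
        return "add capacity or use bigger hardware"
    if gpu_util_pct <= idle_pct:
        # Slow but the GPU is idle: the bottleneck is elsewhere.
        return "look outside the GPU"
    # Slow with a moderately busy GPU: the two charts alone do not decide.
    return "inconclusive"
```

The middle band is deliberately left as inconclusive; in that case the framework-specific charts (vLLM or Triton) are usually the next place to look.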
GPU - vLLM chart category
vLLM-specific metrics, often shown as averages, per-pod values, or sums:
- Prefix cache hit rate - reuse of cached prefixes
- Time to first token (TTFT) - perceived startup latency
- Prefill / decode time - prompt processing vs token generation
- Prompt vs generation tokens - input/output token volumes
- Running vs waiting requests - scheduler pressure and queuing
Why it helps - Shows where time goes (first token vs generation) and whether requests are piling up - so you can tune vLLM with a clear picture instead of trial and error.
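As a rough illustration of reading these charts together, the sketch below derives two ratios from values you might read off them: how much of end-to-end latency is spent before the first token, and what share of in-flight requests is waiting. The function and parameter names are hypothetical and do not correspond to vLLM metric names.

```python
from typing import Dict


def summarize(ttft_s: float, e2e_s: float,
              running: int, waiting: int) -> Dict[str, float]:
    """Derive two simple ratios from chart readings (illustrative only).

    first_token_share: fraction of end-to-end latency spent before the
        first token; a high value points at prefill or queuing.
    waiting_share: fraction of in-flight requests that are queued;
        a high value means requests are piling up.
    """
    in_flight = running + waiting
    return {
        "first_token_share": ttft_s / e2e_s if e2e_s > 0 else 0.0,
        "waiting_share": waiting / in_flight if in_flight > 0 else 0.0,
    }
```

With a TTFT of 0.5 s out of a 2.0 s request, a quarter of the time goes to the first token and the rest to generation, so tuning decode is likely to matter more than tuning prefill.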
GPU - Triton chart category
Triton Inference Server metrics:
- Queue size - backlog at the model/server
- Batch size - dynamic batching behavior
- Failures - inference errors
- Cache hit rate - server-side caching effectiveness
Why it helps - Queues and errors show up here even when the GPU graph looks fine - the server can be stuck while the chip looks healthy.
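Server metrics of this kind are often exposed as cumulative counters, so charted rates come from the difference between two scrapes. The sketch below assumes that model and computes a failure rate and an average batch size over one window; the `Counters` fields are hypothetical names for the example, not Triton metric names.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Counters:
    """Cumulative counters at one scrape (hypothetical field names)."""
    successes: int    # requests that completed successfully
    failures: int     # requests that failed
    inferences: int   # individual inferences (a batch of n counts n)
    executions: int   # batched model executions


def window_stats(prev: Counters, curr: Counters) -> Dict[str, float]:
    """Failure rate and average batch size between two scrapes."""
    ok = curr.successes - prev.successes
    bad = curr.failures - prev.failures
    inferences = curr.inferences - prev.inferences
    executions = curr.executions - prev.executions
    total = ok + bad
    return {
        # Share of requests in the window that failed.
        "failure_rate": bad / total if total > 0 else 0.0,
        # Inferences per execution; near 1.0 means little batching occurred.
        "avg_batch_size": inferences / executions if executions > 0 else 0.0,
    }
```

A rising failure rate or a batch size stuck near 1 can appear here while GPU utilization looks normal, which is the situation the section above describes.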