GPU Utilization Optimization: The Complete Guide for AI Teams
The GPU shortage narrative dominates headlines. Hyperscalers scramble for H100 allocations. Startups wait months for cluster access. But here is the uncomfortable truth most organizations overlook: you probably already have enough GPUs. You are just wasting them.
We have seen this pattern firsthand. Our founding team managed GPU infrastructure and ML platforms at Amazon and saw the same pain at companies like Meta and Microsoft. The utilization problem is consistent across organizations of every size, and it is solvable.
This guide is a practitioner's walkthrough of measuring, diagnosing, and fixing GPU utilization. We cover quick wins you can implement this week, architectural changes that pay off over months, and a phased framework for getting from 40% to 85%+ utilization.
What Is GPU Utilization and Why Does It Matter?
GPU utilization measures the percentage of time a GPU's streaming multiprocessors are actively executing work, but that single number hides critical nuance that leads teams to wrong conclusions.
There are three distinct dimensions of GPU usage, and conflating them leads to bad decisions. SM utilization reports whether the GPU's compute cores are active. Memory utilization tracks how much VRAM is allocated. Compute throughput measures the actual useful work being done, including tensor core activity and floating-point operations per second. A GPU can show 90% SM utilization while its tensor cores sit completely idle, which means the hardware is busy but not productive. The NVIDIA DCGM documentation details these metric distinctions.
The gap between "GPU is allocated" and "GPU is doing useful work" is where most waste lives. A researcher reserves 8 GPUs for a training job, but the actual training loop only saturates 4 of them. The other 4 show as "in use" in your scheduler but contribute nothing.
The financial impact scales quickly. At $2-3/hour per H100, every 10% utilization improvement across a 500-GPU fleet saves $876K-$1.3M per year. For larger fleets, the waste scales accordingly. This is why 53% of enterprises cite cost control as their primary AI infrastructure challenge.
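The arithmetic behind that range is straightforward. As a quick sanity check, assuming 24/7 operation at the quoted hourly rates:

```python
# Annual savings from a utilization improvement across a fleet,
# assuming 24/7 operation at a fixed per-GPU hourly rate.
def annual_savings(fleet_size, hourly_rate, utilization_gain):
    hours_per_year = 24 * 365  # 8,760
    return fleet_size * hours_per_year * hourly_rate * utilization_gain

low = annual_savings(500, 2.00, 0.10)   # 500 H100s at $2/hr, +10% utilization
high = annual_savings(500, 3.00, 0.10)  # same fleet at $3/hr
print(f"${low:,.0f} - ${high:,.0f} per year")  # $876,000 - $1,314,000 per year
```

Plug in your own fleet size and effective hourly rate (including power and facilities for on-prem) to size the opportunity before investing in optimization work.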
| Utilization Level | Classification | What It Means |
|---|---|---|
| 0-30% | Critical waste | GPUs allocated but mostly idle. Likely static reservations with no active workloads. |
| 30-50% | Underutilized | Some workloads running, but significant idle periods between jobs or within jobs. |
| 50-70% | Typical | Industry average. Room for improvement through scheduling and workload optimization. |
| 70-85% | Good | Active workload management in place. Some idle capacity reserved for burst. |
| 85-95% | Optimized | Near-optimal. Requires dynamic scheduling, preemption, and topology-aware placement. |
Why Are GPUs Underutilized? Root Causes
Understanding why GPUs sit idle is the first step toward fixing the problem. In our experience across hundreds of GPU environments, four root causes account for the vast majority of wasted capacity.
Static Allocation and Team Silos
The most common waste pattern is static GPU allocation. Teams reserve fixed GPU quotas, whether or not they are actively using them. The ML research team holds 64 GPUs around the clock, but their training jobs only run during business hours. On nights and weekends, those GPUs sit idle. No other team can touch them.
This is not a technology problem. It is an organizational one. Teams hoard GPUs because they do not trust the scheduler to give them capacity when they need it. The result is a tragedy of the commons where everyone reserves more than they need, and aggregate utilization suffers.
Data Pipeline Bottlenecks
A GPU can only train as fast as it receives data. When CPU-based preprocessing, slow storage I/O, or insufficient data prefetching creates a bottleneck, the GPU spends cycles waiting instead of computing. This shows up as low SM utilization despite the GPU being "allocated" to an active job.
The telltale sign is a sawtooth utilization pattern: brief spikes of compute activity followed by idle periods while the next batch of data loads. Fixing this requires profiling the entire data pipeline, not just the model.
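The sawtooth pattern is easy to detect programmatically from a utilization time series. A minimal sketch, with illustrative thresholds (the sampling interval, idle cutoff, and minimum gap length are assumptions you should tune to your own trace):

```python
# Detect the sawtooth pattern: find idle gaps in a GPU utilization
# time series (one sample per fixed interval). Thresholds illustrative.
def find_idle_gaps(samples, idle_threshold=10, min_gap_len=3):
    """Return (start, end) index pairs where utilization stayed below
    idle_threshold for at least min_gap_len consecutive samples."""
    gaps, start = [], None
    for i, util in enumerate(samples):
        if util < idle_threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_gap_len:
                gaps.append((start, i))
            start = None
    if start is not None and len(samples) - start >= min_gap_len:
        gaps.append((start, len(samples)))
    return gaps

# Spiky trace: bursts of compute separated by data-loading stalls.
trace = [95, 92, 5, 4, 3, 6, 90, 94, 2, 3, 5, 91]
print(find_idle_gaps(trace))  # [(2, 6), (8, 11)]
```

Frequent short gaps point to per-batch data loading stalls; long gaps between jobs point to scheduling inefficiency instead.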
Scheduling Inefficiency
Default Kubernetes scheduling treats GPUs as binary resources. A GPU is either allocated or not. There is no concept of preemption, backfill scheduling, or fair-share allocation. Most clusters run simple FIFO queues where jobs execute in submission order.
FIFO scheduling creates a specific problem called head-of-line blocking. When a large job at the head of the queue requests 16 nodes but only 10 are currently available, it blocks every job behind it from starting, even if those smaller jobs could run on the available capacity right now. A single large job request can idle dozens of GPUs while smaller jobs queue unnecessarily.
Without preemption, lower-priority workloads (notebooks, hyperparameter sweeps) occupy GPUs that higher-priority production training jobs need. Urgent jobs wait while non-urgent work runs uninterrupted. We will dive deeper into scheduling tools and frameworks in a future blog post.
Workload Mismatches
Running inference on hardware provisioned for training wastes capacity. Deploying a 7B parameter model on an 80GB H100 leaves most of the GPU's memory and compute unused. Conversely, cramming a large model onto insufficient hardware causes out-of-memory errors and restarts, which waste even more time.
Workload-hardware matching requires understanding both the resource profile of each job and the capabilities of each GPU in your fleet. Without fleet-wide visibility, teams default to requesting the most powerful GPU available, regardless of whether they need it.
How to Measure GPU Utilization Accurately
Accurate measurement is the foundation of every utilization improvement, yet most teams get this wrong and make bad optimization decisions as a result.
The most common mistake is relying on nvidia-smi output. The nvidia-smi utility reports SM utilization, which tells you whether the GPU's compute cores are active. But "active" does not mean "productive." A poorly optimized kernel can keep SMs busy while achieving minimal throughput. The NVIDIA DCGM documentation provides the granular metrics you actually need.
DCGM (Data Center GPU Manager) exposes dozens of metrics at the hardware level. The ones that matter most for utilization optimization are:
| Metric | What It Measures | Target Range |
|---|---|---|
| DCGM_FI_DEV_GPU_UTIL | SM activity percentage | 70-95% during active training |
| DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Tensor core utilization | 50-80% for mixed precision workloads |
| DCGM_FI_DEV_FB_USED | Framebuffer (VRAM) memory used | 70-90% of available memory |
| DCGM_FI_DEV_POWER_USAGE | Power draw in watts | Near TDP during active compute |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature | Below throttling threshold (83C for H100) |
The combination of these metrics tells a more complete story than any single number. High SM utilization with low tensor core activity suggests the workload is not using mixed precision effectively. High memory usage with low compute utilization points to a data loading bottleneck.
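Those diagnostic rules can be encoded as a first-pass triage function. The thresholds below are illustrative, not official guidance; calibrate them against your own fleet's baselines:

```python
# Rough triage from combined GPU metrics. Thresholds are illustrative
# and should be calibrated against your own fleet's baselines.
def diagnose(sm_util, tensor_active, mem_used_frac):
    """sm_util and tensor_active in percent; mem_used_frac in [0, 1]."""
    if sm_util >= 70 and tensor_active < 20:
        return "SMs busy, tensor cores idle: check mixed precision usage"
    if mem_used_frac >= 0.7 and sm_util < 40:
        return "memory full, compute idle: likely data loading bottleneck"
    if sm_util < 30 and mem_used_frac < 0.3:
        return "GPU allocated but mostly idle: reclaim or reschedule"
    return "no single obvious bottleneck: profile the workload"

print(diagnose(sm_util=90, tensor_active=5, mem_used_frac=0.8))
# SMs busy, tensor cores idle: check mixed precision usage
```

Running a rule like this over weekly metric aggregates turns raw dashboards into a ranked list of workloads worth profiling first.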
For Kubernetes environments, the standard production stack is DCGM Exporter feeding into Prometheus, visualized through Grafana. We cover the full setup in How to Monitor GPU Utilization in Kubernetes. Chamber's monitoring agent deploys across clusters and cloud providers to provide a unified view without requiring you to stitch together multiple Prometheus instances.
GPU Utilization Optimization Strategies
Once you have accurate measurements and understand your root causes, these are the strategies that deliver the largest utilization improvements. We order them by typical impact, starting with the changes that move the needle most.
Dynamic Resource Allocation and Fair-Share Scheduling
Replacing static GPU allocation with fair-share scheduling is the single highest-impact change most organizations can make. Fair-share scheduling gives each team a guaranteed minimum allocation (their SLA), but automatically lends idle capacity to other teams. When the owning team needs their GPUs back, preemptive queuing reclaims them automatically.
This approach is well-validated. THEMIS (NSDI 2020) demonstrated that finish-time fair scheduling achieves near-optimal cluster utilization while maintaining fairness guarantees across teams. The key insight is that fairness and utilization are not in tension. Static allocation sacrifices both.
Preemptive queuing is the mechanism that makes fair-share work in practice. Elastic workloads (hyperparameter sweeps, notebook sessions, non-urgent experiments) yield their GPUs when production training jobs need capacity. Once the higher-priority job completes, preempted jobs auto-resume from their last checkpoint. This eliminates the "all or nothing" dynamic that drives GPU hoarding.
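The mechanics of checkpoint-based preemption can be sketched in a few lines. This toy model (class and step counts are hypothetical, not Chamber's implementation) shows why checkpoint frequency bounds the work lost to a preemption:

```python
# Toy preemptive queue participant: an elastic job yields its GPU to
# a reserved job and later resumes from its last checkpoint.
class ElasticJob:
    def __init__(self, name, total_steps, checkpoint_every):
        self.name = name
        self.total_steps = total_steps
        self.checkpoint_every = checkpoint_every
        self.step = 0
        self.checkpoint = 0

    def run(self, steps):
        """Advance up to `steps` steps, checkpointing along the way."""
        for _ in range(steps):
            if self.step >= self.total_steps:
                break
            self.step += 1
            if self.step % self.checkpoint_every == 0:
                self.checkpoint = self.step

    def preempt(self):
        """Yield the GPU; progress past the last checkpoint is lost."""
        lost = self.step - self.checkpoint
        self.step = self.checkpoint
        return lost

job = ElasticJob("sweep-42", total_steps=100, checkpoint_every=10)
job.run(37)            # a reserved job arrives mid-run...
lost = job.preempt()   # ...7 steps past the last checkpoint are lost
job.run(100)           # resumes from step 30 and finishes
print(lost, job.step)  # 7 100
```

The worst-case loss per preemption is `checkpoint_every - 1` steps, which is why elastic workloads should checkpoint more aggressively than reserved ones.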
At Chamber, we implement fair-share scheduling with cross-cluster, cross-cloud awareness. Because our control plane has visibility across your entire fleet, idle capacity on one cluster can serve demand on another. Open-source schedulers like Volcano or Kueue operate within a single cluster boundary, which limits the pool of reclaimable capacity.
Topology-Aware Job Placement
For distributed training jobs, where GPUs are placed relative to each other matters as much as how many GPUs are allocated. GPUs connected via NVLink within the same node communicate roughly an order of magnitude faster than GPUs communicating across the network fabric.
Research presented at SC '17 demonstrated that topology-aware placement yields a 60% throughput improvement and 70% training time reduction for distributed training workloads. These gains come from reduced communication overhead during all-reduce operations, the gradient synchronization step that dominates multi-GPU training.
Topology-aware placement matters most for multi-node training with heavy inter-GPU communication. If your workloads are primarily single-GPU inference or independent hyperparameter sweeps, the impact is minimal. But for large model training (anything above a few billion parameters), topology-naive scheduling leaves significant performance on the table.
Mixed Precision Training and Batch Tuning
Mixed precision training uses FP16 or BF16 for most operations while maintaining FP32 for numerically sensitive computations like loss scaling. Micikevicius et al. (ICLR 2018) showed that mixed precision delivers approximately 2x memory reduction and 2-8x throughput increase with negligible impact on model accuracy.
The memory savings from mixed precision unlock a secondary optimization: larger batch sizes. By halving the per-sample memory footprint, you can double the batch size, which improves GPU utilization by 20-30% through better SM saturation. For workloads where memory constraints prevent increasing batch size directly, gradient accumulation achieves the same effective batch size by accumulating gradients across multiple forward passes before updating weights.
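The reason gradient accumulation works is that for a mean loss, averaging the gradients of equal-sized micro-batches reproduces the full-batch gradient exactly. A pure-Python illustration with a 1-D least-squares loss (the data and weight are arbitrary):

```python
# Gradient accumulation: averaging per-micro-batch gradients of a
# mean loss reproduces the full-batch gradient, so the effective
# batch size grows without extra activation memory.
def grad(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over the batch."""
    n = len(xs)
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / n

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

full = grad(w, xs, ys)  # one big batch of 4

# Two micro-batches of 2; average their gradients before the update.
accum = (grad(w, xs[:2], ys[:2]) + grad(w, xs[2:], ys[2:])) / 2

print(abs(full - accum) < 1e-12)  # True
```

In PyTorch the same effect comes from calling `loss.backward()` on each micro-batch (scaling the loss by the number of accumulation steps) and only calling `optimizer.step()` after the final one.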
Most modern frameworks (PyTorch AMP, TensorFlow mixed precision) make this a configuration change rather than a code rewrite. If your training jobs are still running in full FP32, this is the easiest win available.
Data Pipeline Optimization
When GPU utilization drops between training iterations, the bottleneck is almost always the data pipeline. Three techniques address this.
Prefetching and parallel data loading. PyTorch's DataLoader with num_workers > 0 and prefetch_factor set appropriately keeps the next batch ready before the GPU finishes the current one. The goal is zero gap between training steps.
GPU-direct storage. For I/O-bound workloads, GPU-direct storage (GDS) bypasses the CPU entirely, streaming data from NVMe storage directly into GPU memory. This removes the CPU as a bottleneck for data-intensive workloads.
Dataset caching. Frequently accessed datasets should live on fast local storage (NVMe) rather than network-attached storage. Caching the preprocessed version of your dataset eliminates redundant CPU work across training runs.
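The prefetching idea, in particular, boils down to a bounded queue with a background loader thread. This is a toy stand-in for PyTorch's DataLoader, with the queue depth playing the role of `prefetch_factor` and the training step simulated by a trivial computation:

```python
# Minimal prefetching sketch: a background thread loads the next
# batch while the consumer (the "GPU") processes the current one.
import queue
import threading

def loader(batches, out_q):
    for batch in batches:
        out_q.put(batch)   # blocks when the queue is full (backpressure)
    out_q.put(None)        # sentinel: no more data

def train(batches, prefetch=2):
    q = queue.Queue(maxsize=prefetch)
    threading.Thread(target=loader, args=(batches, q), daemon=True).start()
    processed = []
    while (batch := q.get()) is not None:
        processed.append(batch * 10)  # stand-in for the training step
    return processed

print(train([1, 2, 3, 4]))  # [10, 20, 30, 40]
```

In the real DataLoader, setting `num_workers > 0` moves decoding and augmentation into separate processes so the main process never blocks on CPU preprocessing.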
GPU Sharing for Lightweight Workloads
Not every workload needs a full GPU. Inference serving, Jupyter notebooks, interactive sessions, and small-scale experiments often consume only a fraction of a GPU's compute and memory. NVIDIA provides three mechanisms for sharing a single GPU across multiple workloads.
Multi-Instance GPU (MIG) partitions an A100 or H100 into up to seven isolated instances, each with dedicated compute and memory. Multi-Process Service (MPS) allows multiple CUDA processes to share a GPU's compute resources cooperatively. Time-slicing rapidly switches between workloads on a single GPU. Each approach has different isolation, performance, and compatibility tradeoffs. We cover them in detail in MIG vs MPS vs Time-Slicing: GPU Sharing Compared.
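For MIG specifically, the placement decision often reduces to picking the smallest profile whose memory fits the workload. A sketch using the A100 80GB profile lineup; treat the names and sizes as illustrative and check the supported profiles for your GPU model:

```python
# Pick the smallest MIG profile whose memory fits a workload.
# Profile names/sizes follow the A100 80GB lineup; treat them as
# illustrative and verify against your GPU's supported profiles.
PROFILES = [  # (name, memory_gb), smallest first
    ("1g.10gb", 10), ("2g.20gb", 20), ("3g.40gb", 40),
    ("4g.40gb", 40), ("7g.80gb", 80),
]

def smallest_fit(required_gb):
    for name, mem in PROFILES:
        if mem >= required_gb:
            return name
    return None  # workload needs more than one full GPU

print(smallest_fit(14))  # 2g.20gb
print(smallest_fit(70))  # 7g.80gb
```

Right-sizing inference replicas this way is what lets seven small model servers share one GPU instead of each idling a full card.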
What KPIs Should You Track for GPU Utilization?
The right KPIs separate organizations that sustain utilization improvements from those that regress after initial gains. These five metrics give GPU operations teams the clearest signal.
| KPI | Definition | Target | How to Measure |
|---|---|---|---|
| Fleet-wide utilization % | Average SM utilization across all allocated GPUs | 70-85% sustained | DCGM Exporter + Prometheus, aggregated weekly |
| Job queue wait time | Time between job submission and job start | < 10 minutes for elastic jobs | Scheduler logs or orchestration platform metrics |
| GPU idle time between jobs | Gap between one job completing and the next starting | < 2 minutes | DCGM utilization time series, detect zero-activity gaps |
| Cost per useful GPU-hour | Total GPU cost divided by hours of productive compute | Decreasing trend quarter-over-quarter | Combine billing data with utilization metrics |
| Job throughput per week | Number of completed training jobs per week | Increasing trend | Scheduler completion logs |
The critical principle: track trends, not snapshots. A single point-in-time utilization reading tells you nothing. Weekly averages, broken down by team and workload type, reveal the patterns that drive optimization decisions. Chamber helps provide this granular level of visibility so that you can truly understand your utilization, cost, and attribution across teams, workloads, and even users.
Getting Started: The Chamber Four-Phase Framework
Jumping straight to advanced scheduling without understanding your current state leads to wasted effort and misconfigured systems. We recommend a phased approach that builds on measurement before making changes.
Phase 1: Measure (Weeks 1-2)
Deploy Chamber agent in your Kubernetes clusters and establish utilization baselines. You need at least two weeks of data to capture the full pattern of your workload mix, including weekly cycles and end-of-sprint bursts.
Chamber's observability agent deploys to any Kubernetes cluster via Helm in minutes, on any cloud provider or on-prem, automatically discovering workloads and GPU resources without requiring changes to your existing jobs. Start with monitoring to understand your GPU landscape before making scheduler changes.
Expected outcome: Clear picture of fleet-wide utilization, identification of worst-performing clusters and teams.
Phase 2: Classify (Weeks 3-4)
Categorize every recurring workload as either reserved (production training, SLA-bound or otherwise critical) or elastic (experiments, notebooks, hyperparameter sweeps, and other workloads that can tolerate preemption). This classification determines which workloads must never be preempted and which can yield capacity. The trade-off: reserved jobs get guaranteed access to resources but may wait longer to start, while elastic jobs can run in greater parallelism on idle capacity, with the caveat of possible preemption. In other words, you make significantly more progress overall without trading away the resources your critical jobs need.
Expected outcome: Workload inventory with priority classifications. Typically reveals that 60-70% of GPU-hours go to elastic workloads that could tolerate preemption.
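A first pass at this classification can be rule-based before anyone reviews it by hand. The labels, job fields, and rules below are illustrative assumptions, not a Chamber API:

```python
# First-pass workload classification: rule-based tagging of recurring
# jobs as reserved or elastic. Fields and rules are illustrative.
def classify(job):
    if job.get("sla") or job.get("production"):
        return "reserved"
    if job.get("kind") in {"notebook", "sweep", "experiment"}:
        return "elastic"
    return "review"  # ambiguous: needs a human decision

jobs = [
    {"name": "llm-pretrain", "production": True},
    {"name": "hpo-sweep-12", "kind": "sweep"},
    {"name": "adhoc-eval"},
]
print([classify(j) for j in jobs])  # ['reserved', 'elastic', 'review']
```

The "review" bucket is the useful output: it surfaces the workloads whose owners need to make an explicit preemption decision rather than inheriting a default.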
Phase 3: Centralize and Allocate Resources in Chamber (Weeks 5-8)
Move from walled-off allocation silos to allocating resources through Teams in the Chamber interface. This is the organizational change that enables fair-share scheduling and optimal resource usage. Teams keep their guaranteed minimums, but idle capacity becomes available for elastic workloads from other teams. Each team can set its own budget or tolerance for how much idle capacity it is willing to burst into.
Expected outcome: 15-25% utilization improvement from reclaiming idle reserved capacity.
Phase 4: Optimize (Weeks 9-12+)
After running workloads through Chamber's central scheduler, begin the next layer of optimization by reviewing workload-level metrics such as GPU utilization, memory utilization, and power draw. These metrics expose further optimizations, such as inefficient data loading steps and opportunities to right-size workloads onto less expensive hardware.
Expected outcome: Additional 15-25% utilization improvement. Total improvement from the Phase 1 baseline: 30-50%.
To understand how much GPU spend you could reclaim by increasing usage, see our Chamber ROI Calculator.
Frequently Asked Questions
What is a good GPU utilization percentage for AI training?
85-95% is achievable with proper scheduling and workload management. Most organizations operate at 40-60%. The gap is caused by static allocation, scheduling inefficiency, and data pipeline bottlenecks, all of which are solvable without purchasing additional hardware. Chamber customers typically reach 85%+ within 12 weeks of deploying centralized scheduling.
How do you measure GPU utilization in Kubernetes?
DCGM Exporter with Prometheus and Grafana is the standard production stack. DCGM provides granular metrics including SM utilization, tensor core activity, memory usage, and power draw. For multi-cluster visibility, you need either a federated Prometheus setup or a unified control plane that aggregates metrics across clusters. Chamber applies GPU observability best practices out of the box, removing the effort of setting up and maintaining your own Prometheus and Grafana deployments.
Can you improve GPU utilization without buying new hardware?
Yes. Fair-share scheduling, preemptive queuing, and mixed precision training can improve utilization by 30-50% without purchasing additional GPUs. The ClearML 2025 survey found that 35% of enterprises rank GPU utilization as a top 12-month priority, precisely because the ROI from optimization outperforms the ROI from procurement.
What is the difference between GPU utilization and GPU memory utilization?
GPU utilization measures compute usage. Memory utilization measures VRAM allocation. Both must be tracked because they tell different stories. A GPU can show 95% memory utilization (the model fills VRAM) with only 30% compute utilization (the training loop is bottlenecked on data loading). Optimizing one without monitoring the other leaves you without the full picture and limits your insight into how to improve performance and reduce cost.
Key Takeaways
- Most organizations waste 40-60% of their GPU capacity due to static allocation and scheduling inefficiency.
- 75% of organizations report GPU utilization below 70% at peak, according to ClearML's 2025 survey.
- Topology-aware placement yields 60% throughput improvement and 70% training time reduction for distributed training.
- 53% of enterprises cite cost control as their primary AI infrastructure challenge, making utilization optimization a budget priority.
- 85-95% utilization is achievable with centralized scheduling, fair-share allocation, and topology-aware placement.
- Start with measurement, then classify workloads, centralize scheduling, and optimize. Phased adoption reduces risk and delivers incremental ROI at every step.
The Bottom Line
You do not need more GPUs. You need to use the ones you have. The data is clear: the majority of GPU waste comes from organizational and scheduling problems, not hardware limitations. Static allocation, FIFO queuing, and team silos create artificial scarcity in environments that have sufficient capacity.
The path from 40% to 85%+ utilization is well-understood. Measure accurately with DCGM. Classify workloads by priority. Centralize scheduling to reclaim idle capacity. Optimize with fair-share allocation, preemption, and topology-aware placement. Each phase delivers measurable ROI before you move to the next.
Chamber provides real-time GPU utilization visibility across your entire fleet, spanning multiple clusters and cloud providers through a single control plane. Start with monitoring to understand your GPU landscape before making scheduler changes. See how it works.
Sources
- ClearML. "The State of AI Infrastructure at Scale 2025-2026." 2025. https://go.clear.ml/the-state-of-ai-infrastructure-at-scale-2024
- ClearML. "New ClearML Report Reveals Cost and Governance Concerns Dominate." 2025. https://www.morningstar.com/news/accesswire/1118009msn/new-clearml-report-reveals-cost-and-governance-concerns-dominate-as-nearly-half-of-enterprises-waste-millions-on-underutilized-gpu-capacity
- Micikevicius, P. et al. "Mixed Precision Training." ICLR 2018. https://arxiv.org/abs/1710.03740
- NVIDIA. "DCGM User Guide: Feature Overview." 2024. https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html
- Mahajan, K. et al. "THEMIS: Fair and Efficient GPU Cluster Scheduling." NSDI 2020. https://wisr.cs.wisc.edu/papers/nsdi20-themis.pdf
- Amaral, M. et al. "Topology-Aware GPU Scheduling for Learning Workloads in Cloud Environments." SC '17, ACM. https://dl.acm.org/doi/10.1145/3126908.3126933