Understanding GPU Performance
Recall that executing a function on a GPU in a scientific code typically involves three main steps. First, you copy the input data from CPU memory to GPU memory. Second, the GPU loads and executes the kernel. Third, you copy the results from GPU memory back to CPU memory. Effective GPU utilization requires minimizing data transfer between the CPU and GPU while maintaining a transfer rate high enough to keep the GPU busy with intensive computations. The algorithm itself must also be amenable to GPU acceleration.
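The three steps can be sketched in PyTorch (one common framework on GPU clusters; this is an illustration, not the only way to do it). The sketch falls back to the CPU when no GPU is visible so it runs anywhere:

```python
import torch

# Sketch of the three steps, assuming PyTorch. Falls back to the CPU
# when no GPU is present so the example still runs as written.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(1024, 1024)  # input data in CPU memory
x_dev = x.to(device)         # step 1: copy input to GPU memory
y_dev = x_dev @ x_dev        # step 2: the kernel executes on the device
y = y_dev.cpu()              # step 3: copy the result back to CPU memory
```

Note that steps 1 and 3 are pure data movement; if they dominate the runtime, the GPU sits idle, which is the utilization problem discussed below.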
When the GPU is underutilized, the reason is often that data is not being sent to it fast enough. In some cases this is due to hardware limitations such as slow interconnects, while in others it is due to poorly written CPU code or users not taking advantage of the data loading and transfer functionality of their software.
Strategies for Better Performance
If you are experiencing poor GPU utilization, there are several approaches you can try. Start by consulting the documentation or user community for your specific software. In some cases, just making a single change in your input file or configuration can lead to excellent performance. The developers of the software you're using have likely encountered similar issues and may have specific recommendations.
If you are running a deep learning framework such as PyTorch or TensorFlow, try using the specialized data loading classes and functions these frameworks provide. PyTorch has DataLoader with multiple worker processes, while TensorFlow offers tf.data with prefetching capabilities. These tools are specifically designed to keep the GPU fed with data while it processes the previous batch.
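As an illustration, here is a minimal PyTorch DataLoader configured with worker processes (the dataset and parameter values are placeholders, not a recommendation for any particular workload):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative only: worker processes prepare the next batches on the CPU
# while the GPU is busy with the current one.
dataset = TensorDataset(torch.randn(256, 8), torch.randint(0, 2, (256,)))
loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,                         # parallel CPU loading workers
    pin_memory=torch.cuda.is_available(),  # faster host-to-GPU copies
)

for features, labels in loader:
    pass  # training step goes here; move the batch to the GPU with .to("cuda")
```

The `num_workers` value should not exceed the number of CPU cores you requested in your Slurm script, a point revisited below.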
Another parameter worth experimenting with is the batch size. Larger batch sizes can improve GPU utilization by providing more work per GPU kernel launch, though you should verify this doesn't negatively affect your model's performance metrics such as accuracy or RMSE. Keep in mind that our H100 GPUs on Discovery have 80 GB of memory. If you exceed this limit, you will encounter a CUDA out of memory error which will cause your code to crash.
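A back-of-the-envelope estimate can tell you whether a candidate batch size is even plausible before you submit a job. The numbers below are illustrative (float32 inputs only); real memory use also includes model weights, activations, gradients, and optimizer state:

```python
# Rough check that a batch of inputs fits within the 80 GB of an H100.
# Illustrative numbers; actual usage is larger (weights, gradients, etc.).
GPU_MEMORY_GB = 80

def batch_memory_gb(batch_size, floats_per_sample, bytes_per_float=4):
    """Memory needed just to hold one batch of float32 inputs, in GB."""
    return batch_size * floats_per_sample * bytes_per_float / 1024**3

# e.g. 512 images of 3 x 224 x 224 float32 values
per_sample = 3 * 224 * 224
print(batch_memory_gb(512, per_sample))  # prints 0.287109375
```

Here the raw batch is tiny relative to 80 GB, so the practical limit would come from the model and its intermediate activations rather than the inputs themselves.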
If you have tried these approaches and are still unable to reach an acceptable level of GPU utilization, it may be that your particular workload is not well-suited for GPU acceleration. In such cases, please run your jobs on CPU nodes where they may actually complete faster and more efficiently.
Zero GPU Utilization: Common Causes
When users see 0% GPU utilization in their jobstats report, it typically falls into one of three categories.
The first and most common issue is that the code is not GPU-enabled. Only codes that have been explicitly written to use GPUs can take advantage of them. Many scientific applications come in both CPU-only and GPU-enabled versions, and you need to ensure you're using the GPU version. Please consult the documentation for your software and check that you've loaded the appropriate GPU-enabled module on Discovery. If your code is not GPU-enabled, please remove the --gres directive from your Slurm script when submitting jobs.
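For reference, here is a hypothetical Slurm script showing the line in question (the job name, module, and executable are placeholders; substitute your own):

```shell
#!/bin/bash
#SBATCH --job-name=myjob       # hypothetical job name
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=01:00:00
##SBATCH --gres=gpu:1          # remove or comment out this line for CPU-only codes

module load mysoftware         # hypothetical module name
srun mycode                    # hypothetical executable
```

With the --gres line removed, the job is scheduled on CPU resources only and no GPU sits idle under your allocation.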
The second common issue is improper software environment configuration. In some cases, certain libraries must be available for your code to run on GPUs. The solution might be to load a specific environment module or to install a particular software dependency. If your code uses CUDA, make sure you have CUDA Toolkit 11 or higher loaded. Please check your software environment against the installation directions for your code.
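For PyTorch users, a quick sanity check of the software environment can be run before submitting a full job (this assumes PyTorch is installed; other frameworks provide analogous checks):

```python
# Environment sanity check for a PyTorch + CUDA setup (assumes PyTorch
# is installed; other GPU frameworks have analogous diagnostics).
import torch

print(torch.__version__)          # PyTorch version
print(torch.version.cuda)         # CUDA version PyTorch was built against
print(torch.cuda.is_available())  # True only if a usable GPU is visible
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```

If `torch.cuda.is_available()` prints False on a GPU node, the environment is misconfigured (wrong module, missing driver library, or a CPU-only build of the framework) and the job will silently fall back to the CPU.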
The third issue relates to interactive sessions. Please do not create salloc sessions or interactive jobs for long periods of time. For example, allocating a GPU for 24 hours is wasteful unless you plan to work intensively during the entire period. If you need interactive GPU access for development or testing, consider using shorter time allocations and only requesting the GPU when you're actively working.
Low GPU Utilization: What to Check
If you encounter low GPU utilization, perhaps showing only 15% or 20% in your jobstats report, please investigate the reasons for the low efficiency. There are several common causes worth examining.
First, review your application scripts and configuration files carefully. Be sure to read the documentation of the software to make sure that you are using it properly. This includes creating the appropriate software environment with all necessary modules loaded. Sometimes a single misconfigured parameter can prevent the GPU from being fully utilized.
Second, consider whether you're using appropriate hardware for your workload. Some codes simply do not have enough computational work to keep a full GPU busy. If your problem size is small or your code performs relatively simple operations, you may not see high GPU utilization no matter how well you configure things.
Third, if you're training deep learning models, make sure you're using multiple CPU cores for data loading. Codes such as PyTorch and TensorFlow show significant performance benefits when multiple CPU cores work in parallel to prepare and transfer data to the GPU. A common mistake is requesting only a single CPU core with a GPU, which creates a bottleneck in the data pipeline.
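One hedged pattern for keeping the worker count and the Slurm request in sync is to read the core count from the environment. Slurm sets SLURM_CPUS_PER_TASK inside a job; the dataset here is a placeholder:

```python
import os

import torch
from torch.utils.data import DataLoader, TensorDataset

# Sketch: size the DataLoader worker pool to the CPU cores Slurm
# allocated. SLURM_CPUS_PER_TASK is set inside a Slurm job; default
# to 1 when running outside one.
num_workers = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))

dataset = TensorDataset(torch.randn(128, 16), torch.randint(0, 2, (128,)))
loader = DataLoader(dataset, batch_size=32, num_workers=num_workers)
```

This way, increasing --cpus-per-task in the Slurm script automatically widens the data pipeline, avoiding the single-core bottleneck described above.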
Fourth, you may be using too many GPUs for your particular job. Not all codes scale efficiently across multiple GPUs, and sometimes using fewer GPUs with proper configuration yields better overall utilization. You can find the optimal number of GPUs and CPU cores by performing a scaling analysis with your specific workload.
Finally, pay attention to where your job writes output files. If you're actively writing large amounts of data during your job, make sure you're using the scratch filesystem rather than slower storage systems. This can significantly impact overall performance and GPU utilization.
Common Mistakes to Avoid
The most common mistake users make is running a CPU-only code on a GPU node. It bears repeating that only codes that have been explicitly written to run on a GPU can take advantage of GPU hardware. Read the documentation for the code that you are using to see if it can use a GPU, and if it can, verify that you're using the GPU-enabled version.
Another common mistake is to run a code that is written to work with a single GPU on multiple GPUs without modifying the code. TensorFlow, for example, will only take advantage of more than one GPU if your script is explicitly written to do so. Simply requesting multiple GPUs in your Slurm script does not automatically make your code use them.
Note that in all cases, whether your code actually used the GPU or not, your fairshare value will be reduced in proportion to the resources you requested in your Slurm script. This means that the priority of your next job will be decreased accordingly. Because of this, and to not waste shared resources, it is very important to make sure that you only request GPUs when you can efficiently utilize them. Use the jobstats command after your jobs complete to verify your GPU utilization and adjust your resource requests accordingly.