How to work with GPUs

Overview

Graphics Processing Units (GPUs) are used for specialized computing tasks such as machine learning and 3D modeling. They are designed to accelerate highly parallel workloads. If your application or program is not GPU-enabled, requesting a GPU will not improve performance.

This article covers available GPU resources, how to request a GPU in Slurm interactively and in a batch script, and how to confirm the GPU is being utilized.

We ask that users start by requesting a single GPU and confirm that their job is utilizing it before requesting additional GPUs. GPU jobs may remain pending for some time; this is normal, because GPU resources are limited and in high demand. Requesting a specific GPU type with a constraint may increase the wait time further.

Available GPU Resources

GPU Type    Nodes         GPUs per Node    VRAM per GPU (GB)
L40S        alh[1-6]      4                48
H200        msa[1-4]      4                141
V100        rom1          2                16
V100        voh[1,3-8]    4                16

Checking GPU Availability

GPUs are available in the gpu QoS.

You can check GPU availability with the ‘freenodes’ command: 

freenodes -g gpu

Requesting GPU resources

To request a GPU, submit your job to the gpu QoS and add the ‘--gres=gpu:1’ directive. Change the number to request more than one GPU. To target a particular GPU type, add the ‘--constraint’ directive with one of l40s, h200, or v100.
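If your program needs a certain amount of GPU memory, more than one GPU type may be acceptable, and listing all of them gives the scheduler more nodes to choose from. The sketch below is one way to build such a list from the VRAM figures in the table above. The function name gpu_constraint_for_vram is hypothetical, and joining names with ‘|’ relies on Slurm's standard OR syntax for constraints; it assumes the constraint names on this cluster are exactly l40s, h200, and v100 as listed in this article.

```shell
#!/bin/bash
# Print the GPU constraints whose VRAM per GPU (GB, from the table in this
# article) is at least the amount requested, joined with "|" (Slurm's OR).
gpu_constraint_for_vram() {
    need_gb=$1
    result=""
    for entry in v100:16 l40s:48 h200:141; do
        name=${entry%%:*}    # text before the colon
        vram=${entry##*:}    # text after the colon
        if [ "$vram" -ge "$need_gb" ]; then
            result="${result:+$result|}$name"
        fi
    done
    echo "$result"
}

# Example: a job that needs about 40 GB of GPU memory.
echo "--constraint=\"$(gpu_constraint_for_vram 40)\""
```

Quote the resulting value on the command line or in the #SBATCH line so the shell does not treat ‘|’ as a pipe. An empty result means no GPU type in the table has that much memory on a single GPU.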

In a batch script:

#SBATCH -q gpu
#SBATCH --gres=gpu:1

You can further specify a GPU type with its constraint: 

#SBATCH --constraint=h200 

An example GPU batch script is available for you to copy and adapt to your needs.

To copy the script into your current directory (the trailing ‘.’):

cp /wsu/el7/scripts/tutorial/gpu.sh .
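For orientation, the directives above can be combined into a complete batch script along the following lines. This is a minimal sketch and not the contents of the gpu.sh tutorial script: the job name, output file name, memory request, and time limit are illustrative assumptions, and the function name report_gpu_allocation is hypothetical. It also assumes that Slurm sets CUDA_VISIBLE_DEVICES for GPU jobs, which is common but depends on how the cluster is configured.

```shell
#!/bin/bash
# GPU QoS and a single GPU, as described in this article.
#SBATCH -q gpu
#SBATCH --gres=gpu:1
# One task with an assumed memory request and time limit.
#SBATCH -n 1
#SBATCH --mem=4G
#SBATCH -t 2:00:00
# Hypothetical job name and output file; %j expands to the job ID.
#SBATCH --job-name=gpu_check
#SBATCH -o gpu_check_%j.out

# Print where the job landed and which GPU(s) it can see.
report_gpu_allocation() {
    echo "Node: $(uname -n)"
    # Assumption: Slurm exports CUDA_VISIBLE_DEVICES for GPU jobs here.
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi -L    # list the GPUs visible to this job
    else
        echo "nvidia-smi not found on this node"
    fi
}

report_gpu_allocation

# Replace this comment with the command that runs your GPU-enabled program.
```

Having the job record its node and visible GPUs at the top of the output file makes it easier to confirm afterwards that a GPU was actually allocated.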

In an interactive job: 

srun -q gpu -n 1 --mem=4G --gres=gpu:1 -t 2:00:00 --pty bash

You can further specify a GPU type with its constraint: 

srun -q gpu -n 1 --mem=4G --constraint=h200 --gres=gpu:1 -t 2:00:00 --pty bash

You can also use our helper command: 

igpu 

Confirming GPU allocation and usage

There are a few commands you can use to confirm GPU information after your job starts.

Check which node was assigned

You can check which node was assigned to your job using:

qme

Connect to the node

Then ssh to the node your job is running on, for example:

ssh voh1

View GPU information

You can use the nvidia-smi command to view GPU information, including GPU memory usage, utilization, and active processes:

nvidia-smi

In this example, a CUDA program named cuda_example is running, and nvidia-smi is used to check whether it is running on the GPU. The program is listed under Processes and is consuming GPU memory on GPU 0.

To continuously monitor GPU usage while your application is running, use: 

watch -n 2 nvidia-smi

If nvidia-smi shows no GPU utilization while your program is running, then your application may not be using the GPU.
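For a quick check that does not require reading the full nvidia-smi table, nvidia-smi can also print selected fields as CSV through its --query-gpu and --format options. The sketch below is one way to flag idle GPUs from that output. The function name flag_idle_gpus is hypothetical, and treating exactly 0% utilization as idle is an assumed threshold: a program that is between GPU kernels can briefly read 0%, so sample more than once before drawing conclusions.

```shell
#!/bin/bash
# Read lines of "index, utilization %, memory used MiB, memory total MiB"
# on stdin and report whether each GPU looks busy or idle.
flag_idle_gpus() {
    while IFS=', ' read -r idx util used total; do
        # Skip blank lines.
        [ -z "$idx" ] && continue
        # Some GPUs report a non-numeric value for utilization.
        case "$util" in
            ''|*[!0-9]*)
                echo "GPU $idx: utilization not reported"
                continue
                ;;
        esac
        if [ "$util" -eq 0 ]; then
            echo "GPU $idx: idle (0% utilization, $used of $total MiB in use)"
        else
            echo "GPU $idx: busy (${util}% utilization, $used of $total MiB in use)"
        fi
    done
}

# On a GPU node, feed the function live data from nvidia-smi.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
               --format=csv,noheader,nounits | flag_idle_gpus
fi
```

Run this on the node where your job is running, while your program is active.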

GPU-related modules

CUDA

CUDA is a parallel computing platform and programming model developed by NVIDIA for computational tasks on GPUs. With CUDA, programmers can significantly speed up computations by running them on GPUs.

CUDA is available as a module on the Grid.

To search for CUDA modules, run the following command:

module spider cuda

Containers are also available for CUDA versions 12.9, 13.0, and 13.2.1.
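After loading a CUDA module, it is worth confirming that the compiler is on your PATH before you build anything. The sketch below assumes the module provides nvcc, which is typical for CUDA toolkit modules. The function name check_nvcc is hypothetical, and the module name to load should be taken from the module spider output rather than from this example.

```shell
#!/bin/bash
# Run this after loading one of the CUDA modules reported by
# "module spider cuda". Module names are site-specific, so none is
# spelled out here.
check_nvcc() {
    if command -v nvcc >/dev/null 2>&1; then
        echo "nvcc found at: $(command -v nvcc)"
        nvcc --version
    else
        echo "nvcc not found; load a CUDA module first"
        return 1
    fi
}

check_nvcc || true
```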

NVHPC SDK

The NVIDIA HPC SDK C, C++, and Fortran compilers support GPU acceleration of HPC modeling and simulation applications with standard C++ and Fortran, OpenACC® directives, and CUDA®. GPU-accelerated math libraries maximize performance on common HPC algorithms, and optimized communications libraries enable standards-based multi-GPU and scalable systems programming. Performance profiling and debugging tools simplify porting and optimization of HPC applications, and containerization tools enable easy deployment on-premises or in the cloud. With support for NVIDIA GPUs and Arm or x86-64 CPUs running Linux, the HPC SDK provides the tools you need to build NVIDIA GPU-accelerated HPC applications.

NVHPC is available as a module on the Grid.

To search for NVHPC modules, run the following command:

module spider nvhpc