Memory Allocation for Slurm Scripts

Summary

This page explains how to correctly request memory in Slurm scripts on the SDSU Discovery HPC cluster and how to diagnose memory-related errors. On Discovery, the word “memory” always means RAM, not file storage.

Body

CPU Memory

An Out-of-Memory (OOM) error occurs when your job tries to use more RAM than you requested in your Slurm script. If the memory limit you set is lower than what your program actually needs, the job will exceed that limit and be terminated. To avoid this, it’s important to request enough memory based on the size of your data, the tools you’re using, and the expected workload.

You can request memory in your Slurm script using directives such as:

#SBATCH --mem=8G          # total memory for the job

or

#SBATCH --mem-per-cpu=4G  # memory per CPU core

Choosing an appropriate memory value helps prevent OOM failures and ensures that your job runs smoothly from start to finish. Requesting far more memory than you need will increase your queue time and decrease the priority of your subsequent jobs.
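For context, here is a minimal sketch of a complete batch script using one of the directives above. The job name, resource values, and ./my_program are placeholders, not Discovery-specific requirements:

```shell
#!/bin/bash
#SBATCH --job-name=mem-demo      # placeholder job name
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=4G         # 4 cores x 4G = 16G total for the job
#SBATCH --time=01:00:00

# Replace with your actual program; ./my_program is a placeholder.
./my_program
```

Note that --mem and --mem-per-cpu are mutually exclusive in Slurm: use one or the other, not both, in the same script.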

Estimating Memory Requirements

How do you know how much memory to request? For a simple program, you can look at the data structures it uses and calculate the requirement by hand. For instance, a program that declares an array of 1 million double-precision elements needs 8 MB, since each double occupies 8 bytes. For other cases, such as a pre-compiled executable or a program that allocates memory dynamically during execution, estimating the requirement is much harder. Two approaches are described next.
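The hand calculation above can be written out as a short back-of-envelope script (the element count and size come straight from the example in the text):

```python
# Hand-estimating memory for the array example above:
# 1 million double-precision elements, 8 bytes each.
n_elements = 1_000_000
bytes_per_element = 8  # sizeof(double) on typical 64-bit systems
total_bytes = n_elements * bytes_per_element
print(f"{total_bytes / 10**6} MB")  # 8.0 MB
```

Remember that this only counts one data structure; the program's code, libraries, and other allocations add to the total, which is one reason to pad your request.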

Checking the Memory Usage of a Running Job

The easiest way to see the memory usage of a job is to run the jobstats command on its JobID:

$ jobstats 1234567

In some cases you can estimate the required memory by running the code on a laptop or workstation and watching it with the Linux command htop -u $USER, or with Activity Monitor on a Mac (found in /Applications/Utilities). In htop, look at the RES (resident memory) column for the process of interest.

In summary, if you request too little memory, your job will fail with an out-of-memory (OOM) error. If you request an excessive amount, the job will run successfully, but you may wait longer than necessary for it to start. Use Slurm email reports and jobstats to set the requested memory for future jobs, and request slightly more than you expect to need as a safety margin.

Memory per Node on Discovery

Each node on the Discovery cluster has a different amount of physical memory, depending on its hardware. To check how much memory is available on the nodes, you can run:

scontrol show nodes | grep -i RealMemory


This command prints the total RAM (in MB) for every node on the system. Discovery currently includes several node types, some with around 200 GB, some with 250 GB, some with 500 GB, and a few large-memory nodes with about 1 TB of RAM.
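Since RealMemory is reported in MB, a quick pipeline can convert it to GB for easier reading. The sample values below are illustrative, not actual Discovery output:

```shell
# Sketch: convert RealMemory values (reported in MB) to GB.
# Sample input lines are made up for illustration.
printf 'RealMemory=204800\nRealMemory=1024000\n' |
  awk -F= '{printf "%.0f GB\n", $2 / 1024}'
# prints:
# 200 GB
# 1000 GB
```

In practice you would pipe the output of the scontrol command above into the awk stage instead of printf.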

Because Slurm on Discovery does not enforce a default memory limit, your job can use up to the full memory of the node unless you explicitly request a smaller amount in your Slurm script.

Why Memory Requests Matter

If your job needs more memory than you request using --mem or --mem-per-cpu, it will eventually run out of RAM and be killed by the system (an Out-of-Memory, or OOM, event). Choosing a memory value that is too small is the most common cause of OOM failures.

To avoid this, always request enough memory for your workload. For example:

#SBATCH --mem=32G

or:

#SBATCH --mem-per-cpu=8G

You can request any amount of memory up to the total available on the node. Just remember to leave a few gigabytes for the operating system.

GPU Memory

GPU memory works differently from CPU memory. When you request a GPU, you automatically get all of its memory; there is no Slurm option to set GPU memory limits. If your application tries to use more GPU memory than the GPU has, it will fail with a "CUDA out of memory" error.

The most common cause of GPU memory errors is setting the batch size too large when training neural networks. Reducing the batch size usually fixes the issue.
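A rough sketch shows why batch size dominates GPU memory use. All numbers below are hypothetical, chosen only to illustrate the scaling, not taken from any real model:

```python
# Rough sketch: GPU memory grows linearly with batch size.
# All values are illustrative assumptions, not real measurements.
batch_size = 256
activations_per_sample_mb = 50  # hypothetical activation memory per sample
model_mb = 2_000                # hypothetical weights + optimizer state

needed_mb = model_mb + batch_size * activations_per_sample_mb
print(f"{needed_mb / 1024:.1f} GB needed")  # 14.5 GB needed

# Halving the batch size roughly halves the activation memory:
needed_mb = model_mb + (batch_size // 2) * activations_per_sample_mb
print(f"{needed_mb / 1024:.1f} GB needed")  # 8.2 GB needed
```

The fixed cost (model weights) stays constant, so cutting the batch size is usually the quickest way to fit a job onto a smaller GPU.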

You can monitor GPU memory on an active job by SSHing into the compute node and running:

nvidia-smi

GPU memory usage over time can also be viewed with jobstats, though the GPU details appear only after the job has completed.


Details

Article ID: 164517
Created
Tue 11/25/25 12:43 PM
Modified
Tue 11/25/25 3:46 PM