CPU Memory
An Out-of-Memory (OOM) error occurs when your job tries to use more RAM than you requested in your Slurm script. If the memory limit you set is lower than what your program actually needs, the job will exceed that limit and be terminated. To avoid this, it’s important to request enough memory based on the size of your data, the tools you’re using, and the expected workload.
You can request memory in your Slurm script using directives such as:
#SBATCH --mem=8G # total memory for the job
or
#SBATCH --mem-per-cpu=4G # memory per CPU core
Choosing an appropriate memory value helps prevent OOM failures and ensures that your job runs smoothly from start to finish. Allocating an excessive amount will increase your queue time and decrease the priority of your subsequent jobs.
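As an illustration, the directives above fit into a complete batch script along these lines (the job name, time limit, and program name here are placeholders, not a site-specific recipe):

```shell
#!/bin/bash
#SBATCH --job-name=mem-demo      # placeholder job name
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G                 # total memory for the job
#SBATCH --time=01:00:00          # placeholder time limit

# Replace with your actual program
srun ./my_program
```

With --mem-per-cpu=4G instead of --mem=8G, the same script would be granted 4 GB for each of the four CPU cores, for 16 GB in total.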
Estimating Memory Requirements
How do you know how much memory to request? For a simple code, one can look at the data structures that are used and calculate it by hand. For instance, a code that declares an array of 1 million elements in double precision will require 8 MB since a double requires 8 bytes. For other cases, such as a pre-compiled executable or a code that dynamically allocates memory during execution, estimating the memory requirement is much harder. Two approaches are described next.
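The back-of-the-envelope calculation above can be checked directly in the shell; this sketch assumes the 1 million double-precision elements from the example:

```shell
# 1 million double-precision elements, 8 bytes per double
elements=1000000
bytes_per_double=8
total_bytes=$((elements * bytes_per_double))
echo "${total_bytes} bytes = $((total_bytes / 1000000)) MB"   # prints "8000000 bytes = 8 MB"
```

The same arithmetic scales to any array: multiply the element count by the size of the element type.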
Checking the Memory Usage of a Running Job
The easiest way to see the memory usage of a job is to use the "jobstats" command on a given JobID:
$ jobstats 1234567
In some cases an estimate of the required memory can be obtained by running the code on a laptop or workstation and watching its memory usage. On Linux, run htop -u $USER and look at the RES column for the process of interest; on a Mac, use the Activity Monitor, which is found in /Applications/Utilities.
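On Linux systems without htop, a rough equivalent is to sort your own processes by resident set size with ps (a sketch; the --sort option is specific to GNU procps, and RSS is reported in kilobytes):

```shell
# List your five largest processes by resident memory (RSS, in KB)
ps -u "$(id -un)" -o pid,rss,comm --sort=-rss | head -n 6
```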
In summary, if you request too little memory then your job will fail with an out-of-memory (OOM) error. If you request an excessive amount then the job will run successfully, but you may have to wait longer than necessary for it to start. Use Slurm email reports and jobstats to set the requested memory for future jobs, and be sure to request slightly more memory than you think you will need as a margin of safety.
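If Slurm accounting is enabled on your cluster, the standard sacct command can also report the peak memory (MaxRSS) of a completed job. In this sketch the JobID is a placeholder, and the guard simply lets the snippet run on machines without Slurm installed:

```shell
# Show requested vs. peak memory for a finished job (JobID is a placeholder)
if command -v sacct >/dev/null 2>&1; then
    sacct -j 1234567 --format=JobID,ReqMem,MaxRSS,State
else
    echo "sacct not available on this machine"
fi
```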
Memory per Node on Discovery
Each node on the Discovery cluster has a different amount of physical memory, depending on its hardware. To check how much memory is available on the nodes, you can run:
scontrol show nodes | grep -i RealMemory

This command prints the total RAM (in MB) for every node on the system. Discovery currently includes several node types, some with around 200 GB, some with 250 GB, some with 500 GB, and a few large-memory nodes with about 1 TB of RAM.
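Because RealMemory is reported in MB, a quick conversion to GB is often handy. This sketch parses a sample line in the same format that scontrol prints; the memory values shown are hypothetical:

```shell
# Sample line mimicking 'scontrol show nodes' output (hypothetical values)
sample="   RealMemory=192000 AllocMem=0 FreeMem=180000 Sockets=2"
mb=$(printf '%s\n' "$sample" | grep -oE 'RealMemory=[0-9]+' | cut -d= -f2)
echo "$((mb / 1024)) GB"   # prints "187 GB"
```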
Because Slurm on Discovery does not enforce a default memory limit, your job can use up to the full memory of the node unless you explicitly request a smaller amount in your Slurm script.
Why Memory Requests Matter
If your job needs more memory than you request using --mem or --mem-per-cpu, it will eventually run out of RAM and be killed by the system (an Out-of-Memory, or OOM, event). Choosing a memory value that is too small is the most common cause of OOM failures.
To avoid this, always request enough memory for your workload. For example:
#SBATCH --mem=32G
or:
#SBATCH --mem-per-cpu=8G
You can request any amount of memory up to the total available on the node. Just remember to leave a few gigabytes for the operating system.
GPU Memory
GPU memory works differently from CPU memory. When you request a GPU, you automatically get all of the memory on that GPU; there is no Slurm option for setting GPU memory limits. If your application tries to use more GPU memory than the GPU has, it will produce a “CUDA out of memory” error and stop.
The most common cause of GPU memory errors is setting the batch size too large when training neural networks. Reducing the batch size usually fixes the issue.
You can monitor GPU memory on an active job by SSHing into the compute node and running:
nvidia-smi
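A more targeted query shows just the memory columns (these nvidia-smi flags are standard, though output formatting can vary by driver version); the guard in this sketch lets it run harmlessly on a machine without a GPU:

```shell
# Report used vs. total GPU memory; falls back gracefully without a GPU
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=memory.used,memory.total --format=csv
else
    echo "nvidia-smi not found on this machine"
fi
```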
GPU memory usage over time can also be viewed through jobstats, but GPU details appear only after your job has completed.
