Submitting Jobs with Slurm

What is Slurm?

Slurm (Simple Linux Utility for Resource Management) is an open-source job scheduler and workload manager used on both the Innovator and Discovery HPC clusters at SDSU. Slurm performs several important functions:

  • Allocates time and resources on worker nodes to perform a job
  • Allows users to start, monitor, and manage jobs running on worker nodes
  • Queues and balances job submissions fairly across all users on the cluster

All jobs on both Innovator and Discovery must be submitted through Slurm. You should never run computationally intensive tasks directly on the login node.

Partitions (Node Types)

A partition, also known as a queue, is a subset of the cluster nodes that share the same characteristics. Users can specify which partition to run a job on. If no partition is specified, the job will run on the default compute partition.

Innovator Partitions

| Partition | Time Limit | Memory per Node | CPUs per Node | Nodes | Best For |
|-----------|------------|-----------------|---------------|-------|----------|
| compute   | 14 days    | 256 GB          | 48            | 46    | General purpose jobs (default) |
| bigmem    | 14 days    | 2 TB            | 48            | 4     | Memory intensive jobs |
| gpu       | 14 days    | 512 GB          | 48            | 14    | GPU and machine learning jobs (2x NVIDIA A100 80GB per node) |
| quickq    | 12 hours   | 256 GB          | 48            | 46    | Short jobs and testing |

In plain terms for Innovator:

  • Use compute for most general research jobs
  • Use bigmem if your job needs more than 256 GB of memory
  • Use gpu if your job requires GPU acceleration such as deep learning or machine learning
  • Use quickq for short test runs under 12 hours — jobs start faster here
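Putting the table into practice, a partition is selected at submission time with the -p (or --partition) flag. A few illustrative examples (myjob.slurm is a placeholder script name):

```shell
# Interactive shell on the short-turnaround partition (12 hour limit)
srun -p quickq --pty bash

# Batch job on the big-memory partition
sbatch -p bigmem myjob.slurm

# Batch job on the GPU partition, requesting one of the node's A100s
sbatch -p gpu --gres=gpu:1 myjob.slurm
```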

Discovery Partitions

| Partition | Time Limit | Memory per Node | CPUs per Node | Nodes | Best For |
|-----------|------------|-----------------|---------------|-------|----------|
| compute   | 14 days    | 256 GB          | 48            | 10 (includes 2 big memory) | General purpose jobs (default) |
| gpu       | 14 days    | 512 GB          | 48            | 5     | GPU jobs (2x GPU per node) |
| all-gpu   | 14 days    | 512 GB - 1 TB   | 48            | 7     | All GPU nodes, including large GPU nodes (lg001, lg002) with 4x GPUs and 1 TB RAM |

In plain terms for Discovery:

  • Use compute for general research jobs
  • Use gpu for standard GPU jobs — each node has 2 GPUs and 512 GB RAM
  • Use all-gpu for jobs needing maximum GPU resources — includes large GPU nodes with 4 GPUs and 1 TB RAM each
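As a sketch, the large Discovery GPU nodes can be reached by requesting all four GPUs on the all-gpu partition, or by naming a node directly with the -w (--nodelist) flag. The node name lg001 below is taken from the table above; myjob.slurm is a placeholder:

```shell
# Interactive session using all 4 GPUs on one of the large GPU nodes
srun -p all-gpu --gres=gpu:4 --pty bash

# Pin a batch job to a specific large GPU node by name
sbatch -p all-gpu -w lg001 --gres=gpu:4 myjob.slurm
```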

Viewing Partition and Job Status

These commands work the same on both Innovator and Discovery.

To view the current state of all partitions and nodes:

[john.doe@jacks.local@cllogin002 ~]$ sinfo

To view only your own jobs:

[john.doe@jacks.local@cllogin002 ~]$ squeue -u $USER

To monitor your jobs and refresh every 30 seconds:

[john.doe@jacks.local@cllogin002 ~]$ watch -n 30 squeue -u $USER

Press Ctrl+C to stop the watch display.
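The default squeue columns truncate long job names. If that becomes a problem, the output can be widened with the -o flag and standard squeue format codes (the field widths below are just one reasonable choice):

```shell
# Job ID, partition, 30-character job name, state, runtime, node count, node list/reason
squeue -u $USER -o "%.10i %.9P %.30j %.8T %.10M %.6D %R"
```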

Job Types

There are two main types of jobs you can run on both Innovator and Discovery:

Interactive Jobs — the user requests a node via Slurm and runs commands directly on the command line. Interactive jobs end if the user logs off the cluster. Best for testing, debugging, and short tasks.

Batch Jobs — jobs designed to run one or more scripts without user interaction. The job is submitted to the scheduler using a job submission file (sbatch file). These jobs continue running even if the user logs off. Output goes to a log file instead of the terminal. Best for long running research jobs.

Running an Interactive Job

Interactive jobs are started with the srun command. These examples work on both Innovator and Discovery.

To request one node on the default compute partition:

[john.doe@jacks.local@cllogin002 ~]$ srun --pty bash

[john.doe@jacks.local@node040 ~]$

To request a big memory node:

[john.doe@jacks.local@cllogin002 ~]$ srun --pty -p bigmem bash

[john.doe@jacks.local@bigmem003 ~]$

To request a GPU node with 1 GPU for 1 hour:

[john.doe@jacks.local@cllogin002 ~]$ srun -N 1 -n 40 --time=1:00:00 --partition=gpu --gres=gpu:1 --pty bash

[john.doe@jacks.local@gpu001 ~]$
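Once the interactive GPU session starts, it can be worth confirming that the requested GPU was actually allocated. Running nvidia-smi here assumes the NVIDIA driver tools are on the GPU nodes' default path, which is typical but not confirmed by this article:

```shell
# List the GPU indices Slurm assigned to this job
echo $CUDA_VISIBLE_DEVICES

# Show the allocated GPU(s) and their current utilization
nvidia-smi
```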

Running a Batch Job

To run a batch job, write a job submission script containing lines prefixed with #SBATCH that tell Slurm what resources to allocate. This works the same on both Innovator and Discovery — just specify the correct partition for the cluster you are using.

Example batch job script:

#!/bin/bash
#SBATCH --job-name=myjob          # Job name
#SBATCH --nodes=1                 # Number of nodes
#SBATCH --ntasks-per-node=4       # Tasks per node (one CPU core each; max 48 on all nodes)
#SBATCH --output=log.log          # Output log file name
#SBATCH --partition=compute       # Partition: see partition tables above
#SBATCH --time=1-00:00:00         # Time limit: days-hours:minutes:seconds

module load <module name>

## Add any additional modules above this line
## Your job commands go below this line

Save the file with a .slurm extension and submit it using:

[john.doe@jacks.local@cllogin002 ~]$ sbatch myjob.slurm

Submitted batch job 334
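The same pattern extends to GPU work by adding a --gres line and switching the partition. A minimal sketch, in which the job name, task count, log file, and time limit are placeholder values to adjust for your job:

```shell
#!/bin/bash
#SBATCH --job-name=gpujob         # Job name (placeholder)
#SBATCH --nodes=1                 # Number of nodes
#SBATCH --ntasks-per-node=8       # Tasks per node
#SBATCH --gres=gpu:1              # Number of GPUs per node
#SBATCH --partition=gpu           # GPU partition (see tables above)
#SBATCH --output=gpu.log          # Output log file name
#SBATCH --time=0-04:00:00         # Time limit: days-hours:minutes:seconds

module load <module name>

## Your GPU job commands go below this line
```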

Cancelling a Job

To cancel a specific job using its job ID:

[john.doe@jacks.local@cllogin002 ~]$ scancel 12243

To cancel all your jobs at once:

[john.doe@jacks.local@cllogin002 ~]$ scancel -u $USER
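scancel also accepts filters beyond the job ID. Two variants that can be handy (both are standard Slurm options; myjob is a placeholder job name):

```shell
# Cancel only your jobs that are still waiting in the queue
scancel -u $USER --state=PENDING

# Cancel all of your jobs with a given job name
scancel -u $USER --name=myjob
```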

Common Slurm Commands Reference

For a full list of Slurm commands, refer to the dedicated article: Slurm Cluster Resource Manager Commands

Questions or Problems

If you have any questions or need assistance with job submissions on either Innovator or Discovery, contact the SDSU RCi team.
