Submitting Jobs with Slurm

What is Slurm?

Slurm (Simple Linux Utility for Resource Management) is an open-source job scheduler and workload manager used on both the Innovator and Discovery HPC clusters at SDSU. Slurm performs several important functions:

  • Allocates time and resources on worker nodes to perform a job
  • Allows users to start, monitor, and manage jobs running on worker nodes
  • Queues and balances job submissions fairly across all users on the cluster

All jobs on both Innovator and Discovery must be submitted through Slurm. You should never run computationally intensive tasks directly on the login node.

Partitions (Node Types)

A partition, also known as a queue, is a subset of the cluster nodes that share the same characteristics. Users can specify which partition to run a job on. If no partition is specified, the job will run on the default compute partition.

Innovator Partitions

| Partition | Time Limit | Memory per Node | CPUs per Node | Nodes | Best For |
|-----------|------------|-----------------|---------------|-------|----------|
| compute   | 14 days    | 256 GB          | 48            | 46    | General purpose jobs (default) |
| bigmem    | 14 days    | 2 TB            | 48            | 4     | Memory intensive jobs |
| gpu       | 14 days    | 512 GB          | 48            | 14    | GPU and machine learning jobs (2x NVIDIA A100 80GB per node) |
| quickq    | 12 hours   | 256 GB          | 48            | 46    | Short jobs and testing |

In plain terms for Innovator:

  • Use compute for most general research jobs
  • Use bigmem if your job needs more than 256 GB of memory
  • Use gpu if your job requires GPU acceleration such as deep learning or machine learning
  • Use quickq for short test runs under 12 hours — jobs start faster here
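Putting the table into practice, a partition is selected at submission time with the -p (or --partition) flag. A few illustrative examples (myjob.slurm is a placeholder script name):

```shell
# Interactive shell on the short-turnaround partition (12 hour limit)
srun -p quickq --pty bash

# Batch job on the big-memory partition
sbatch -p bigmem myjob.slurm

# Batch job on the GPU partition, requesting one of the node's A100s
sbatch -p gpu --gres=gpu:1 myjob.slurm
```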

Discovery Partitions

| Partition | Time Limit | Memory per Node | CPUs per Node | Nodes | Best For |
|-----------|------------|-----------------|---------------|-------|----------|
| compute   | 14 days    | 256 GB          | 48            | 10 (includes 2 big memory) | General purpose jobs (default) |
| gpu       | 14 days    | 512 GB          | 48            | 5     | GPU jobs (2x GPU per node) |
| all-gpu   | 14 days    | 512 GB - 1 TB   | 48            | 7     | All GPU nodes, including large GPU nodes (lg001, lg002) with 4x GPUs and 1 TB RAM |

In plain terms for Discovery:

  • Use compute for general research jobs
  • Use gpu for standard GPU jobs — each node has 2 GPUs and 512 GB RAM
  • Use all-gpu for jobs needing maximum GPU resources — includes large GPU nodes with 4 GPUs and 1 TB RAM each
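As a sketch, the large Discovery GPU nodes can be reached by requesting all four GPUs on the all-gpu partition, or by naming a node directly with the -w (--nodelist) flag. The node name lg001 below is taken from the table above; myjob.slurm is a placeholder:

```shell
# Interactive session using all 4 GPUs on one of the large GPU nodes
srun -p all-gpu --gres=gpu:4 --pty bash

# Pin a batch job to a specific large GPU node by name
sbatch -p all-gpu -w lg001 --gres=gpu:4 myjob.slurm
```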

Viewing Partition and Job Status

These commands work the same on both Innovator and Discovery.

To view the current state of all partitions and nodes:

[john.doe@jacks.local@cllogin002 ~]$ sinfo

To view only your own jobs:

[john.doe@jacks.local@cllogin002 ~]$ squeue -u $USER

To monitor your jobs and refresh every 30 seconds:

[john.doe@jacks.local@cllogin002 ~]$ watch -n 30 squeue -u $USER

Press Ctrl+C to stop the watch display.
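The default squeue columns truncate long job names. If that becomes a problem, the output can be widened with the -o flag and standard squeue format codes (the field widths below are just one reasonable choice):

```shell
# Job ID, partition, 30-character job name, state, runtime, node count, node list/reason
squeue -u $USER -o "%.10i %.9P %.30j %.8T %.10M %.6D %R"
```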

Job Types

There are two main types of jobs you can run on both Innovator and Discovery:

Interactive Jobs — the user requests a node via Slurm and runs commands directly on the command line. Interactive jobs end if the user logs off the cluster. Best for testing, debugging, and short tasks.

Batch Jobs — jobs designed to run one or more scripts without user interaction. The job is submitted to the scheduler using a job submission file (sbatch file). These jobs continue running even if the user logs off. Output goes to a log file instead of the terminal. Best for long running research jobs.

Running an Interactive Job

Interactive jobs are started with the srun command. These examples work on both Innovator and Discovery.

To request one node on the default compute partition:

[john.doe@jacks.local@cllogin002 ~]$ srun --pty bash

[john.doe@jacks.local@node040 ~]$

To request a big memory node:

[john.doe@jacks.local@cllogin002 ~]$ srun --pty -p bigmem bash

[john.doe@jacks.local@bigmem003 ~]$

To request a GPU node with 1 GPU for 1 hour:

[john.doe@jacks.local@cllogin002 ~]$ srun -N 1 -n 40 --time=1:00:00 --partition=gpu --gres=gpu:1 --pty bash

[john.doe@jacks.local@gpu001 ~]$
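Once the interactive GPU session starts, it can be worth confirming that the requested GPU was actually allocated. Running nvidia-smi here assumes the NVIDIA driver tools are on the GPU nodes' default path, which is typical but not confirmed by this article:

```shell
# List the GPU indices Slurm assigned to this job
echo $CUDA_VISIBLE_DEVICES

# Show the allocated GPU(s) and their current utilization
nvidia-smi
```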

Running a Batch Job

To run a batch job, write a job submission script containing lines prefixed with #SBATCH that tell Slurm what resources to allocate. This works the same on both Innovator and Discovery — just specify the correct partition for the cluster you are using.

Example batch job script:

#!/bin/bash
#SBATCH --job-name=myjob          # Job name
#SBATCH --nodes=1                 # Number of nodes
#SBATCH --ntasks-per-node=4       # Tasks per node (one CPU core each; max 48 on all nodes)
#SBATCH --output=log.log          # Output log file name
#SBATCH --partition=compute       # Partition: see partition tables above
#SBATCH --time=1-00:00:00         # Time limit: days-hours:minutes:seconds

module load <module name>

## Add any additional modules above this line
## Your job commands go below this line

Save the file with a .slurm extension and submit it using:

[john.doe@jacks.local@cllogin002 ~]$ sbatch myjob.slurm

Submitted batch job 334
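The same pattern extends to GPU work by adding a --gres line and switching the partition. A minimal sketch, in which the job name, task count, log file, and time limit are placeholder values to adjust for your job:

```shell
#!/bin/bash
#SBATCH --job-name=gpujob         # Job name (placeholder)
#SBATCH --nodes=1                 # Number of nodes
#SBATCH --ntasks-per-node=8       # Tasks per node
#SBATCH --gres=gpu:1              # Number of GPUs per node
#SBATCH --partition=gpu           # GPU partition (see tables above)
#SBATCH --output=gpu.log          # Output log file name
#SBATCH --time=0-04:00:00         # Time limit: days-hours:minutes:seconds

module load <module name>

## Your GPU job commands go below this line
```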

Cancelling a Job

To cancel a specific job using its job ID:

[john.doe@jacks.local@cllogin002 ~]$ scancel 12243

To cancel all your jobs at once:

[john.doe@jacks.local@cllogin002 ~]$ scancel -u $USER
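scancel also accepts filters beyond the job ID. Two variants that can be handy (both are standard Slurm options; myjob is a placeholder job name):

```shell
# Cancel only your jobs that are still waiting in the queue
scancel -u $USER --state=PENDING

# Cancel all of your jobs with a given job name
scancel -u $USER --name=myjob
```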

Common Slurm Commands Reference

For a full list of Slurm commands, refer to the dedicated article: Slurm Cluster Resource Manager Commands

Questions or Problems

If you have any questions or need assistance with job submissions on either Innovator or Discovery, contact the SDSU RCi team.
