General HPC Questions
Q1: What HPC clusters are available at SDSU?
SDSU has two HPC clusters: Innovator and Discovery. Innovator is available to all Board of Regents (BOR) institutions including SDSU, USD, SDSMT, DSU, and BHSU. Discovery is focused on SDSU faculty, staff, and students who have priority access.
Q2: What is the difference between Innovator and Discovery?
Innovator is a BOR-wide resource available to all South Dakota Board of Regents institutions. It has 46 compute nodes, 4 big memory nodes, 14 GPU nodes with NVIDIA A100 80GB GPUs, and 3 PB of storage. Discovery is focused on SDSU users and has a stronger GPU focus with NVIDIA H100 80GB GPUs, including large GPU nodes with 4 GPUs and 1 TB RAM each. Discovery has 200 Gbps Infiniband while Innovator has 100 Gbps.
Q3: How do I request access to the HPC clusters?
Complete the RCi HPC onboarding form at: https://help.sdstate.edu/TDClient/2744/Portal/Requests/TicketRequests/NewForm?ID=BJcNDsievG4_&RequestorType=Service. Once submitted, the RCi team will reach out and schedule a quick onboarding meeting if needed.
Q4: Who can use Innovator?
Innovator is available to all Board of Regents institutions: South Dakota State University (SDSU), University of South Dakota (USD), South Dakota School of Mines and Technology (SDSMT), Dakota State University (DSU), and Black Hills State University (BHSU).
Q5: Who can use Discovery?
Discovery is focused on SDSU faculty, staff, and students who have priority access. While other BOR users may have access, SDSU users are given priority on this resource.
Innovator Cluster
Q6: What are the hardware specifications of Innovator?
Innovator consists of: 46 Compute Nodes (Dell PowerEdge R650, 2x Intel Xeon Gold 6342 @ 2.80GHz, 48 cores, 256 GB RAM), 4 Big Memory Nodes (Dell PowerEdge R750, 48 cores, 2 TB RAM expandable to 4 TB), and 14 GPU Nodes (Dell PowerEdge R750, 48 cores, 512 GB RAM, 2x NVIDIA A100 80GB per node). Total: 3,072 CPU cores.
Q7: What storage is available on Innovator?
Innovator is attached to a 3 PB Arcastream Pixstor GPFS parallel filesystem: 2 PB usable research storage, 512 TB Flash Tier for faster read/write speeds, and 512 TB for RCi software. Each user gets 100 GB home directory quota. Scratch storage is available on request with no quota but a data expiration policy will apply.
Q8: What are the partitions on Innovator?
Innovator has 4 partitions: compute (46 nodes, 256 GB RAM, 14-day limit, default), bigmem (4 nodes, 2 TB RAM, 14-day limit), gpu (14 nodes, 512 GB RAM, 2x NVIDIA A100 80GB, 14-day limit), and quickq (46 nodes, 256 GB RAM, 12-hour limit for short/test jobs).
Q9: What GPU is available on Innovator?
Innovator GPU nodes have 2x NVIDIA A100 80GB cards per node across 14 GPU nodes. Use the gpu partition to access these resources with --gres=gpu:1 or --gres=gpu:2 in your job script.
Q10: What is the home directory path on Innovator?
For SDSU users the home directory is /home/jacks.local/username. For SDSMT users it is /home/SDSMT.LOCAL/username (case sensitive). The scratch directory follows the same format: /scratch/jacks.local/username.
Q11: What operating system does Innovator run?
Innovator runs on Rocky 9 Linux.
Q12: What is the network speed on Innovator?
Innovator uses 100 Gbps Infiniband for cluster data application processing and science data transfers, and 1 Gbps for cluster management.
Discovery Cluster
Q13: What are the hardware specifications of Discovery?
Discovery consists of: 10 Compute Nodes (Dell PowerEdge R650, 48 cores, 256 GB RAM), 2 Big Memory Nodes (Dell PowerEdge R750, 48 cores, 2 TB RAM), 5 Standard GPU Nodes (Dell PowerEdge R760xa, 48 cores, 512 GB RAM, 2x NVIDIA H100 80GB), and 2 Large GPU Nodes (Dell PowerEdge XE8640, 48 cores, 1 TB RAM, 4x NVIDIA H100 80GB SXM4). Total: 1,000 CPU cores with a primary focus on GPU resources.
Q14: What are the partitions on Discovery?
Discovery has 3 partitions: compute (10 nodes including 2 big memory, 256 GB RAM, 14-day limit, default), gpu (5 standard GPU nodes, 512 GB RAM, 2 GPUs per node, 14-day limit), and all-gpu (7 nodes including large GPU nodes lg001 and lg002 with 4 GPUs and 1 TB RAM each, 14-day limit).
Q15: What GPU is available on Discovery?
Discovery has NVIDIA H100 80GB GPUs. Standard GPU nodes (g001-g005) have 2x H100 80GB each. Large GPU nodes (lg001 and lg002) have 4x H100 80GB SXM4 each with 1 TB RAM. Use the gpu partition for standard GPU jobs or all-gpu partition to access all GPU nodes including the large ones.
Q16: What storage is available on Discovery?
Discovery is attached to a 1.6 PB RAW Arcastream Pixstor GPFS parallel filesystem. Each user gets 100 GB home directory quota. Scratch storage is available on request with no quota but a Scratch Data Retention Schedule will be applied.
Q17: What is the home directory path on Discovery?
For SDSU users the home directory is /home/jacks.local/username. The scratch directory is /scratch/jacks.local/username. Directory paths are case sensitive.
Q18: What operating system does Discovery run?
Discovery runs on RHEL 9 Linux.
Q19: What is the network speed on Discovery?
Discovery uses 200 Gbps Infiniband for cluster data application processing and science data transfers, and 1 Gbps for cluster management.
Q20: What is the difference between the gpu and all-gpu partitions on Discovery?
The gpu partition includes only the 5 standard GPU nodes (g001-g005) each with 2x NVIDIA H100 80GB and 512 GB RAM. The all-gpu partition includes all 7 GPU nodes including the 2 large GPU nodes (lg001 and lg002) which have 4x NVIDIA H100 80GB SXM4 and 1 TB RAM each. Use all-gpu when you need the large GPU nodes.
Logging into the Clusters
Q21: How do I log into Innovator via SSH?
Use SSH with your username in this format:
ssh john.doe@jacks.local@innovator.sdstate.edu
Replace john.doe with your first.last name. For students use jdoe@jacks.local format. You can use any SSH client including MobaXterm, PuTTY, or the terminal on Mac/Linux.
Q22: How do I log into Discovery via SSH?
Use SSH with your username in this format:
ssh john.doe@jacks.local@discovery.sdstate.edu
Replace john.doe with your first.last name. For students use jdoe@jacks.local format.
Q23: How do I log in using MobaXterm?
Open MobaXterm, click Session, select SSH. For Innovator enter hostname: innovator.sdstate.edu. For Discovery enter: discovery.sdstate.edu. Check Specify username and enter first.lastname@jacks.local. Click OK and enter your password when prompted. Passwords do not display while typing — type correctly and press Enter.
Q24: How do I log in using PuTTY?
Open PuTTY. In the Host Name field enter john.doe@jacks.local@innovator.sdstate.edu for Innovator or john.doe@jacks.local@discovery.sdstate.edu for Discovery. Set Port to 22 and Connection type to SSH. Click Open, accept the security prompt, and enter your password.
Q25: How do I access Innovator via Open OnDemand?
Open your browser and go to https://ondemand.sdstate.edu. Enter your email as first.lastname@jacks.sdstate.edu and your password, then click Sign In.
Q26: How do I access Discovery via Open OnDemand?
Open your browser and go to https://mydiscovery.sdstate.edu. Enter your email as first.lastname@jacks.sdstate.edu and your password, then click Sign In.
Q27: I am from USD, what domain do I use to log in?
USD users use @usd.local. For example: ssh jane.doe@usd.local@innovator.sdstate.edu.
Q28: I am from SDSMT, what domain do I use?
SDSMT users use @SDSMT.LOCAL (case sensitive). For example: ssh jane.doe@SDSMT.LOCAL@innovator.sdstate.edu.
Q29: What domains do BOR institutions use to log in?
SDSU uses @jacks.local, USD uses @usd.local, SDSMT uses @SDSMT.LOCAL (case sensitive), DSU uses @dsu.local, and BHSU uses @blackhills.local.
Q30: I cannot log in to the cluster, what should I do?
Check the following: confirm your username format is correct, ensure you are using the correct domain for your institution, verify you are connecting to the correct hostname, ensure your password is correct, and remember passwords do not display while typing in SSH terminals. If problems persist contact SDSU.HPC@sdstate.edu or submit a request at https://help.sdstate.edu/TDClient/2744/Portal/Requests/ServiceDet?ID=53689.
Q31: I am a new user, how do I get started with the HPC cluster?
Complete the onboarding form at https://help.sdstate.edu/TDClient/2744/Portal/Requests/TicketRequests/NewForm?ID=BJcNDsievG4_&RequestorType=Service. After approval access the cluster via SSH using MobaXterm or PuTTY, or via Open OnDemand at ondemand.sdstate.edu for Innovator or mydiscovery.sdstate.edu for Discovery.
Slurm Job Submission
Q32: What is Slurm?
Slurm (Simple Linux Utility for Resource Management) is an open-source job scheduler used on both Innovator and Discovery HPC clusters at SDSU. It allocates resources on worker nodes, allows users to submit and monitor jobs, and balances job submissions across all users. All computationally intensive work must be submitted through Slurm — never run heavy jobs directly on the login node.
Q33: How do I submit a job on the cluster?
Write a job submission script with #SBATCH directives, save it as myjob.slurm, then submit using:
sbatch myjob.slurm
Q34: How do I check the status of my jobs?
squeue -u $USER # View only your jobs
squeue # View all jobs
watch -n 30 squeue -u $USER # Monitor every 30 seconds
Job states: R means running, PD means pending (waiting for resources). Press Ctrl+C to stop watch.
Q35: How do I cancel a job?
scancel <job_id> # Cancel a specific job
scancel -u $USER # Cancel all your jobs
Q36: How do I run an interactive job?
srun --pty bash # Compute node
srun --pty -p bigmem bash # Big memory node
srun -N 1 -n 40 --time=1:00:00 --partition=gpu --gres=gpu:1 --pty bash # GPU node
Interactive jobs end when you log off. Use batch jobs for long running work.
Q37: Which partition should I use for my job?
On Innovator: use compute for general jobs (default), bigmem if you need more than 256 GB memory, gpu for GPU/ML jobs with NVIDIA A100 GPUs, quickq for short test jobs under 12 hours. On Discovery: use compute for general jobs (default), gpu for standard GPU jobs with NVIDIA H100 GPUs, all-gpu to access all GPU nodes including large nodes with 4 GPUs and 1 TB RAM.
Q38: How do I submit a GPU job?
Add these lines to your job script:
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1 # Use gpu:2 for 2 GPUs
On Innovator GPU nodes have NVIDIA A100 80GB. On Discovery use gpu partition for NVIDIA H100 80GB nodes or all-gpu to include large GPU nodes.
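To confirm that Slurm actually granted your job a GPU, a common check is to run nvidia-smi inside a short interactive allocation (the 5-minute time limit here is just an illustrative value):

```shell
# Request 1 GPU on the gpu partition for a few minutes and
# print the visible GPUs; if no GPU was granted, nvidia-smi fails.
srun --partition=gpu --gres=gpu:1 --time=0:05:00 --pty nvidia-smi
```

If the output lists an A100 (Innovator) or H100 (Discovery), your --gres request is working.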
Q39: What is the maximum time limit for jobs?
On both Innovator and Discovery most partitions allow up to 14 days (14-00:00:00). The quickq partition on Innovator has a 12-hour limit. Use quickq for testing and short jobs as they typically start faster.
Q40: How many CPUs can I request per node?
All nodes on both Innovator and Discovery have 48 CPUs per node. Set #SBATCH --ntasks-per-node up to 48 for maximum CPU usage per node.
Q41: How do I view available partitions and node status?
sinfo # View all partitions and node states
sinfo -o "%P %l %m" # View partition time limits and memory
Node states: idle = all resources available, mix = partially used, alloc = fully used.
Q42: How do I submit a Slurm job array?
#SBATCH --array=0-4694%25 # Submit all jobs, run 25 at a time
Use $SLURM_ARRAY_TASK_ID in your script to process each item. Always throttle large arrays using % to avoid overwhelming the scheduler. Load modules after all #SBATCH lines. Set a realistic --time for each individual task.
Q43: Why is my job not starting?
Common reasons: requested resources are not available (nodes fully allocated), time limit exceeds partition limit, requested more CPUs or memory than available per node, or high cluster utilization. Check squeue -u $USER to see your job status and reason in parentheses. Check sinfo to see node availability. Contact SDSU.HPC@sdstate.edu if the job remains pending unusually long.
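To see the scheduler's pending reason directly, you can add the %R field to the squeue output format (a standard squeue format specifier; the other columns shown are illustrative):

```shell
# %R prints the reason a job is pending, e.g. (Resources) or (Priority);
# for running jobs it prints the allocated node list instead.
squeue -u $USER --format="%.10i %.9P %.15j %.2t %.10M %R"
```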
Q44: How do I request memory for my job?
#SBATCH --mem=32G # 32 GB total per node
#SBATCH --mem-per-cpu=8G # 8 GB per CPU core
Do not request more memory than available: 256 GB for compute, 2 TB for bigmem, 512 GB for gpu on Innovator.
Q45: Where does my job output go?
By default output goes to slurm-<jobid>.out in the directory where you submitted the job. Specify a custom file with #SBATCH --output=mylog.log. Use %j to include the job ID: #SBATCH --output=myjob_%j.log. Check this file if your job fails.
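While a job is running you can watch its log file grow. A minimal sketch (the job ID 123456 is a placeholder for your own):

```shell
tail -n 20 slurm-123456.out   # show the last 20 lines written so far
tail -f slurm-123456.out      # follow new output live; press Ctrl+C to stop
```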
Writing a Slurm Job Script — Line by Line Guide
Q46: What does #!/bin/bash mean in a Slurm script?
#!/bin/bash is called a shebang line. It must always be the very first line of your script. It tells the system to use the Bash shell to run this script. Without this line your script may not execute correctly.
Q47: What does #SBATCH --job-name do?
Gives your job a name that appears in the queue when you run squeue. Choose a short descriptive name. Example: #SBATCH --job-name=myjob. If not set, Slurm uses the script filename.
Q48: What does #SBATCH --nodes do?
Specifies how many compute nodes your job needs. Most jobs only need 1 node. Only set this higher if your software is designed to run across multiple nodes using MPI. Example: #SBATCH --nodes=1
Q49: What does #SBATCH --ntasks-per-node do?
Specifies how many CPU cores to use per node. The maximum on all nodes on both Innovator and Discovery is 48. Start with a smaller number like 4 or 8 unless your job is specifically designed to use many cores. Example: #SBATCH --ntasks-per-node=4
Q50: What does #SBATCH --output do?
Specifies the name of the log file where your job output will be saved. Example: #SBATCH --output=myjob.log. Use %j to include the job ID automatically: #SBATCH --output=myjob_%j.log
Q51: What does #SBATCH --partition do?
Tells Slurm which group of nodes to run your job on. On Innovator choose from: compute (default), bigmem, gpu, quickq. On Discovery choose from: compute (default), gpu, all-gpu. Example: #SBATCH --partition=compute
Q52: What does #SBATCH --time do?
Sets the maximum time your job is allowed to run. If exceeded the job is automatically cancelled. Format is days-hours:minutes:seconds. Examples:
#SBATCH --time=1-00:00:00 # 1 day
#SBATCH --time=8:00:00 # 8 hours
#SBATCH --time=0-01:30:00 # 1 hour 30 minutes
Set a reasonable estimate — do not always set 14 days if your job only needs a few hours.
Q53: What does #SBATCH --mem do?
Specifies the total memory your job needs per node. Example: #SBATCH --mem=32G. Do not request more than the node has: 256 GB for compute, 2 TB for bigmem, 512 GB for gpu nodes on Innovator.
Q54: What does #SBATCH --gres=gpu:1 do?
Requests 1 GPU for your job. You must include this when using the gpu or all-gpu partition, otherwise your job will not have access to any GPU. Use --gres=gpu:2 to request both GPUs on a node. Must be combined with --partition=gpu.
Q55: What does #SBATCH --array do?
Submits multiple similar jobs as a single job array. Example: #SBATCH --array=0-9 submits 10 jobs with indices 0 through 9. Use $SLURM_ARRAY_TASK_ID to reference each index. Always throttle: #SBATCH --array=0-999%20 runs 1000 jobs but only 20 at a time.
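A common pattern inside an array job script is to map each task index to one input file. A sketch, assuming your inputs are files matching data/*.csv (adjust the path and pattern to your own layout):

```shell
# Default the index to 0 so the snippet also runs outside Slurm for testing.
: "${SLURM_ARRAY_TASK_ID:=0}"
# Build a bash array of input files; task N processes the N-th file.
FILES=(data/*.csv)
INPUT="${FILES[$SLURM_ARRAY_TASK_ID]}"
echo "Task $SLURM_ARRAY_TASK_ID processing $INPUT"
```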
Q56: Can you show me a complete basic Slurm script with explanation?
#!/bin/bash # Always first line — use Bash shell
#SBATCH --job-name=myjob # Name your job
#SBATCH --nodes=1 # Request 1 compute node
#SBATCH --ntasks-per-node=4 # Use 4 CPU cores (max 48)
#SBATCH --mem=16G # Request 16 GB memory
#SBATCH --output=myjob_%j.log # Save output to log file
#SBATCH --partition=compute # Run on compute partition
#SBATCH --time=1-00:00:00 # Allow up to 1 day
module load python/3.11 # Load software (AFTER all #SBATCH lines)
python myscript.py # Your actual job command
Save as myjob.slurm and submit with: sbatch myjob.slurm
Q57: Can you show me a complete GPU Slurm script with explanation?
#!/bin/bash # Always first line
#SBATCH --job-name=gpu_job # Name your job
#SBATCH --nodes=1 # Request 1 GPU node
#SBATCH --ntasks-per-node=8 # Use 8 CPU cores
#SBATCH --mem=32G # Request 32 GB memory
#SBATCH --output=gpu_%j.log # Save output to log file
#SBATCH --partition=gpu # Use GPU partition
#SBATCH --gres=gpu:1 # Request 1 GPU (required!)
#SBATCH --time=8:00:00 # Allow up to 8 hours
module load cuda/11.8 # Load CUDA for GPU computing
module load python/3.11 # Load Python
python train_model.py # Your GPU training command
On Innovator: NVIDIA A100 80GB. On Discovery: NVIDIA H100 80GB. Save as gpu_job.slurm and submit with: sbatch gpu_job.slurm
Q58: How do I write a Slurm script for a job array?
#!/bin/bash
#SBATCH --job-name=array_job
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem=16G
#SBATCH --output=logs/%A_%a.log # %A = array job ID, %a = task index
#SBATCH --partition=compute
#SBATCH --time=2:00:00
#SBATCH --array=0-99%10 # 100 jobs, 10 at a time
module load python/3.11
echo "Processing task: $SLURM_ARRAY_TASK_ID"
python process_data.py --index $SLURM_ARRAY_TASK_ID
Create the logs folder first: mkdir -p logs. Then submit with: sbatch array_job.slurm
Q59: What are the most common mistakes when writing a Slurm script?
- Missing #!/bin/bash on the first line
- Loading modules before the #SBATCH lines — modules must always come AFTER all #SBATCH directives
- Requesting more CPUs than available — max is 48 per node
- Not including --gres=gpu:1 when using the gpu partition
- Setting --time too short so the job gets cancelled before finishing
- Using the wrong partition name — on Innovator use compute/bigmem/gpu/quickq, on Discovery use compute/gpu/all-gpu
- Requesting more memory than the node has available
- Copying scripts from email or documents where quote characters get changed and cause syntax errors
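Many of these mistakes can be caught before the job runs: sbatch has a --test-only flag that validates the script and resource request without actually submitting it:

```shell
# Validates the script and prints the estimated start time (or an error)
# without adding the job to the queue.
sbatch --test-only myjob.slurm
```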
Software Modules
Q60: How do I find available software on the cluster?
module avail # List all available modules
module avail python # Search for modules matching a name
Q61: How do I load a software module?
module load python/3.11 # Example: load a module by name and version
Load modules after all #SBATCH lines in your job scripts.
Q62: How do I check which modules are currently loaded?
module list
Q63: How do I unload a module?
module unload python/3.11 # Unload a specific module
module purge # Unload all loaded modules at once
Q64: The software I need is not available as a module, what do I do?
Submit a software request at https://help.sdstate.edu/TDClient/2744/Portal/Requests/ServiceDet?ID=53689. The RCi team will work with you to get the application installed on the cluster.
Open OnDemand
Q65: What is Open OnDemand?
Open OnDemand is a browser-based graphical interface for accessing HPC clusters without needing an SSH client. Through OnDemand you can open a web terminal, submit and monitor Slurm jobs, browse and manage files, launch interactive applications like Jupyter Notebooks and RStudio, and monitor cluster resources.
Q66: What is the URL for Innovator Open OnDemand?
Innovator Open OnDemand is accessible at https://ondemand.sdstate.edu. Log in with your email as first.lastname@jacks.sdstate.edu and your SDSU password.
Q67: What is the URL for Discovery Open OnDemand?
Discovery Open OnDemand is accessible at https://mydiscovery.sdstate.edu. Log in with your email as first.lastname@jacks.sdstate.edu and your SDSU password.
Q68: What applications can I launch through Open OnDemand?
Through Open OnDemand you can launch Jupyter Notebooks, RStudio sessions, and other web-based interactive applications. You can also access a web-based terminal, submit batch jobs, and manage your files directly in the browser.
Q69: Can I submit Slurm jobs through Open OnDemand?
Yes. Open OnDemand provides a job submission interface where you can submit, monitor, and manage Slurm jobs without using the command line. You can also view job status and cancel jobs through the web interface.
File Transfer
Q70: How do I transfer files to the cluster?
You can transfer files using SCP:
scp localfile.txt john.doe@jacks.local@innovator.sdstate.edu:/home/jacks.local/john.doe/
You can also use Globus for large data transfers, or the file manager in Open OnDemand for smaller files.
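For directories or transfers that may be interrupted, rsync over SSH is a common alternative to SCP. A sketch using the same username and path conventions as the SCP example above:

```shell
# -a preserves permissions/timestamps and recurses into directories,
# -v is verbose, -z compresses data in transit. Re-running the same
# command resumes by copying only files that changed.
rsync -avz mydata/ \
  john.doe@jacks.local@innovator.sdstate.edu:/scratch/jacks.local/john.doe/mydata/
```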
Q71: How do I use Globus to transfer data?
Create a Globus account at globus.org, install Globus Connect Personal on your local computer, then use the Globus web interface to transfer files between your computer and the cluster. Globus is recommended for large dataset transfers as it handles interruptions automatically.
Q72: Where should I store large datasets on the cluster?
Store large datasets in your scratch directory at /scratch/jacks.local/username. The scratch directory has no quota but is not backed up and a data expiration policy will be applied. Your home directory has a 100 GB quota and is intended for important persistent files.
Support and Contact
Q73: How do I contact HPC support?
- Email: SDSU.HPC@sdstate.edu
- Phone: 605-688-6776
- Support form: https://help.sdstate.edu/TDClient/2744/Portal/Requests/ServiceDet?ID=53689
Q74: My job failed, how do I get help?
First check your job output log file (slurm-<jobid>.out or the file specified with --output). Look for error messages. Common issues include wrong module names, incorrect file paths, or insufficient memory requests. If you cannot resolve it contact SDSU.HPC@sdstate.edu with your job ID and the error message.
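For jobs that have already finished, the sacct accounting command shows the final state, exit code, and memory high-water mark, which often pinpoints the failure (the job ID 123456 is a placeholder):

```shell
# ExitCode 0:0 means success; a State of OUT_OF_MEMORY or a MaxRSS
# near your --mem request suggests the job needed more memory.
sacct -j 123456 --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS
```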
Q75: How do I check my storage usage?
df -h /home # Check home directory usage
df -h /scratch # Check scratch usage
Contact SDSU.HPC@sdstate.edu if you need a quota increase or additional scratch storage.
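Note that df reports usage for the whole shared filesystem, not just your files. To total only what you own, du works on any directory:

```shell
# -s gives one summary line, -h prints human-readable sizes.
du -sh "$HOME"   # total size of everything under your home directory
# On the cluster the same works for scratch:
#   du -sh /scratch/jacks.local/$USER
```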