Overview
Innovator runs on the Rocky Linux 9 operating system and provides 3,072 CPU cores. The system consists of 46 compute nodes, 4 big memory nodes, 14 GPU nodes, and a high-performance GPFS filesystem with 2 PB of usable storage. The cluster uses 100 Gbps InfiniBand for application data and science data transfers, and a 1 Gbps network for cluster management. Jobs are submitted to the cluster worker nodes using Slurm, a job scheduling system.
Hardware Specifications
Hardware specifications for each node type on Innovator vary and are listed below:
Compute Nodes
- Quantity: (46) Dell PowerEdge R650
- CPUs: (2) Intel Xeon Gold 6342 CPU @ 2.80GHz (48 Cores)
- RAM: 256 GB
Big Memory Nodes
- Quantity: (4) Dell PowerEdge R750
- CPUs: (2) Intel Xeon Gold 6342 CPU @ 2.80GHz (48 Cores)
- RAM: 2 TB (expandable to 4 TB)
GPU Nodes
- Quantity: (14) Dell PowerEdge R750
- CPUs: (2) Intel Xeon Gold 6342 CPU @ 2.80GHz (48 Cores)
- RAM: 512 GB
- Video Cards: (2) NVIDIA A100 80 GB Cards per Node
Storage Services
Innovator is attached to a 3 PB high-performance Arcastream Pixstor GPFS parallel filesystem. This includes 2 PB of usable storage for research, 512 TB of flash tier storage to provide faster read/write speeds, and an additional 512 TB used for RCi software installations. Each user has a 100 GB quota in their home directory; for applications requiring a larger amount of storage, a scratch partition is provided upon request. No user quotas are enforced on the scratch partition, but a data expiration policy will eventually be applied.
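If you would like a rough idea of how much of the 100 GB home quota your files are using, standard Linux tools work from the login node; the exact GPFS quota-reporting command, if one is provided, may differ. A minimal check (which can take a while if you have many small files) is:
[john.doe@jacks.local@cllogin002 ~]$ du -sh $HOME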
Requesting Access to Innovator
The initial step for utilizing RCi's High-Performance Computing (HPC) systems is completing the onboarding form. Once the form is submitted, the RCi team will reach out to you and, if needed, schedule a quick meeting to get you access.
Logging into Innovator
Innovator is a system-wide HPC resource, meaning that all BOR institutions have access to it. Login procedures for each user type are described below.
Faculty/Staff
Welcome to the secure login portal for faculty and staff. To access your account, please follow these simple steps:
1. Username: Your unique username is formed by combining your first name, a period, and your last name, followed by our domain. For instance, John Doe would use john.doe@jacks.local, followed by @innovator.sdstate.edu:
ssh john.doe@jacks.local@innovator.sdstate.edu
2. Password: Enter your password when prompted.
Example of a successful login command:
C:\Users\john.doe> ssh john.doe@jacks.local@innovator.sdstate.edu
john.doe@jacks.local@innovator.sdstate.edu's password:
Last login: Tue Dec 19 13:45:32 2023 from 137.216.48.170
[john.doe@jacks.local@cllogin002 ~]$ pwd
/home/jacks.local/john.doe
Note for Users at Other Institutions: If you are not an SDSU user, replace @jacks.local with your respective institution's domain, such as @usd.local or @SDSMT.LOCAL (this is case sensitive).
Students
Students should follow the same procedure using their student account credentials. For example, a student named John Doe would log in as jdoe@jacks.local. See the example below.
ssh jdoe@jacks.local@innovator.sdstate.edu
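To avoid typing the long username and hostname each time, you can add an entry to the ~/.ssh/config file on your own computer. This is a minimal sketch assuming a standard OpenSSH client; the host alias innovator is just an example name:
# ~/.ssh/config on your local machine (not on the cluster)
Host innovator
    HostName innovator.sdstate.edu
    User jdoe@jacks.local
With this in place, ssh innovator is equivalent to the full command above. If your SSH client has trouble with the @ in the User value, keep using the full command instead.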
If you encounter any technical issues or have any questions, please put in a request here: https://help.sdstate.edu/TDClient/2744/Portal/Requests/ServiceDet?ID=53689
Login Example (PuTTY)
PuTTY is a lightweight, convenient terminal emulator and SSH client. The figure below shows a basic PuTTY window:
In the Host Name field, enter innovator.sdstate.edu, or, if you prefer, enter the full first.last@jacks.local@innovator.sdstate.edu and then click Open. Whichever way you choose, make sure your user name follows the convention in the section above.
If the host key is not yet cached, you may receive a prompt, as illustrated:
Select Accept to trust the host key and continue, or Cancel to abort the connection.
PuTTY will prompt you for your user name and/or password, depending on the login method, and after a successful authentication you will be logged in.
User Directories
Once logged in to Innovator, your default home directory depends on which institution you belong to. For example, a user at SDSU would find their home directory at /home/jacks.local/username, whereas someone at SDSMT would land in /home/SDSMT.LOCAL/username (this is case sensitive). Your scratch directory, should you request one, follows the same layout, but under /scratch instead of /home. For example, at SDSU:
[john.doe@jacks.local@cllogin002 ~]$ pwd
/home/jacks.local/john.doe
[john.doe@jacks.local@cllogin002 ~]$ cd /scratch/jacks.local/john.doe
[john.doe@jacks.local@cllogin002 john.doe]$ pwd
/scratch/jacks.local/john.doe
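Once a scratch directory has been created for you, you can move large working data out of your home directory with standard tools. For example (the directory name big_dataset is just a placeholder, and the path follows the SDSU layout shown above; adjust the domain for your institution):
[john.doe@jacks.local@cllogin002 ~]$ cp -r ~/big_dataset /scratch/jacks.local/john.doe/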
SLURM (Simple Linux Utility for Resource Management)
SLURM is an open-source job scheduler and workload manager primarily used in high-performance computing (HPC) environments. It allows users to submit, schedule, and manage jobs on clusters or supercomputers. Simply put, it is your go-to tool for scheduling jobs on the cluster. There are several ways of submitting jobs, described below.
Node Types (Partitions)
A partition, also known as a queue, is a subset of the cluster nodes that share the same characteristics. There are four partitions on Innovator: compute, bigmem, gpu, and quickq. Users can specify which partition (node type) to run a job on. If no partition is specified, the job will run on the compute partition, which contains compute nodes only.
Node Definitions
The hardware specifications for each node type are listed in the Hardware Specifications section above.
Viewing Node Usage and Partitions
To view the usage of the partitions, you can use a simple SLURM command called sinfo (show information) in your terminal window. For example:
[john.doe@jacks.local@cllogin002 ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
bigmem up 14-00:00:0 4 idle bigmem[001-004]
compute up 14-00:00:0 41 idle node[001-041]
gpu up 14-00:00:0 1 mix gpu001
gpu up 14-00:00:0 13 idle gpu[002-014]
quickq up 12:00:00 46 idle node[001-046]
This shows the partition/queue of the nodes on the cluster, their availability (up is good), the time limit of each partition (the default for most is 14 days), the number of nodes in the partition, and the state those nodes are in (idle means all resources on the node are available, mix means a job is running on the node but not using all of its resources, and alloc means no resources are left on the node). The sinfo command gives a quick overview of the state of the cluster.
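If you want per-node detail rather than the partition summary, sinfo accepts standard Slurm options as well. For example, a node-oriented long listing (output omitted here because it is lengthy) is:
[john.doe@jacks.local@cllogin002 ~]$ sinfo -N -l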
For a more in-depth look at cluster usage, you can use the squeue command, which lists the jobs currently in the queue, such as the following:
[john.doe@jacks.local@cllogin002 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8811139 compute merge username PD 0:00 1 (Dependency)
8589786 gpu Grom username R 33-05:02:01 1 gpu004
8580727 gpu P1 username R 36-16:11:52 1 gpu001
8580726 gpu GP2 username R 36-23:38:52 1 gpu002
8811081 compute bash username R 2:56:26 1 node009
8811080 compute bash username R 4:05:09 1 node003
8811259 compute bash username R 40:11 1 node012
8811258 compute bash username R 1:03:39 1 node011
8811075 compute bash username R 1-04:27:56 1 node001
8811074 bigmem Gamma username R 1-21:58:19 1 big-mem003
...
This shows much more information, including JobID numbers, the partitions the jobs are running on, and the job name and user. ST refers to the job state (R means running, PD means your job is waiting for resources to free up), followed by run times and the nodes the jobs are on.
To identify your own running jobs, simply type the following command in your terminal:
[john.doe@jacks.local@cllogin002 ~]$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8811378 quickq short400 john.doe R 0:07 1 node057
Some jobs may fail early, or you may want to make sure they queue properly. To monitor this, Linux provides the watch command:
[john.doe@jacks.local@cllogin002 ~]$ watch -n 30 squeue -u $USER
Every 30.0s: squeue -u john.doe@jacks.local cllogin002: Tue Jan 9 08:39:38 2024
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8811378 quickq short400 john.doe R 4:10 1 node057
This example updates the display of your submitted jobs every 30 seconds (-n 30); press Ctrl+C to exit the display.
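If you need to stop one of your jobs, the standard Slurm command scancel takes the JobID shown by squeue. Using the JobID from the example above:
[john.doe@jacks.local@cllogin002 ~]$ scancel 8811378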
A short list of common SLURM commands can be referenced here: https://help.sdstate.edu/TDClient/2744/Portal/KB/ArticleDet?ID=135416
Modules
Modules are simply the applications you use to process your research on the cluster. There are required modules that load when you sign in, such as SLURM, GCC, and other applications for getting started on the cluster. For example, when you log in you can type the module list command:
[john.doe@jacks.local@cllogin002 ~]$ module list
Currently Loaded Modules:
1) shared 2) slurm/slurm/21.08.8 3) StdEnv 4) gcc/11.2.0
This shows the modules that are loaded by default. In addition, to find the available modules (applications) already built on the cluster, you can use the module avail command. The output is not shown here because the list is quite long.
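To narrow the listing down, module avail also accepts a name to filter on. For example, to list only the Python modules (the exact module names and versions available on Innovator may differ):
[john.doe@jacks.local@cllogin002 ~]$ module avail python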
To load an application (module), type module load <module name>. As an example, suppose you need to load a different version of Python for your code to work. To do so, you would do the following:
First, check which version you are currently running:
[john.doe@jacks.local@cllogin002 ~]$ python --version
Python 3.9.14
Since the code needs a newer (or older) version of Python, load a different version with a module (type module avail to list the available modules):
[john.doe@jacks.local@cllogin002 ~]$ module load python/3.11
[john.doe@jacks.local@cllogin002 ~]$ python --version
Python 3.11.5
Modules allow for easy application usage on the cluster. Whatever application and version you need will be added to your PATH, so you can have multiple modules loaded and run different versions of applications.
If you loaded the wrong module, simply type module unload <module name> and it will be removed from your PATH.
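If you would rather start from a clean slate, module purge (a standard Environment Modules/Lmod command) unloads everything at once, after which you can reload just what you need. Note that this also removes the default modules listed earlier, so reload them or log in again if anything misbehaves:
[john.doe@jacks.local@cllogin002 ~]$ module purge
[john.doe@jacks.local@cllogin002 ~]$ module load gcc/11.2.0 python/3.11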
If you don't see your application listed or have trouble with applications, please complete the following request form and we will work with you on whatever issues or questions you have: https://help.sdstate.edu/TDClient/2744/Portal/Requests/ServiceDet?ID=53689
Running a job on Innovator
There are two main types of jobs that you can run on Innovator: interactive jobs and batch jobs.
Interactive jobs involve the user requesting a node on the cluster via Slurm, then performing jobs by directly typing commands on the command line. Interactive jobs will end if the user logs off of the cluster.
The other job type, batch jobs, are designed to run one or more scripts. A batch job is submitted to the scheduler, which runs the job on the selected nodes using a job submission file (sbatch file). Options such as node type, number of nodes, number of threads, etc. are specified in the sbatch file. These jobs will continue to run if the user logs off of the cluster. Instead of displaying output in your terminal, the output goes to a log file.
Running an Interactive Job
Interactive jobs on the cluster can be started with the Slurm command srun. To use one node in the default partition, which consists of compute nodes only, run the following command:
[john.doe@jacks.local@cllogin002 ~]$ srun --pty bash
[john.doe@jacks.local@node040 ~]$
To specify a different node type, such as big memory or GPU, use the srun examples below:
[john.doe@jacks.local@cllogin002 ~]$ srun --pty -p bigmem bash
[john.doe@jacks.local@big-mem003 ~]$
[john.doe@jacks.local@cllogin002 ~]$ srun -N 1 -n 40 --time=1:00:00 --partition=gpu --gres=gpu:1 --pty bash
[john.doe@jacks.local@gpu001 ~]$
You can see that once you execute an srun command, it will place you on a free node with the resources you requested. As with all jobs on the cluster, srun is limited by available resources, so you may have to wait until a node opens up.
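You can also be explicit about how many CPUs, how much memory, and how much time you need when requesting an interactive session. These are standard srun options, and the values below are only an example:
[john.doe@jacks.local@cllogin002 ~]$ srun -N 1 -n 8 --mem=32G --time=2:00:00 --pty bash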
Running a Batch Job
On Innovator, the way to create a batch job is to first write a job submission script. A job submission script is a shell script containing comment lines prefixed with #SBATCH, which Slurm interprets as parameters requesting resources and other submission options.
We have a list of the most commonly used parameters on our Slurm commands page. This is an example of a basic job submission script for use on the Innovator cluster:
#!/bin/bash
#SBATCH --job-name=test #Job Name
#SBATCH --nodes=1 #Number of nodes
#SBATCH --ntasks-per-node=4 #CPUs per node - max 48 for all nodes
#SBATCH --output=log.log #What your log file will be called
#SBATCH --partition=compute #Type of node requested: compute, quickq, bigmem, or gpu
#SBATCH --time=1-00:00:00 #Time limit days-hrs:min:sec
module load <module name>
##Add any additional modules you need above.
##Your job instructions will go below this line.
You would then save this file as something.slurm and submit it to the cluster using:
[john.doe@jacks.local@cllogin002 ~]$ sbatch something.slurm
Submitted batch job 334
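For a job that needs a GPU, the same pattern applies. Below is a minimal sketch of a GPU submission script that combines the gpu partition with the --gres option shown in the interactive example above; the job name, CPU count, time limit, and module name are placeholders you would adjust for your own work:
#!/bin/bash
#SBATCH --job-name=gpu-test #Job Name
#SBATCH --nodes=1 #Number of nodes
#SBATCH --ntasks-per-node=8 #CPUs per node - max 48 for all nodes
#SBATCH --gres=gpu:1 #Request one A100 GPU
#SBATCH --output=gpu-log.log #What your log file will be called
#SBATCH --partition=gpu #Run on the GPU partition
#SBATCH --time=0-04:00:00 #Time limit days-hrs:min:sec
module load <module name>
##Your GPU job instructions will go below this line.
Save it and submit it with sbatch in the same way as the example above.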
If you have any questions, please feel free to reach us at: https://help.sdstate.edu/TDClient/2744/Portal/Requests/ServiceDet?ID=53689 or by calling the SDSU Support Desk at 605-688-6776.
Open OnDemand
Open OnDemand is an open-source web application that provides a user-friendly interface for accessing and managing HPC resources. Through OnDemand, you can submit jobs, monitor the status of your jobs, get access to a shell on Innovator, and launch web-based applications like Jupyter Notebooks and RStudio sessions. OnDemand for Innovator can be accessed here: ondemand.sdstate.edu. You can log in with the email address and password associated with your Innovator account.
Questions or Problems
You can reach us anytime by filling out the following form: https://help.sdstate.edu/TDClient/2744/Portal/Requests/ServiceDet?ID=53689 or by email at SDSU.HPC@sdstate.edu
Thank you so much for your support,
SDSU RCi