Innovator Overview

Summary

Innovator comprises 46 compute nodes, 4 big memory nodes, 14 GPU nodes, and 2 PB of usable storage.

Body

Overview

Innovator runs the Rocky 9 Linux operating system and provides 3,072 CPU cores. The system consists of 46 compute nodes, 4 big memory nodes, 14 GPU nodes, and a 2 PB high-performance GPFS filesystem. The cluster uses 100 Gbps InfiniBand for application processing and science data transfers, and a 1 Gbps network for cluster management. Jobs are submitted to the cluster worker nodes using Slurm, a job scheduling system.

Hardware Specifications

Hardware specifications for each node type on Innovator vary and are listed below:

Compute Nodes

  • Quantity: (46) Dell PowerEdge R650
  • CPUs: (2) Intel Xeon Gold 6342 CPU @ 2.80GHz (48 Cores)
  • RAM: 256 GB

Big Memory Nodes

  • Quantity: (4) Dell PowerEdge R750
  • CPUs: (2) Intel Xeon Gold 6342 CPU @ 2.80GHz (48 Cores)
  • RAM: 2 TB (expandable to 4TB)

GPU Nodes

  • Quantity: (14) Dell PowerEdge R750
  • CPUs: (2) Intel Xeon Gold 6342 CPU @ 2.80GHz (48 Cores)
  • RAM: 512 GB
  • Video Cards: (2) NVIDIA A100 80GB Cards per Node

Storage Services

Innovator is attached to a 3 PB high-performance Arcastream Pixstor GPFS parallel filesystem. This includes 2 PB of usable storage for research, 512 TB of flash tier storage to provide faster read/write speeds, and an additional 512 TB used for RCi software installations. Each user has a quota of 100 GB in their home directory; for applications requiring a larger amount of storage, a scratch partition is provided upon request. No user quotas are implemented on the scratch partition, but a data expiration policy will eventually be applied.
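
If you would like a rough idea of how much of your home directory quota you are using, the standard Linux du command works once you are logged in. This is only an estimate, not an official quota report, and the output below is illustrative:

[john.doe@jacks.local@cllogin002 ~]$ du -sh $HOME

42G     /home/jacks.local/john.doe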

Requesting Access to Innovator

The first step in using RCi's High-Performance Computing (HPC) systems is to complete the onboarding form. Once the form is submitted, the RCi team will reach out to you and schedule a quick meeting, if needed, to get you access.

Logging into Innovator

Innovator is a system-wide approach to HPC, meaning that all BOR institutions have access to this resource. To facilitate that, login instructions for each user type are given below.

Faculty/Staff

Faculty and staff can log in over SSH by following these steps:

1. Username: Your username is formed by combining your first name, a period, and your last name, followed by your institution's domain. For instance, John Doe would use john.doe@jacks.local, followed by @innovator.sdstate.edu when connecting:

ssh john.doe@jacks.local@innovator.sdstate.edu

2. Password: Enter your password when prompted.

Example of a successful login command:

C:\Users\john.doe> ssh john.doe@jacks.local@innovator.sdstate.edu

john.doe@jacks.local@innovator.sdstate.edu's password:
Last login: Tue Dec 19 13:45:32 2023 from 137.216.48.170
 

[john.doe@jacks.local@cllogin002 ~]$ pwd
/home/jacks.local/john.doe

Note for Remote Users: If you are accessing from outside the SDSU campus, please replace @jacks.local with your respective institution's domain, such as @usd.local or @SDSMT.LOCAL (this is case sensitive).
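
For example, a hypothetical USD user named Jane Doe would connect with the following command (the name here is purely illustrative):

ssh jane.doe@usd.local@innovator.sdstate.edu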

Students

Students should follow the same procedure using their student account credentials. For example, a student named John Doe would log in as jdoe@jacks.local. See the example below.

ssh jdoe@jacks.local@innovator.sdstate.edu

If you encounter any technical issues or have any questions, please put in a request here: https://help.sdstate.edu/TDClient/2744/Portal/Requests/ServiceDet?ID=53689

Login Example (PuTTY)

PuTTY is a lightweight and convenient terminal emulator with built-in SSH support. The figure below shows a basic PuTTY window:

[Figure: PuTTY configuration window]

In the Host Name field, enter innovator.sdstate.edu, or, if you prefer, enter the full first.last@jacks.local@innovator.sdstate.edu, and then click Open. Whichever way you choose, make sure your username follows the convention in the section above.

If the host key is not already cached, you may receive a prompt like the one illustrated below:

[Figure: PuTTY host key verification prompt]

Select Accept to trust the host key and continue, or Cancel to abort the connection.

PuTTY will then prompt you for your username and/or password, depending on the login method, and after successful authentication you will be logged in.

User Directories

Once logged in to Innovator, your default home directory depends on what institution you belong to. For example, users at SDSU will find their home directory at /home/jacks.local/username, whereas someone at SDSMT would land in /home/SDSMT.LOCAL/username (this is case sensitive). Your scratch directory, should you request one, follows the same layout under /scratch. For example, at SDSU:

[john.doe@jacks.local@cllogin002 ~]$ pwd

/home/jacks.local/john.doe

[john.doe@jacks.local@cllogin002 ~]$ cd /scratch/jacks.local/john.doe

[john.doe@jacks.local@cllogin002 john.doe]$ pwd

/scratch/jacks.local/john.doe

SLURM (Simple Linux Utility for Resource Management)

SLURM is an open-source job scheduler and workload manager used primarily in high-performance computing (HPC) environments. It allows users to submit, schedule, and manage jobs on clusters or supercomputers. Simply put, this is your go-to resource for scheduling jobs on the cluster. There are several ways to submit jobs, described below.

Node Types (Partitions)

A partition, also known as a queue, is a subset of the cluster nodes that share the same characteristics. There are four partitions on Innovator: compute, bigmem, gpu, and quickq. Users can specify which partition (node type) to run a job on. If no partition is specified, the job will run on the compute partition, which contains compute nodes only.
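
As a quick preview (both commands are covered in detail later in this article), a partition can be named directly when you submit work; myjob.slurm below is a placeholder for your own submission script:

srun --partition=bigmem --pty bash

sbatch --partition=gpu myjob.slurm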

Node Definitions

Hardware specifications for each node type are listed in the Hardware Specifications section above.

Viewing Node Usage and Partitions

To view the usage of the partitions, you can use a simple SLURM command called sinfo (show information) in your terminal window.  For example:

[john.doe@jacks.local@cllogin002 ~]$ sinfo

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
bigmem       up 14-00:00:0      4   idle bigmem[001-004]
compute      up 14-00:00:0     41   idle node[001-041]
gpu          up 14-00:00:0      1   mix gpu001
gpu          up 14-00:00:0     13   idle gpu[002-014]
quickq       up   12:00:00     46   idle node[001-046]

This shows the partition/queue of the nodes on the cluster, their availability (up is good), the time limit of those partitions (the default for most is 14 days), the number of nodes in each partition, and the state those nodes are in (idle means all resources on the node are available, mix means a job is running on the node but not using all of its resources, and alloc means no resources are left on the node). The sinfo command gives a quick overview of the state of the cluster.
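
If you are only interested in a single partition, sinfo accepts a -p option to limit the output; the gpu partition below is just an example, reusing the sample output above:

[john.doe@jacks.local@cllogin002 ~]$ sinfo -p gpu

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu          up 14-00:00:0      1   mix gpu001
gpu          up 14-00:00:0     13   idle gpu[002-014]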

For a more in-depth look at cluster usage, you can use the squeue command, which lists the jobs currently on the cluster, such as the following:

[john.doe@jacks.local@cllogin002 ~]$ squeue

  JOBID PARTITION     NAME     USER      ST       TIME  NODES NODELIST(REASON)
8811139   compute     merge   username  PD       0:00      1 (Dependency)
8589786       gpu     Grom    username  R 33-05:02:01      1 gpu004
8580727       gpu     P1      username  R 36-16:11:52      1 gpu001
8580726       gpu     GP2     username  R 36-23:38:52      1 gpu002
8811081   compute     bash    username  R     2:56:26      1 node009
8811080   compute     bash    username  R     4:05:09      1 node003
8811259   compute     bash    username  R       40:11      1 node012
8811258   compute     bash    username  R     1:03:39      1 node011
8811075   compute     bash    username  R  1-04:27:56      1 node001
8811074    bigmem     Gamma   username  R  1-21:58:19      1 big-mem003
...

This shows much more information, including job ID numbers, the partitions the jobs run on, the job name, and the user. ST refers to the job state (R means the job is running, PD means it is pending, waiting for resources to free up), followed by the run time and the nodes the jobs are on.

A simple way to identify your own running jobs is to type the following command in your terminal:

[john.doe@jacks.local@cllogin002 ~]$ squeue -u $USER

JOBID    PARTITION     NAME     USER      ST       TIME  NODES   NODELIST(REASON)
8811378    quickq   short400    john.doe  R       0:07      1    node057

Some jobs may fail early, or perhaps you want to make sure they queue properly. To keep an eye on them, Linux has a watch command, and you would type:

[john.doe@jacks.local@cllogin002 ~]$ watch -n 30 squeue -u $USER

Every 30.0s: squeue -u john.doe@jacks.local                                  cllogin002: Tue Jan 9 08:39:38 2024

           JOBID    PARTITION     NAME          USER      ST       TIME      NODES NODELIST(REASON)
           8811378  quickq        short400      john.doe   R       4:10      1     node057

This example refreshes the list of your submitted jobs every 30 seconds (-n 30); press Ctrl+C to exit the display.

A short list of common SLURM commands can be referenced here: https://help.sdstate.edu/TDClient/2744/Portal/KB/ArticleDet?ID=135416
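
One closely related command you will likely need when managing your jobs is scancel, which stops a job you no longer want. The job ID below is taken from the earlier squeue example, and scancel -u $USER cancels all of your jobs at once, so use it with care:

[john.doe@jacks.local@cllogin002 ~]$ scancel 8811378

[john.doe@jacks.local@cllogin002 ~]$ scancel -u $USER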

Modules

Modules are simply the applications you use to process your research on the cluster. There are required modules that load automatically when you sign in, such as SLURM, GCC, and other applications needed to get started on the cluster. For example, after logging in you can type the module list command:

[john.doe@jacks.local@cllogin002 ~]$ module list

Currently Loaded Modules:
  1) shared   2) slurm/slurm/21.08.8   3) StdEnv   4) gcc/11.2.0

This shows the modules that are loaded by default. In addition, to find the available modules (applications) already built on the cluster, you can use the module avail command. The full output is not shown here because it is quite long.
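
If you are looking for a specific application, you can pass part of its name to module avail to narrow the listing; what is returned depends on which modules are installed at the time:

[john.doe@jacks.local@cllogin002 ~]$ module avail python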

To load an application (module), type module load <module name>. As an example, let us pretend that I need to load a different version of Python for my code to work. To do so, I would do the following:

First, I check which version I am currently running:

[john.doe@jacks.local@cllogin002 ~]$ python --version

Python 3.9.14

Now my application needs a newer (or older) version of Python, so I use a module to load it (here I would type module avail first to list the available versions):

[john.doe@jacks.local@cllogin002 ~]$ module load python/3.11

[john.doe@jacks.local@cllogin002 ~]$ python --version

Python 3.11.5

Modules allow for easy application usage on the cluster. Whatever application and version you need will be added to your PATH, so you can have multiple modules loaded and run different versions of applications.

If you loaded the wrong module by mistake, simply type module unload <module name> and that command will remove it from your PATH.
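
Continuing the Python example above, unloading the module and checking what remains might look like this (a sketch, assuming python/3.11 is the only extra module loaded):

[john.doe@jacks.local@cllogin002 ~]$ module unload python/3.11

[john.doe@jacks.local@cllogin002 ~]$ module list

Currently Loaded Modules:
  1) shared   2) slurm/slurm/21.08.8   3) StdEnv   4) gcc/11.2.0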

If you don't see your application listed or have trouble with applications, please complete the following request and we will get to work with you on whatever issues or questions you have: https://help.sdstate.edu/TDClient/2744/Portal/Requests/ServiceDet?ID=53689

Running a job on Innovator

There are two main types of jobs that you can run on Innovator: interactive jobs and batch jobs.

Interactive jobs involve the user requesting a node on the cluster via Slurm, then performing jobs by directly typing commands on the command line. Interactive jobs will end if the user logs off of the cluster.

The other job type, batch jobs, are jobs that are designed to run one or more scripts. The batch job is submitted to the scheduler, which runs the job on the selected nodes using a job submission file (sbatch file). Options such as node type, number of nodes, number of threads, etc. are specified in the sbatch file. These jobs will continue to run if the user logs off of the cluster. Instead of displaying output in your terminal, the output will go to a log file instead.

Running an Interactive Job

Interactive jobs on the cluster can be started with the Slurm command srun. To use one node in the default partition, which consists of compute nodes only, the following command can be used:

[john.doe@jacks.local@cllogin002 ~]$ srun --pty bash

[john.doe@jacks.local@node040 ~]$

To specify a different node type, such as big memory or GPU, use the srun examples below:

[john.doe@jacks.local@cllogin002 ~]$ srun --pty -p bigmem bash

[john.doe@jacks.local@big-mem003 ~]$

[john.doe@jacks.local@cllogin002 ~]$ srun -N 1 -n 40 --time=1:00:00 --partition=gpu --gres=gpu:1 --pty bash

[john.doe@jacks.local@gpu001 ~]$

You can see that once you execute an srun command, it will place you on a free node with the resources you requested.  As with all jobs on the cluster, srun is limited by available resources, so you may have to wait until a node opens up.
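
When you are finished with an interactive session, type exit to release the node and return to the login node (the node name below matches the earlier example):

[john.doe@jacks.local@node040 ~]$ exit

[john.doe@jacks.local@cllogin002 ~]$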

Running a Batch Job

On Innovator, the way to create a batch job is to first write a job submission script. A job submission script is a shell script that contains comments that are prefixed with #SBATCH. These comments are understood by Slurm as parameters requesting resources and other submission options.

We have a list of the most commonly used parameters on our Slurm commands page. This is an example of a basic, single-threaded job submission script for use on the Innovator cluster:

#!/bin/bash
#SBATCH --job-name=test        #Job Name
#SBATCH --nodes=1              #Number of nodes
#SBATCH --ntasks-per-node=4    #CPUs per node - max 48 for all nodes
#SBATCH --output=log.log       #What your log file will be called
#SBATCH --partition=compute    #Type of node requested: compute, quickq, bigmem, or gpu
#SBATCH --time=1-00:00:00      #Time limit days-hrs:min:sec

module load <module name>

##Add any additional modules you need above.
##Your job instructions will go below this line. 

You would then save this file as something.slurm and submit to the cluster using:

[john.doe@jacks.local@cllogin002 ~]$ sbatch something.slurm

Submitted batch job 334
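
After submitting, you can check on the job with squeue and, once it has started, view its output in the log file named in the script (log.log in this example); the job details shown below are illustrative:

[john.doe@jacks.local@cllogin002 ~]$ squeue -u $USER

  JOBID PARTITION     NAME     USER      ST       TIME  NODES NODELIST(REASON)
    334   compute     test     john.doe   R       0:15      1 node001

[john.doe@jacks.local@cllogin002 ~]$ cat log.log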

If you have any questions, please feel free to reach us at: https://help.sdstate.edu/TDClient/2744/Portal/Requests/ServiceDet?ID=53689 or by calling the SDSU Support Desk at 605-688-6776.

Open OnDemand

Open OnDemand is an open-source web application that provides a user-friendly interface for accessing and managing HPC resources. Through OnDemand, you can submit jobs, monitor the status of your jobs, get access to a shell on Innovator, and launch web-based applications like Jupyter Notebooks and RStudio sessions. OnDemand for Innovator can be accessed here: ondemand.sdstate.edu. You can log in with the email address and password associated with your Innovator account.

Questions or Problems

You can reach us anytime by filling out the following form: https://help.sdstate.edu/TDClient/2744/Portal/Requests/ServiceDet?ID=53689 or by email at SDSU.HPC@sdstate.edu

Thank you so much for your support,

SDSU RCi
