Resource Efficiency on HPC Systems and You

Testing Job Efficiency on HPC Systems

HPC clusters operate differently from single-node servers.  Both will be covered below, keeping in mind that there is overlap in some areas because they share the same tools.

Investigating Job Efficiency Via the Command Line

So you submitted your first job on the cluster using Slurm's sbatch command, congratulations!  Now you can check, in real time, how efficient your resource requests are.  Nodes are simply the servers that your job is running on; think of an HPC cluster as a large conglomeration of servers, attached via high-speed networks to do your bidding.  The simplest way to get real-time monitoring is outlined below.  If your job has already completed, you can skip to the next section, called Using seff to Check Job Efficiency After Job Completion (HPC Clusters).

Step 1: Accessing Nodes via SSH

Determining the Node(s) your Job is using: Here you can invoke the squeue command, or better yet, filter squeue by your username.

[user@cllogin002]$ squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           3197426   compute edge_exa     user  R       0:16      1 node003

Here you can see, under the NODELIST, that this job is running on node003. 

Accessing a Node:  Now that you know which node(s) your job is running on, you can use the following command to ssh into one of them (note: you may get a warning the first time; this is OK):

[user@cllogin002]$ ssh node003
Warning: Permanently added 'node003' (ED25519) to the list of known hosts.
[user@node003 ~]$

This puts you on the specific node, allowing you to continue to the next step.  If you get an error, please contact us at sdsu.hpc@sdstate.edu and we can assist. 
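
If you prefer a one-liner, the two steps above can be combined. The sketch below is just a convenience (the -h flag suppresses the squeue header and -o %N prints only the node list) and assumes the single-node example job 3197426 from above:

[user@cllogin002]$ ssh $(squeue -h -j 3197426 -o %N)
[user@node003 ~]$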

Step 2: Open top to Monitor Resource Usage (Also Used on Single-Node Servers, e.g. Dune and Fennec)

Running top: Once logged into the node, launch top by typing the following command:

[user@node003 ~]$ top

This will display real-time system performance metrics, and it can get a bit messy.  You will see something similar to this:
 

top - 11:58:14 up 97 days, 23:18,  1 user,  load average: 12.00, 12.17, 12.18
Tasks: 542 total,   2 running, 540 sleeping,   0 stopped,   0 zombie
%Cpu(s): 12.6 us,  0.0 sy,  0.0 ni, 87.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 257376.9 total, 109789.4 free,  92372.4 used,  97778.7 buff/cache
MiB Swap:  16384.0 total,   5664.3 free,  10719.7 used. 165004.5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND 

 
While this may look like a wall of text at first, it tells us everything that is going on with the system, including the processes running on it (we will get to the processes a bit later on).  Let's break this down:
 
  1. Top - Time, Users, Load Average (First Line) - This tells us how long the system has been up, how many users are logged into it, and the load average.  Fun fact: the load average is a quick look at how busy the node is.  In this case, all nodes have 48 cores and the average over 1, 5 and 15 minutes is roughly 12.00, meaning about 12 cores are currently being used, or about 25% of the node.  As a rule of thumb, a load average roughly equal to the number of cores means the node is fully utilized but not overloaded, which is a great use of node resources (a quick way to check this yourself is sketched just after this list).
  2. Tasks - This tells you that 542 total processes exist on the system.  Of those, 2 are actively running on the CPU, 540 are sleeping (idle or waiting for something like input/output), and there are no stopped or zombie processes. 
  3. CPU Usage - This largely mirrors item 1: us (user): 12.6% is the percentage of time the CPUs spend running user processes (your jobs), while id (idle): 87.4% is time the CPUs are sitting idle.
  4. Memory and Swap - These next two lines break down memory usage: Total: 257,376.9 MiB is your total system RAM, Free: 109,789.4 MiB is memory not currently being used at all, Used: 92,372.4 MiB is memory actively used by running programs, and Buff/Cache: 97,778.7 MiB is memory used for caching to help speed things up, though this space can be reclaimed if needed.  Swap is a portion of disk space that acts like "emergency backup memory" when the system runs out of RAM.
  5. Usage Metric Definitions: PID (Process ID): The unique identifier for each process running on the system, USER: The user who owns the process, %CPU: The percentage of CPU the process is using, %MEM: The percentage of RAM the process is using, TIME+: Total CPU time the process has consumed, and COMMAND: The name of the command or program that initiated the process.
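
To put the load-average rule of thumb from item 1 into practice, you can compare the load average against the node's core count directly, or start top filtered to just your own processes. The sketch below uses standard Linux tools (nproc reports the core count, /proc/loadavg holds the same averages top displays, and top -u limits the view to one user); the output shown simply mirrors the example node above:

[user@node003 ~]$ nproc
48
[user@node003 ~]$ cat /proc/loadavg
12.00 12.17 12.18 2/542 2040169
[user@node003 ~]$ top -u $USER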

Now, to bring this full circle: if everything is working and your job is still running after reading all of this, you should see your job's process(es) listed, similar to the following:

2040169 user+  20   0  135.6g 124.8g  89.0g R 598.7  49.7 373184:56 R 
 
While your jobs will differ from this example, the line above is simply telling you the following: this process is actively running (state R), using about 125 GB of RAM and about 600% CPU (likely spread across 6 cores), and has been running for quite some time.  With a %MEM of 49.7, the process is also using about 50% of the node's available memory.  
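
If you would rather grab a one-time snapshot than watch top's live view, ps can report similar columns for just your own processes. This is only a sketch using standard ps options (process ID, CPU%, memory%, resident memory in kilobytes, elapsed time, and command name), sorted with the busiest process first:

[user@node003 ~]$ ps -u $USER -o pid,pcpu,pmem,rss,etime,comm --sort=-pcpu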
 

Step 3: Putting it all Together

When you submitted your job via sbatch on the command line, you should have done so with a Slurm script containing resource requests, something akin to the following:

#SBATCH --ntasks=1                      #Specifies the maximum number of tasks that can be executed in parallel
#SBATCH --cpus-per-task=24              #Used to run the multithreaded task using X CPU-cores
#SBATCH --mem=500G                      #Defines the amount of memory you need for your job to run (Required)

Using this as the example, I can look back at what top was reporting and determine whether I made the right resource requests. 

  1. This job requested 1 task and I am indeed only running one task.  
  2. This job requested 24 cores, but we can see that I am only effectively using about 6 cores.
  3. This job requested 500 GB of memory, yet it is only using about 125 GB of that.

With this information, I can update my job script to better utilize resources on the cluster with something like the following:

#SBATCH --ntasks=1                      #Specifies the maximum number of tasks that can be executed in parallel
#SBATCH --cpus-per-task=6               #Used to run the multithreaded task using X CPU-cores
#SBATCH --mem=130G                      #Defines the amount of memory you need for your job to run (Required)
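
For reference, here is a minimal sketch of how these directives might sit inside a complete submission script.  The job name and the my_program.sh line are placeholders, not part of the example above:

#!/bin/bash
#SBATCH --job-name=my_job               #Placeholder job name
#SBATCH --ntasks=1                      #Specifies the maximum number of tasks that can be executed in parallel
#SBATCH --cpus-per-task=6               #Used to run the multithreaded task using X CPU-cores
#SBATCH --mem=130G                      #Defines the amount of memory you need for your job to run (Required)

#Placeholder workload; replace this with your actual program or srun command
./my_program.sh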

 

Using seff to Check Job Efficiency After Job Completion (HPC Clusters)

Built into Slurm is a tool, seff, that can be used to check efficiency after a job has completed.  You will need your job ID, and you will need to run this in a timely fashion, as records are not kept indefinitely.  The seff command outputs key efficiency metrics, such as: Job Efficiency, the percentage of allocated resources (CPU, memory, etc.) that were effectively used by the job; Elapsed Time, the total runtime of the job; Requested vs. Allocated Resources, showing how much CPU or memory you requested versus what was allocated to your job; and Memory Utilization, showing how much of the allocated memory was actually used.

When you submit your job, you will be given a job ID number.  In this example, the job ID is 3198769 (note: your job ID will be different):

[user@cllogin002]$ sbatch yourjob.sbatch
Submitted batch job 3198769
#My Slurm resource request:
#SBATCH --ntasks=1                      #Specifies the maximum number of tasks that can be executed in parallel
#SBATCH --cpus-per-task=24              #Used to run the multithreaded task using X CPU-cores
#SBATCH --time=0-00:15:00               #Time requested for your job to run.  Format days-hours:minutes:seconds
#SBATCH --mem=500M                      #Defines the amount of memory you need for your job to run (Required)

Using seff after job ID 3198769 ran, you will see output similar to the following:

[user@cllogin002]$ seff 3198769
Job ID: 3198769
Cluster: slurm
User/Group: /domain users
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 24
CPU Utilized: 00:03:54
CPU Efficiency: 7.74% of 00:50:24 core-walltime
Job Wall-clock time: 00:02:06
Memory Utilized: 4.94 GB
Memory Efficiency: 1010.73% of 500.00 MB

We really only need to focus on the information from the State line onward.  With this information, we can see that my example job's resource requests could be refined.

  1.  State: COMPLETED (exit code 0) - The job ran without any errors, which is always nice.
  2. Nodes: 1, Cores per node: 24 -  My job used 1 node and was allocated 24 CPU cores on that node.
  3.  CPU Utilized: 00:03:54, CPU Efficiency: 7.74% of 00:50:24 core-walltime -  I requested 24 CPU cores but only used about 7.74% of that. That’s a sign I asked for more CPU than I needed and I should adjust my resource request.
  4. Job Wall-clock time: 00:02:06 - My job only ran for a little over 2 minutes yet I requested 15 minutes.
  5. Memory Utilized: 4.94 GB, Memory Efficiency: 1010.73% of 500.00 MB - My job used 4.94 GB of RAM even though I only requested 500 MB.  I under-requested memory, but the system let it slide because spare memory was available.  I need to adjust my sbatch script to avoid potential job failure on busy systems.

In summary, my job finished fine, but it used a lot more memory than requested and only a small fraction of the CPU resources. This will help me rewrite my resource request to something like the following:

#SBATCH --ntasks=1                      #No changes
#SBATCH --cpus-per-task=2               #Updated to 2 cores based on ~8% usage of the 24 cores I requested
#SBATCH --time=0-00:05:00               #Updated per Job Wall-clock time, as I don't need 15 minutes
#SBATCH --mem=5G                        #Updated to 5 GB per Memory Utilized

With these changes, I can make better use of the resources provided by SDSU RCI.
