
Introduction to Job Scheduling and SLURM


What’s a scheduler?

Schedulers divide a cluster's resources fairly among its users and keep jobs from interfering with one another on the compute nodes.

All computational jobs for any cluster EXCEPT Nobel should be submitted to the scheduler.

We use SLURM (Simple Linux Utility for Resource Management), which is one type of scheduling software.

SLURM basics

To run a job under SLURM you need a job script: a plain-text file of #SBATCH directives describing the resources you want, followed by the commands to run.

You should never run code directly on the head/login node except for ~5-10 minute test runs.

For detailed examples of these scripts, see Introduction Slurm.

We’ll be following a recipe for a serial job.

A sample script

#!/bin/bash
# serial job using 1 node and 1 processor,
# and runs for 1 minute (max).
#SBATCH -N 1   # node count
#SBATCH --ntasks-per-node=1  # core count
#SBATCH -t 00:01:00
# sends mail when process begins, and
# when it ends. Make sure you define your email
#SBATCH --mail-type=begin
#SBATCH --mail-type=end
# remove the space in this line and change it to your NetID or email.
#SBATCH --mail-user= yourNetID@princeton.edu

echo 'Hello world!'

SLURM scripts are more or less ordinary Bash scripts with some extra #SBATCH directives that pass job parameters to SLURM.

You probably won’t ever need a one minute allocation, but this is a Hello world! script.

In this case, rather than running a program or code, we're just using the echo command to print out 'Hello world!' in good programming tradition.

Submitting a job

Assuming you wrote that script to a file called test.cmd in your home directory:

cd ~
sbatch test.cmd

You'll receive either an error (if there is a problem with your script's syntax) or a job id.

You can use the job id to track your job, see its progress through the system, and see when it will run, etc.

Some quick utilities:

squeue -u username will show all active or pending jobs for username.

scontrol show jobid 12345 will show a detailed set of info about a job, including how many cores and nodes it requested, which node it is running on, and its status.

sshare -u username will show info about shares and usage for a particular username.

As a SLURM job runs, unless you redirect output, a file named slurm-<jobid>.out will be produced in your home directory. You can use cat, less, or any editor to view it. It contains the output your program would have written to a terminal if run interactively.
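Putting the pieces together, a typical submit-and-check sequence might look like this (the job id 123456 is just a placeholder; yours will differ):

sbatch test.cmd          # prints something like: Submitted batch job 123456
squeue -u yourNetID      # check whether the job is pending (PD) or running (R)
cat slurm-123456.out     # view the job's output once it has finished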

salloc and interactive testing on compute node

Say you have a job that might take days, but you want to make sure the code runs solidly for the first 30 minutes while keeping an eye on it.

You can’t do that on a login node, so what do you do?

salloc accepts the same options that follow #SBATCH in your SLURM script. However, once the allocation is granted, it drops you into an interactive shell on the compute node. That makes it a poor fit for requests of many hours or days, but it can get you 30 minutes pretty quickly.

salloc -N 1 -n 1 -t 00:20:00 will ask for an allocation of 1 node, 1 task, and 20 minutes. Once it's granted, you'll be in a shell where you can run processes directly; they will be killed once the time elapses.
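A minimal interactive session might look like the following (test.py is just a stand-in for whatever code you want to try out):

salloc -N 1 -n 1 -t 00:20:00
# once the prompt returns, you are on a compute node, not the login node
hostname          # confirm which node you are on
python test.py    # run your code and keep an eye on it
exit              # give the allocation back when you are done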

Considerations

Some things to think about:

Multicore Jobs

Sometimes you might want to run jobs using technologies like OpenMP or MPI. These are both ways of using more than one core. (There are certainly others, including array jobs.)

For these, you'll be adjusting two parameters in your SLURM script, and you need to be aware of the distinction between them: whether you want more tasks or more cores per task.

Let's say I want to run R, substituting the Intel MKL BLAS for the built-in one. MKL is multithreaded and will let me use more than one core on a node. (Though it will not let me use more than one node; for that you need MPI!)

#!/bin/bash
# multithreaded job using 1 node and 3 processors,
# and runs for 15 minutes (max).
#SBATCH -N 1   # node count
#SBATCH --ntasks-per-node=1
#SBATCH -c 3  # core count
#SBATCH -t 00:15:00

module load intel intel-mkl
LD_PRELOAD=$MKLROOT/lib/intel64/libmkl_rt.so /usr/bin/Rscript test.R

Here I adjust the -c parameter to tell SLURM that I want my single task to be able to use three cores (since MKL will happily use the CPU power that way).

(Unless you're an R user, don't worry too much about the LD_PRELOAD; that's just me forcing R to use the BLAS library that I would like, i.e. MKL.)
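One small addition you may want in a multithreaded script like the one above (not shown there, and assuming your code respects the standard OpenMP/MKL environment variables) is to pin the thread count to the cores you requested with -c:

# SLURM sets SLURM_CPUS_PER_TASK when -c / --cpus-per-task is used;
# matching the thread count to it avoids oversubscribing the node
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK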

In another situation, you might have an executable that uses MPI (Message Passing Interface) to use multiple cores, potentially even over multiple nodes.

#!/bin/bash
# parallel MPI job using 2 nodes and 40 processes
# (20 per node), and runs for 1 hour (max).
#SBATCH -N 2   # node count
#SBATCH --ntasks-per-node=20
#SBATCH --output=mpi_job_%j.out
#SBATCH --error=mpi_job_%j.err
#SBATCH -t 01:00:00

module load intel intel-mpi
srun ./a.out

This would request 20 tasks on each of 2 nodes, 40 processes in total, for an hour.
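If you want to sanity-check the task layout before running your real executable, one simple trick is to launch a trivial command with srun instead of ./a.out:

srun hostname    # runs one copy per task (40 total); expect 20 lines from each of the 2 nodes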

Array Jobs

Array jobs are a different way to parallelize your computations. These use a set of variables that Slurm will set for you in the job.

#!/bin/bash
# array job using 1 node and 1 processor,
# and runs for five minutes max per task.
#SBATCH -J array_example
#SBATCH --output=array_example%A_%a.out
#SBATCH -N 1   # node count
#SBATCH --ntasks-per-node=1
#SBATCH -t 00:05:00
#SBATCH --array=0-5

# A few special parameters are set in this case that we echo below:

echo "My SLURM_ARRAY_JOB_ID is $SLURM_ARRAY_JOB_ID."
echo "My SLURM_ARRAY_TASK_ID is $SLURM_ARRAY_TASK_ID"

This will produce output files named with the job id and the individual task id, each echoing its own subtask number. You can set the array numbers to any arbitrary set of values, so you can process a subset of a larger list by grabbing the value of $SLURM_ARRAY_TASK_ID. For example:

#SBATCH --array=0,100,200,300,400,500
./myprogram $SLURM_ARRAY_TASK_ID

This snippet shows a six-task array that passes increments of 100 to the program in question. Each task can then start processing a data frame, for example, at row 0, 100, 200, 300, 400, or 500 and stop iterating 99 rows later, so each task handles 100 rows. If these tasks run in parallel, you would complete 600 rows.
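As a rough sketch of that idea (data.csv and myprogram are hypothetical; adapt them to your own data and code), each array task could carve out its own 100-row slice:

# each array task processes a 100-row slice of data.csv,
# starting at the row number given by its task ID
START=$SLURM_ARRAY_TASK_ID
tail -n +$((START + 1)) data.csv | head -n 100 > chunk_${START}.csv
./myprogram chunk_${START}.csv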

GPUs

If your code can use a GPU, you might want to request one, as does the following script that wraps a Python job in tensorflow-gpu.

#!/bin/bash
# job using 1 node, 1 GPU, and 3 processors,
# and runs for 30 minutes (max).
#SBATCH -N 1   # node count
#SBATCH --ntasks-per-node=3  # core count
#SBATCH --gres=gpu:1
#SBATCH -t 00:30:00
# sends mail when process begins, and
# when it ends. Make sure you define your email
#SBATCH --mail-type=begin
#SBATCH --mail-type=end
# remove the space in this line!
#SBATCH --mail-user= yourNetID@princeton.edu

module load anaconda3
conda activate tf-gpu
python my-tf-script.py

This asks for a single GPU (of any type) via --gres=gpu:1. You can be more specific, too: --gres=gpu:tesla_v100:2 would ask for two Tesla V100s.
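If you want to confirm which GPU(s) your job actually received, one quick check you can add to the script (assuming the NVIDIA driver utilities are installed on the GPU nodes, as they typically are) is:

nvidia-smi    # lists the GPU(s) visible to this job, with utilization and memory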

module load?

You may have seen the module load commands in the previous scripts. Now that you've looked at some Slurm scripts, the tutorial on modules will make much more sense.
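As a quick preview, these are the module commands you will use most often (anaconda3 is just the example module loaded above; what is available varies by cluster):

module avail             # list the software modules available on the cluster
module load anaconda3    # add a module to your environment
module list              # show the modules currently loaded
module purge             # unload all modules and start clean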