Scheduling a Job¶
Slurm¶
The clusters run jobs based on a queue system provided by the software Slurm. Jobs are submitted to this scheduling software, assessed for priority, and then run on that cluster in order of priority. Priority is decided based on a few factors, such as how many jobs the user has run during that month on that cluster, how much time is requested for the job, how many cores and nodes are requested, and how long the job has waited. Individual user priority is reset at the start of every month.
Jobs must be run through the scheduler – jobs run outside the scheduler will be automatically stopped by the system, because the high CPU usage from a computational task can cause instability and crashes when other vital system processes have to compete for resources.
Writing a Basic Submission Script¶
Job scripts can be submitted to the scheduler using the command sbatch.
These job scripts (sometimes called "batch scripts", "sbatch scripts", or
"submission scripts") start with a header containing slurm parameters; the
lines read by sbatch all start with #SBATCH. These slurm parameters tell
the scheduler how to allocate resources. The header is followed by terminal
commands, which encompass what you actually want to run: shell commands,
scripts run through other programs like R or Python, or calls to specific
software.
Here we will discuss the slurm parameters and provide some basic options to help
you get started, but please see the
Slurm sbatch documentation for a
full list of options.
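For example, a minimal submission script might look something like this sketch (the values here are placeholders to adapt for your own job):
#!/bin/bash
#SBATCH -J my_job        # job name
#SBATCH -p bsudfq        # queue (partition)
#SBATCH -N 1             # number of nodes
#SBATCH -n 1             # number of tasks
#SBATCH -t 01:00:00      # time limit (hh:mm:ss)
# Your commands go below the #SBATCH header
echo "Hello from $(hostname)"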
Queues¶
The queue (or partition) parameter tells the scheduler which queue the job should be submitted to and is specified as follows:
#SBATCH -p (partition name)
You can see the list of available queues (partitions) with the command sinfo.
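For example, to submit a job to the short queue:
#SBATCH -p short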
The following table provides information about the queues on Borah:
Warning
“Preemptable” means that jobs submitted to this queue can be stopped and requeued if a higher priority job needs the resources. If the software you are using makes use of checkpointing, i.e., writing out intermediate data so the job doesn't need to restart from the beginning, you can make the best use of these queues.
| Queue | Node Type | Max Nodes/User | Time limit (days) | Preemptable? | Total nodes |
|---|---|---|---|---|---|
| bsudfq | CPU | 8 | 28 | No | 38 |
| bigmem | CPU (768 GB of RAM) | 1 | 7 | No | 1 |
| short | CPU | no limit | 7 | Yes | 39 |
| gpu | GPU | 2 | 7 | No | 6 |
| gpu-l40 | GPU | 1 | 7 | No | 2 |
| gpu-p100 | GPU | no limit | 7 | No | 1 |
| gpu-v100 | GPU | 2 | 7 | No | 4 |
| shortgpu | GPU | no limit | 7 | Yes | 9 |
| shortgpu-a30 | GPU | no limit | 7 | Yes | 1 |
| shortgpu-l40 | GPU | no limit | 7 | Yes | 2 |
| shortgpu-p100 | GPU | no limit | 7 | Yes | 1 |
| shortgpu-rtx8000 | GPU | no limit | 7 | Yes | 1 |
| shortgpu-v100 | GPU | no limit | 7 | Yes | 4 |
Borah hardware specifications
The default queue on Borah is bsudfq. Each node in this queue has 48 CPU cores and 192 GB of memory. See Borah's Specifications for more information.
Note about the gpu and shortgpu queues
The gpu and shortgpu partitions have heterogeneous node configurations. The gpu queue contains nodes with V100 (2/node) and L40 (4/node) GPUs, with 48 and 64 CPU cores, respectively. The shortgpu queue contains, in addition to the nodes in the gpu queue, nodes with A30 (2/node), P100 (2/node), and RTX8000 (8/node) GPUs, with 64, 28, and 48 CPU cores, respectively.
To request a specific GPU type when submitting to the gpu or shortgpu queue, add a gres option to your slurm script. For example, to request one L40:
--gres=gpu:L40:1
To request one GPU regardless of type:
--gres=gpu:1
Alternatively, you can submit to the queue for a specific GPU type; e.g., submitting to the gpu-v100 queue will ensure that you get a node with a V100 GPU.
The above table provides information about the specifications of nodes in
the available queues, but you can also see these resources from the
command-line using sinfo.
The following example queries the node name, CPU cores, memory, and resources of all the nodes in the shortgpu queue:
sinfo -p shortgpu -o "%n %c %m %G"
Requesting Resources¶
In order to know how many resources you should request for your job, it helps to understand a little about the different resources.
Node vs. CPU vs. task
A node is a computer or server within the cluster, e.g., the login node or a compute node like cpu101. When we talk about the node we are talking about the entire machine; this encompasses the memory, network card, CPU(s), GPU(s), hard drive, etc.
A CPU, generally, is the processing unit of a computer, but Slurm loosens this definition to mean the sockets, cores, or threads of a CPU, depending on the cluster's configuration. This Slurm resource provides more information.
A task is a part of a job parallelized using a distributed memory model, e.g., using MPI. For an overview of shared and distributed memory models, please check out Cornell Virtual Workshop: Memory Access.
If you are not sure how many resources to request, in general, we recommend requesting 1 node, 1 task, and a full node's worth of cores working on that task. For the default queue on Borah, this request would look like this:
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=48
If you are using a distributed memory framework, e.g., MPI, you may want to use more tasks with one or more core(s) per task. You can also distribute your job across multiple nodes by setting the node count to greater than one and the task count to greater than the number of cores per node. We recommend scaling multi-node jobs in increments of whole nodes; i.e., request a total core count that is a multiple of the number of cores per node (\(n_{cores/node} \times N\)).
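For example, here is a sketch of a request for two full nodes of MPI tasks on the default queue (assuming 48 cores per node, as on bsudfq):
#SBATCH --nodes=2
#SBATCH --ntasks=96
#SBATCH --cpus-per-task=1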
In addition to nodes and cores, you can also request that your job run on a GPU. To request a GPU, add the following to your submission script:
#SBATCH --gres=gpu:(number of GPUs requested)
Timing, Output, and Naming¶
The scheduler relies on the submission script to estimate how long a job will take. You can set a time limit for your job with the following line:
#SBATCH -t DD-HH:MM:SS
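For example, to set a limit of one day and twelve hours:
#SBATCH -t 01-12:00:00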
If you need to change the time limit after submitting, you can update it with:
scontrol update job JOBID timelimit=NEWTIMELIMIT
where JOBID is the job ID of the job you want to update and NEWTIMELIMIT is the new time limit.
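For example, to give a (hypothetical) queued job 12345 a two-day limit:
scontrol update job 12345 timelimit=2-00:00:00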
Updating time limits
You will only be able to increase the time limit of a job that is still queued. Once a job is running, you will only be able to decrease the time limit.
If you need more time on a currently running job, please email us at researchcomputing@boisestate.edu.
If you want to change the name of the file where the script output is logged, you can use the following line:
#SBATCH -o (output file name)
By default, output is logged to a file named slurm-(job number).out.
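Slurm expands %j in the file name to the job number, so, for example, the following would log output to myjob_12345.out for job 12345 (an illustrative file name):
#SBATCH -o myjob_%j.out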
If you want to give your job a unique name to identify it in the squeue output, you can do so by adding the following line to your script:
#SBATCH -J (job name)
Finally, if you want to make sure that your job has exclusive access to an entire node – regardless of how many resources it is using – you can add the following line to your submission script:
#SBATCH --exclusive
Example submission scripts¶
The final component of an sbatch script is the set of commands that will be run after the slurm parameters are parsed. These can be any shell command, e.g., loading modules, setting environment variables, or calling executables. Let's say you've created an MPI executable (maybe your code is something like hello_world.c); you could use the following script to submit this job to the scheduler requesting 48 cores on a single CPU node:
#!/bin/bash
#SBATCH -J HELLOWORLD # job name
#SBATCH -o log_slurm.o%j # output and error file name (%j expands to jobID)
#SBATCH -n 48 # total number of tasks requested
#SBATCH -N 1 # number of nodes you want to run on
#SBATCH -p bsudfq # queue (partition)
#SBATCH -t 00:05:00 # run time (hh:mm:ss) - 5 minutes in this example
# (optional) Print some debugging information
echo "Date = $(date)"
echo "Hostname = $(hostname -s)"
echo "Working Directory = $(pwd)"
echo ""
echo "Number of Nodes Allocated = $SLURM_JOB_NUM_NODES"
echo "Number of Tasks Allocated = $SLURM_NTASKS"
echo ""
# Run the program
module load borah-base openmpi/4.1.3/gcc/12.1.0
mpirun -n 48 ./hello_world
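Note that this script assumes hello_world has already been compiled. As a rough sketch, you could build it on the cluster with the same MPI module and OpenMPI's mpicc compiler wrapper:
module load borah-base openmpi/4.1.3/gcc/12.1.0
mpicc hello_world.c -o hello_world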
As another example, here’s a job script requesting a GPU node and then running nvidia-smi to see information about the GPU:
#!/bin/bash
#SBATCH -J NVIDIASMI # job name
#SBATCH -o log_slurm.o%j # output and error file name (%j expands to jobID)
#SBATCH -c 1 # cpus per task
#SBATCH -N 1 # number of nodes you want to run on
#SBATCH --gres=gpu:1 # request a gpu
#SBATCH -p gpu # queue (partition)
#SBATCH -t 00:05:00 # run time (hh:mm:ss)
# (optional) Print some debugging information
echo "Date = $(date)"
echo "Hostname = $(hostname -s)"
echo "Working Directory = $(pwd)"
echo ""
echo "Number of Nodes Allocated = $SLURM_JOB_NUM_NODES"
echo "Number of Tasks Allocated = $SLURM_NTASKS"
echo "GPU Allocated = $SLURM_JOB_GPUS"
echo ""
nvidia-smi
These commands will differ from job to job. If you need any help setting up a submission script, please don't hesitate to email us at ResearchComputing@boisestate.edu – we're happy to help!
Once your script is ready, you can submit it using
sbatch (job script filename)
The scheduler will respond with Submitted batch job (job number), which tells you that your job was successfully submitted.
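For example, submitting a (hypothetical) script named hello_world.sh would look like this, with an illustrative job number in the scheduler's response:
sbatch hello_world.sh
Submitted batch job 123456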
Other Useful Slurm Commands¶
scancel (job number)
: cancels a job; you can only cancel your own jobs.
squeue
: shows the current Slurm queue and all jobs in it.
squeue --me
: shows only your jobs in the current Slurm queue.
How to read squeue
A job that is currently running will have a state (ST) code of R. The NODELIST (REASON) column will tell you which nodes the job is running on or, if it is not yet running, why it is waiting. If a job has something like (Resources) or (Priority) in that column, it is currently sitting in the queue waiting to run: (Resources) means that the job has priority over the other jobs in the queue and is just waiting for resources to become available, while (Priority) means that there are other jobs in the queue with a higher priority that must run before that job will run.
scontrol show job (job number)
: shows detailed job information.