Also see Getting Started (Using Software) or slides from the BlueHive workshop.

slurm

BlueHive uses the Simple Linux Utility for Resource Management, or slurm, as the job scheduler.
The most common slurm commands you will need are

command             description
sbatch my.script    submits a job script to the scheduler
squeue              lists pending, running, or recently completed jobs
scancel JobID       cancels a pending or running job
sinfo -s            prints information about job partitions
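
For example, assuming my.script is an sbatch script you have written and 2615092 is a JobID previously reported by sbatch:

sbatch my.script           # submit the script to the scheduler
squeue -u YourNetIDHere    # list only your own jobs
scancel 2615092            # cancel the job with that JobID
sinfo -s                   # summarize the partitions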

Jobs are submitted to a queue, or partition, and each partition has different resource limits associated with it. The best partition to use will depend on

  • The total wall time required
  • The amount of memory required per process
  • The number of cores required per process
  • The number of total processes
  • Whether you need a CUDA-capable GPU
  • Whether you need a 3D accelerated remote desktop
  • Whether you need an Intel Phi coprocessor

Here is a table of the publicly available queues along with job and user limits.

Partition        Description                             Wall-Time   Nodes  Cores/Node  Memory/Node  Nodes/User  Cores/User  Jobs/User
debug            For short debugging jobs                1:00:00     20     22          62GB         --          48          2
gpu              For jobs using GPU coprocessors         5-00:00:00  33     36          125GB        25          120         12
gpu-debug        For short debugging jobs with a GPU     1:00:00     5      24          62GB         --          --          1
gpu-interactive  For interactive sessions with a GPU     8:00:00     1      36          125GB        --          --          2
interactive      For interactive sessions                8:00:00     17     22          125GB        --          40          2
phi              For jobs using Intel Phi coprocessors   5-00:00:00  8      24          62GB         --          120         8
preempt          For running jobs that can be preempted  1-00:00:00  132    64          1000GB       --          120         120
standard         Default partition                       5-00:00:00  114    36          2929GB       --          120         120
visual           For GPU-accelerated visualization       5-00:00:00  10     12          23GB         --          96          1
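
You can also query the current partition configuration and limits directly on BlueHive with the standard slurm commands, for example:

sinfo -s                           # one-line summary of each partition
scontrol show partition standard   # detailed settings for a single partition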

A first sbatch script

#!/bin/sh
#SBATCH --partition=debug --time=00:05:00 --output=myoutput.txt
hostname

This script will simply print out the hostname of the compute node that it runs on. You can save the script in a file called my.sbatch and submit it by typing sbatch my.sbatch. When you do so, you should see a line such as

Submitted batch job 2615092    

The number is the JobID, which you can use to cancel the job or request more information about it. After you submit the job, you can type squeue -u YourNetIDHere. Your job should be listed. The fifth column is a code that gives the current state of the job: PD for pending (waiting to run), R for running, and CD for completed. After the job is completed, a file myoutput.txt should be created containing a line such as

bhc0001
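
To request more information about a job, you can query it by JobID with the standard slurm commands, for example:

scontrol show job 2615092                          # full details while the job is pending or running
sacct -j 2615092 -o JobID,State,Elapsed,MaxRSS     # accounting summary, also available after the job finishes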

Additional options to sbatch scripts

Many options can be selected using either a long name, such as --partition, or a short name, such as -p. You can give several options on a single line, or put each one on a separate line. You can put the options in your scripts by preceding them with #SBATCH, or give them as command-line arguments to the sbatch command.

#SBATCH -p standard           partition or queue to run
#SBATCH -c 4                  number of cpus per task, for a multithreaded job
#SBATCH -n 6                  number of tasks, for an MPI job
#SBATCH --mem-per-cpu=1gb     memory per core required
#SBATCH -t 0-01:00:00         walltime D-HH:MM:SS (here, one hour)
#SBATCH -J my_job             Name of your job
#SBATCH -o my_output%j        File for standard out - here the %j will be replaced by the JobID
#SBATCH -e my_error%j         File for standard error.  If not specified, it will go to the same file as standard out.
#SBATCH --mail-type=begin     When to send e-mails pertaining to your job.  Can be any of [begin, end, fail, requeue, or all]
#SBATCH --mail-user=email     use another email address instead of the one in your ~/.forward file
#SBATCH --gres=gpu:1          requests 1 or 2 (--gres=gpu:2) GPU coprocessors per node (requires selecting the gpu or gpu-debug partition with -p gpu or -p gpu-debug).
#SBATCH --gres=mic:1          requests 1 or 2 (--gres=mic:2) Intel Phi coprocessors per node (requires selecting the phi partition with -p phi).
#SBATCH --reservation=RName   requests the reservation named "RName"
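
The same options can also be given on the sbatch command line rather than inside the script, for example:

sbatch -p standard -n 6 --mem-per-cpu=1gb -t 0-01:00:00 -J my_job my.sbatch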

Instead of specifying the number of tasks and the memory per cpu, you can specify the requirements per node and the number of nodes using

#SBATCH -N 1                  Number of nodes.  
#SBATCH --mem=24gb            Memory required per node (you can give MB or GB; if no unit is given, MB is assumed)
#SBATCH --ntasks-per-node=24  Number of tasks per node.

You should not overspecify the number of tasks by using both -n and the combination of -N and --ntasks-per-node, or overspecify the memory required by using both --mem and --mem-per-cpu.
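
For example, the following two specifications request the same total resources (24 tasks and 24GB of memory), the first per core and the second per node; use one style or the other, not both:

#SBATCH -n 24 --mem-per-cpu=1gb
#SBATCH -N 1 --ntasks-per-node=24 --mem=24gb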

Sample Serial Program

This script will run a simple serial program using a single core and 1 MB of memory. Note that the -c 1 and -n 1 options are the defaults and are not necessary to specify in the sbatch script.

#!/bin/bash
#SBATCH -J my_jobname
#SBATCH -o my_output_%j
#SBATCH --mem-per-cpu=1MB
#SBATCH -t 10:00:00
#SBATCH -n 1 
#SBATCH -c 1
my.serial.program
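
After a job like this finishes, sacct can show how much time and memory it actually used, which helps in choosing reasonable --mem-per-cpu and -t values for future runs:

sacct -j JobID -o JobID,State,Elapsed,MaxRSS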

Sample MPI Program

Here is an example of a script for running an MPI program on 2 nodes (48 cores) with a job name, a designated output file, a time request of 10 hours, and a memory request of 1 MB per CPU (48 MB total)

#!/bin/bash
#SBATCH -J my_jobname
#SBATCH -o my_output_%j
#SBATCH --mem-per-cpu=1MB
#SBATCH -t 10:00:00
#SBATCH -N 2 
#SBATCH --ntasks-per-node=24
module load openmpi/1.6.5/b2 # should be whichever MPI was used to compile the program
mpirun myprogram

Sample OpenMP Program

Here is an example of a script for running a multithreaded (OpenMP) program on a single node (24 cores)

#!/bin/bash
#SBATCH -J my_jobname
#SBATCH -o my_output_%j
#SBATCH --mem-per-cpu=1MB
#SBATCH -t 10:00:00
#SBATCH -n 1
#SBATCH -c 24
export OMP_NUM_THREADS=24
my.omp.program
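
Since slurm sets the environment variable SLURM_CPUS_PER_TASK to the value given with -c, the thread count can also be taken from it instead of being repeated by hand:

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK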

Sample MPI/OpenMP program

Here is a script for a hybrid MPI/OpenMP program to run 2 tasks and 24 threads per task, on two nodes (48 cores)

#!/bin/bash
#SBATCH -J my_jobname
#SBATCH -o my_output_%j
#SBATCH --mem-per-cpu=1MB
#SBATCH -t 10:00:00
#SBATCH -N 2 
#SBATCH --ntasks-per-node=1 
#SBATCH -c 24
export OMP_NUM_THREADS=24
mpirun my.mpi.omp.program

Job arrays

To submit several jobs that are (almost) identical, you can use a job array. For example,

#!/bin/bash
#SBATCH -o out.%a.txt -t 00:05:00
#SBATCH -a 1-10
echo This is job $SLURM_ARRAY_TASK_ID

If this script is submitted with sbatch, 10 jobs will actually be submitted. The job will create 10 output files, out.1.txt through out.10.txt, containing the line This is job 1 through This is job 10 respectively. Both the %a in the output filename and the variable $SLURM_ARRAY_TASK_ID will be replaced by the job array index. In practice, job arrays can be used to run many similar jobs with different input files whose names contain the array index, or the array index could be passed as a command-line argument to the program.
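
For example, a script along the following lines (myprogram and its input files are placeholders) would run myprogram once for each of the input files in.1 through in.10:

#!/bin/bash
#SBATCH -o out.%a.txt -t 00:05:00
#SBATCH -a 1-10
# the array index selects the input file; it could equally be passed as an ordinary argument
myprogram in.$SLURM_ARRAY_TASK_ID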

Note that there is a total limit of 2000 jobs in the queue (either running or pending). So if a very large number of jobs need to be run, it is probably best to use a combination of a job array and a loop inside the job script itself. For instance, if 5000 jobs need to be run (say a program that reads input parameters, and 5000 different sets of parameters are to be used), the input parameters could be placed in files in.1.1 through in.50.100, and the following script used:

#!/bin/bash
#SBATCH -o log.%a.txt -t 00:05:00
#SBATCH -a 1-50
for i in {1..100}; do
  ip=$SLURM_ARRAY_TASK_ID.$i
  myprogram in.$ip > out.$ip
done

This will run 50 separate jobs, each of which calls myprogram 100 times with a different set of parameters.
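
Individual tasks of a job array can be inspected or cancelled by appending the array index to the JobID. For example, if sbatch reported job 2650000 (a made-up JobID):

squeue -u YourNetIDHere    # array tasks are listed in the form JobID_index
scancel 2650000_7          # cancel only task 7 of the array
scancel 2650000            # cancel the entire array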

Dependencies

Sometimes you need to run jobs in serial, so that job B doesn't start until job A has completed.
You can do this with the --dependency option. For instance,

sbatch A.sh

which will submit A.sh and give you the JobID, for example

Submitted batch job 2643976

Then you can type

sbatch --dependency=afterok:2643976 B.sh

In this way, the job in B.sh will not run until the job in A.sh has completed successfully.
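
If you are submitting both jobs from a script, the --parsable option to sbatch prints only the JobID, which makes it easy to capture; a minimal sketch:

jobA=$(sbatch --parsable A.sh)
sbatch --dependency=afterok:$jobA B.sh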

Running a pipeline of jobs with dependencies

The script sbatch-pipeline will submit a pipeline of several jobs, each of which might depend on previous jobs finishing successfully.

bash-4.1$ sbatch-pipeline --help
usage: sbatch-pipeline [InFile]
Submit a pipeline of jobs to sbatch, with dependencies
Input is a series of lines. 
Jobs on each line will not run until jobs on the previous line have finished successfully.
Jobs on the same line can run at the same time.
Jobs can be arrays.  Subsequent jobs will wait for all jobs in the array to finish.
If no InFile is given, input will be taken from standard in.
Example: if a,b,c,d,e,f are sbatch scripts, and InFile contains the lines
a b
c
d e f
then running 
sbatch-pipeline InFile 
will echo and run the following sbatch commands
sbatch a
Submitted batch job 7515617
sbatch b
Submitted batch job 7515618
sbatch --dependency=afterok:7515617:7515618 c
Submitted batch job 7515619
sbatch --dependency=afterok:7515619 d
Submitted batch job 7515620
sbatch --dependency=afterok:7515619 e
Submitted batch job 7515621
sbatch --dependency=afterok:7515619 f
Submitted batch job 7515622

Local scratch storage

Each compute node has its own local disk. When a job starts, the directory /local_scratch/$SLURM_JOB_ID is created on that disk, and it is removed when the job ends. That directory may be used for temporary files needed during the job. Any output must be copied back to /scratch or /home before the job completes.
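
A typical pattern is to stage input into the local scratch directory, run there, and copy the results back before the job ends. A minimal sketch (my.program and the file names are placeholders):

#!/bin/bash
#SBATCH -p standard -t 01:00:00 -o scratch_example_%j
cp input.dat /local_scratch/$SLURM_JOB_ID     # stage input onto the node-local disk
cd /local_scratch/$SLURM_JOB_ID
my.program input.dat > output.dat             # temporary files stay on the local disk
cp output.dat $SLURM_SUBMIT_DIR               # copy results back before the job completes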

Additional information

Type man sbatch, man squeue, or man scancel for the full documentation of these commands.