Running batch jobs

Submit jobs via the SLURM queueing system

This is the preferred method both for submitting batch jobs to the cluster queueing system and for running jobs interactively.


Available queues:

The cluster queueing system provides the job queues (partitions) listed below. They are tailored to medium and large production runs on the one hand, and to fast turnaround for debugging tasks and short jobs on the other.

PARTITION   TIMELIMIT    NODES
astro2      10-00:00:00    32
astro_gpu   10-00:00:00    30
astro_phi       8:00:00    39
astro_devel     2:00:00   139
astro_short    12:00:00   137
astro_long   5-00:00:00   100
astro_fe        6:00:00     4

The astro2 nodes are part of the old cluster; you may want to avoid using them, and they may be removed from the configuration at some point. The queues astro_devel, astro_short, and astro_long share the same pool of nodes (currently 139). They differ in their time limits (see the table above) and queueing priorities. The partition astro_fe is for running jobs on the login nodes (front end). The astro_gpu queue is used to access the GPU nodes.


Most important queue commands:

Here we list the most commonly used queueing commands. If you are migrating from a different scheduling system, this cheat sheet may be useful for you. There also exists a compact two-page overview of the most important commands.

Use the 'sinfo' command to display information about available resources:

If you use the command without any options, it will display all available partitions. Use the -p switch to select a specific partition, for instance:

astro06:> sinfo -p astro2
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
astro2 up 10-00:00:0 1 down* node458
astro2 up 10-00:00:0 13 alloc node[454-457,459-462,480-481]
astro2 up 10-00:00:0 18 idle node[463-479,482]

The command displays how many nodes in the partition are offline (down), busy (alloc), and still available (idle). For each state, a NODELIST is displayed. The TIMELIMIT column shows the maximum job duration allowed for the partition in days-hours:minutes:seconds format. You can find more information about how to use the sinfo command on the official SLURM man pages.
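
When scripting around sinfo, the per-state node counts can be summed directly from its output. The following is only a minimal sketch and not part of the cluster tooling: the function name count_nodes_in_state is hypothetical, and it assumes the default column layout shown above (NODES in column 4, STATE in column 5).

```shell
#!/bin/bash
# Hypothetical helper: sum the NODES column (4) of default 'sinfo' output
# for all rows whose STATE column (5) equals the given state name.
# Reads sinfo output on stdin; a trailing '*' (as in 'down*') is ignored.
count_nodes_in_state() {
    awk -v state="$1" '
        NR > 1 {                      # skip the header line
            s = $5
            sub(/\*$/, "", s)         # strip the "not responding" marker
            if (s == state) total += $4
        }
        END { print total + 0 }'
}

# Usage on the cluster (assumes sinfo is in PATH):
#   sinfo -p astro2 | count_nodes_in_state idle
```

Applied to the sample output above, the idle count would be 18.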

Use the 'squeue' command to display information about scheduled jobs:

astro06:> squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 566136    astro2 jobname1 username  R      47:22      1 node485
 566135    astro2 jobname2 username  R      54:22      1 node481

The command displays a table with useful information. Use the JOBID of your job to modify or cancel a scheduled or already running job (see below). The status column (ST) shows the state of the queued job. The codes stand for: PD (pending), R (running), CA (cancelled), CG (completing), CD (completed), F (failed), TO (timeout), and NF (node failure). A useful command line switch for squeue is -u (or --user), which lists only the jobs that belong to a specific user. You can find more information about how to use the squeue command on the official SLURM man pages.
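
The state codes listed above can be turned into readable text with a small lookup. This is a sketch only: the function name describe_state is hypothetical, and the squeue options in the usage comment (-h to drop the header, -o with %i for the job ID and %t for the compact state) should be checked against the squeue man page on your system.

```shell
#!/bin/bash
# Hypothetical helper: translate a compact squeue state code (ST column)
# into the description given in the text above.
describe_state() {
    case "$1" in
        PD) echo "pending" ;;
        R)  echo "running" ;;
        CA) echo "cancelled" ;;
        CG) echo "completing" ;;
        CD) echo "completed" ;;
        F)  echo "failed" ;;
        TO) echo "timeout" ;;
        NF) echo "node failure" ;;
        *)  echo "unknown ($1)" ;;
    esac
}

# Usage on the cluster: list your own jobs with a readable state.
#   squeue -h -u "$USER" -o "%i %t" | while read -r id st; do
#       echo "$id: $(describe_state "$st")"
#   done
```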

Use the 'scancel' command to cancel a scheduled or running job:

astro06:> scancel 566136

You can find more information about how to use the scancel command on the official SLURM man pages.

Use the 'srun' command to run jobs interactively:

You can run serial, OpenMP-parallel, or MPI-parallel code interactively using the srun command. Always make sure to specify the partition to run on via the -p command line switch. When running an MPI job, you can use the -n switch to specify the number of MPI tasks that you require. Command line arguments for your program can be passed at the end.

astro06:> srun -p astro_devel -n 20 <executable> [args...]

You can find more information about how to use the srun command on the official SLURM man pages.

Use the 'sbatch' command to queue a job via a submission script:

astro06:> sbatch [additional options] job-submission-script.sh

You can find more information about how to use the sbatch command on the official SLURM man pages.
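
On success, sbatch prints a line of the form "Submitted batch job <jobid>". Since the job ID is needed for scancel and for locating output files, it can be convenient to capture it. The following is a minimal sketch with a hypothetical function name (job_id_from_sbatch), assuming that output format.

```shell
#!/bin/bash
# Hypothetical helper: extract the job ID from the line that sbatch prints
# on success, e.g. "Submitted batch job 566136". Reads that line on stdin.
job_id_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Usage on the cluster:
#   jobid=$(sbatch job-submission-script.sh | job_id_from_sbatch)
#   scancel "$jobid"
```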


Simple batch script for running serial programs:

Submission scripts are really just shell scripts (here we use bash syntax) with a few additional variable specifications at the beginning. These are prefixed with "#SBATCH" and otherwise use the same keywords and syntax as the command line options described for the sbatch command. The following script presents a minimal example:

#!/bin/bash
#SBATCH --job-name=isothermal # shows up in the output of 'squeue'
#SBATCH --partition=astro_devel # specify the partition to run on
srun /bin/echo "Hello World!"

Specifying mail notifications and managing output and error files:

You can enable e-mail notifications for various events via the --mail-type option, which takes the following values: BEGIN, END, FAIL, REQUEUE, and ALL.

#SBATCH --mail-type=FAIL
#SBATCH --output=/astro/username/que/job.%J.out
#SBATCH --error=/astro/username/que/job.%J.err

The standard output and standard error can be written to specific files with the --output and --error options. By default, both are written to a single file named slurm-%j.out, where %j is replaced with the job number.
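
To find the output file of a given job from a script, the %j substitution can be reproduced by hand. This is an illustrative sketch only: the function name expand_job_pattern is hypothetical, and it handles just the %j placeholder, not the other patterns that SLURM supports.

```shell
#!/bin/bash
# Hypothetical helper: replace every %j in a file name pattern with a
# numeric job ID, mimicking only this one SLURM substitution.
expand_job_pattern() {
    printf '%s\n' "$1" | sed "s/%j/$2/g"
}

# Default output file name for job 566136
expand_job_pattern "slurm-%j.out" 566136   # prints slurm-566136.out
```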


Simple batch script for running MPI-parallel jobs:

Note: The following example script assumes that you submit the script from the directory where the code will be executed (the path to that directory is stored in the $SLURM_SUBMIT_DIR environment variable, and is where SLURM will execute the script from).

#!/bin/bash
#
# SLURM resource specifications
# (use an extra '#' in front of SBATCH to comment-out any unused options)
#
#SBATCH --job-name=isothermal # shows up in the output of 'squeue'
#SBATCH --time=4-23:59:59 # specify the requested wall-time
#SBATCH --partition=astro_long # specify the partition to run on
#SBATCH --nodes=4 # number of nodes allocated for this job
#SBATCH --ntasks-per-node=20 # number of MPI ranks per node
#SBATCH --cpus-per-task=1 # number of OpenMP threads per MPI rank
##SBATCH --exclude=<node list> # avoid nodes (e.g. --exclude=node786)

# Load default settings for environment variables
source /users/software/astro/startup.d/modules.sh

# If required, replace specific modules
# module unload intelmpi
# module load mvapich2

# When compiling remember to use the same environment and modules

# Execute the code
srun <executable> [args...]
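
The requested wall-time must not exceed the time limit of the chosen partition. The following sketch checks this before submission. The function names are hypothetical, and only the [days-]hours:minutes:seconds form used in this document is handled; other time formats accepted by SLURM are not.

```shell
#!/bin/bash
# Hypothetical helper: convert a time given as [days-]hours:minutes:seconds
# into seconds. Other formats accepted by SLURM are not handled here.
timelimit_to_seconds() {
    printf '%s\n' "$1" | awk -F'[-:]' '
        NF == 4 { print $1 * 86400 + $2 * 3600 + $3 * 60 + $4 }
        NF == 3 { print $1 * 3600 + $2 * 60 + $3 }'
}

# Succeeds (exit status 0) if the requested wall-time ($1) does not
# exceed the partition limit ($2).
walltime_fits() {
    [ "$(timelimit_to_seconds "$1")" -le "$(timelimit_to_seconds "$2")" ]
}

# The script above requests 4-23:59:59 on astro_long (limit 5-00:00:00).
if walltime_fits 4-23:59:59 5-00:00:00; then
    echo "requested wall-time fits the partition limit"
fi
```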

Simple batch script for jobs using OpenMP only:

Note: The following example script assumes that you submit the script from the directory where the code will be executed (the path to that directory is stored in the $SLURM_SUBMIT_DIR environment variable).

Note: By choosing --cpus-per-task=40 along with --ntasks-per-core=2, you assume that your program benefits from hyper-threading. If this is not the case, or you are uncertain, use --cpus-per-task=20 along with --ntasks-per-core=1, and your program will be executed with 20 threads.

#!/bin/bash
#
# Define SLURM resource specifications
# (use an extra '#' in front of SBATCH to comment-out any unused options)
#
#SBATCH --job-name=isothermal # shows up in the output of 'squeue'
#SBATCH --time=4-23:59:59 # specify the requested wall-time
#SBATCH --partition=astro_long # specify the partition to run on
#SBATCH --cpus-per-task=40 # total number of threads
#SBATCH --ntasks-per-core=2 # threads per core (hyper-threading)
##SBATCH --exclude=<node list> # avoid nodes (e.g. --exclude=node786)

# Load default settings for environment variables
source /users/software/astro/startup.d/modules.sh

# OpenMP affinity
# no hyperthreading
# export KMP_AFFINITY="granularity=core,scatter,1,0"
# hyperthreading
export KMP_AFFINITY="granularity=thread,scatter,1,0"

# When compiling remember to use the same environment

# Execute the code
srun --cpu_bind=threads <executable> [args...]

Note: When compiling your code with the -L/usr/lib64/libslurm.so -lpmi linking options, you do not have to set the thread count explicitly (for example with export OMP_NUM_THREADS=20) when setting up the environment variables.
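
Where the thread count does need to be set explicitly, one common approach is to derive it from SLURM_CPUS_PER_TASK, which SLURM sets in the job environment when --cpus-per-task is given. This is a sketch, not a site recommendation: the function name set_omp_threads is hypothetical, and the fallback to a single thread is an assumption.

```shell
#!/bin/bash
# Hypothetical helper: set OMP_NUM_THREADS from SLURM_CPUS_PER_TASK,
# falling back to a single thread when the variable is unset or empty
# (for example when testing the script outside of a SLURM job).
set_omp_threads() {
    OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
    export OMP_NUM_THREADS
}

set_omp_threads
echo "running with $OMP_NUM_THREADS OpenMP thread(s)"
```

In a submission script, this would go before the srun line.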


Hybrid MPI + OpenMP batch script:

Note: The following example script assumes that you submit the script from the directory where the code will be executed (the path to that directory is stored in the $SLURM_SUBMIT_DIR environment variable, and is where SLURM will execute the script from).

#!/bin/bash
#
# Define SLURM resource specifications
# (use an extra '#' in front of SBATCH to comment-out any unused options)
#
#SBATCH --job-name=isothermal # shows up in the output of 'squeue'
#SBATCH --time=4-23:59:59 # specify the requested wall-time
#SBATCH --partition=astro_long # specify the partition to run on
#SBATCH --nodes=32 # number of nodes allocated for this job
#SBATCH --ntasks-per-node=8 # lower than the usual 20 for MPI only
#SBATCH --cpus-per-task=5 # number of CPUs per MPI rank
#SBATCH --ntasks-per-core=2 # threads per core (hyper-threading)
##SBATCH --exclude=<node list> # avoid nodes (e.g. --exclude=node786)

# Load default settings for environment variables
source /users/software/astro/startup.d/modules.sh

# OpenMP affinity
export KMP_AFFINITY="granularity=thread,scatter,1,0"

# If required, replace specific modules
# module unload intelmpi
# module load mvapich2

# When compiling remember to use the same environment and modules

# Execute the code
cd $SLURM_SUBMIT_DIR
srun --cpu_bind=threads <executable> [args...]
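
The hybrid layout above can be sanity-checked with simple arithmetic: 8 ranks per node times 5 CPUs per rank gives 40 logical CPUs per node. The following sketch automates that check. The function name check_layout is hypothetical, and the default of 40 logical CPUs per node is an assumption inferred from the examples in this document (20 cores with 2 hyper-threads each); adjust it for other node types.

```shell
#!/bin/bash
# Hypothetical helper: check that ranks-per-node times CPUs-per-rank uses
# exactly the logical CPUs of a node. The default of 40 is an assumption
# taken from the examples above (20 cores x 2 hyper-threads).
check_layout() {
    ntasks_per_node=$1
    cpus_per_task=$2
    cpus_per_node=${3:-40}
    used=$(( ntasks_per_node * cpus_per_task ))
    if [ "$used" -eq "$cpus_per_node" ]; then
        echo "ok: $used of $cpus_per_node logical CPUs used per node"
    else
        echo "mismatch: $used of $cpus_per_node logical CPUs used per node"
        return 1
    fi
}

# Values from the hybrid script above: 8 ranks per node, 5 CPUs per rank.
check_layout 8 5
```

A third argument overrides the per-node CPU count, for example check_layout 20 1 20 for an MPI-only run without hyper-threading.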

Running jobs on the astro_gpu nodes:

The astro_gpu nodes can be used in a number of ways:

  • As CPU-only nodes: this can be done exactly as on any of the other nodes, using for example the combination of the Intel compilers and Intel MPI. Note that these nodes have Westmere cores and do not support AVX instructions. If the code has been compiled on astro06-09 with the -xHost flag, it will fail.
  • Using CUDA: Load the cuda module; link with either Intel MPI or MVAPICH2 to build MPI+CUDA programs. Currently there are no active users using CUDA directly, so you are on your own. CUDA up to version 7.0 is supported:
    module load cuda/7.0
  • Using the PGI compiler and either PGI Fortran or OpenACC to exploit the GPUs:
    module unload intel intelmpi
    module load pgi mvapich2

Given the hardware and compiler setup, a good choice of PGI compiler flags for optimal performance with CUDA-Fortran is:

-fastsse -Mcuda=5.5,cc13,cc20 -tp nehalem-64 -Mvect=simd:128 -mp

Here is a sample script for executing a code compiled with PGI and MVAPICH2. It uses a hybrid (MPI+OpenMP) layout with 2 CPU cores / 4 threads assigned to each GPU and hyper-threading active:

#!/bin/bash
#
# Define SLURM resource specifications
# (use an extra '#' in front of SBATCH to comment-out any unused options)
#
#SBATCH --job-name=pgi-astro_gpu # shows up in the output of 'squeue'
#SBATCH --time=4-23:59:59 # specify the requested wall-time
#SBATCH --partition=astro_gpu # specify the partition to run on
#SBATCH --nodes=30 # use the full partition
#SBATCH --gres gpu:4 # reserve all 4 GPUs on each node
#SBATCH --ntasks-per-node=4 # equal to the number of GPUs
#SBATCH --cpus-per-task=4 # number of CPU threads per MPI rank
#SBATCH --ntasks-per-core=2 # threads per core (hyper-threading)
##SBATCH --exclude=<node list> # avoid nodes (e.g. --exclude=node786)

# Load default settings for environment variables
source /users/software/astro/startup.d/modules.sh

module unload intelmpi intel
module load pgi mvapich2/pgi

# When compiling remember to use the same environment and modules

# Execute the code
srun --cpu_bind=threads <executable> [args...]

Running jobs on Xeon-Phi boards:

See here for detailed information.