Slurm and relion usage

So everything in our cluster with the Slurm queuing system seems to be up and running. What is the user experience like? In our job script, we need to take care of a few VIP parameters:

#SBATCH --nodes=2
#SBATCH --ntasks=40
#SBATCH --ntasks-per-node=40
#SBATCH --cpus-per-task=2
#SBATCH --overcommit

The parameters are mostly self-explanatory. Maybe the only one that needs some explanation is cpus-per-task. Setting it to 2 lets each task run two threads, so a single process can reach 200% CPU usage. Our relion command line looks something like this:

mpirun -np 80 relion_refine_mpi --o Refine3D/run_parallel_nice \
--auto_refine --split_random_halves --i particles.star 350 \
--angpix 1.77 --ref Refine3D/run-03-08-14_class001.mrc \
--ini_high 60 --ctf --ctf_corrected_ref --flatten_solvent \
--zero_mask --oversampling 1 --healpix_order 2 \
--auto_local_healpix_order 4 --offset_range 5 \
--offset_step 2 --sym C1 --low_resol_join_halves 40 \
--norm --scale --j
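Putting the header flags and the mpirun call together, a job script for a run like this might look like the sketch below. The file name, shebang, and the abbreviated relion flag list are assumptions for illustration, not the exact script from this cluster:

```shell
#!/bin/bash
# refine.sh -- hypothetical job script combining the VIP parameters above
#SBATCH --nodes=2
#SBATCH --ntasks=40
#SBATCH --ntasks-per-node=40
#SBATCH --cpus-per-task=2
#SBATCH --overcommit

# Launch 80 MPI ranks of RELION auto-refine (flag list abbreviated here;
# see the full command line quoted in the post).
mpirun -np 80 relion_refine_mpi --o Refine3D/run_parallel_nice \
    --auto_refine --split_random_halves --i particles.star
```

You would submit it with sbatch refine.sh and watch it appear in squeue.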

We have 8 nodes, with 40 CPUs each. If we want to launch 80 processes, we need to use at least 2 nodes. We can write it this way:

#SBATCH --nodes=2
#SBATCH --ntasks=80

or this way

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40

Which one is best? Unfortunately, in this case it does not matter much. But if you ask for 2 nodes, their state goes to allocated or mixed depending on

#SBATCH --overcommit
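As a rough sketch of why --overcommit matters here (using the numbers from the first header above, which is an assumption on my part): 80 ranks at 2 CPUs per task ask for more CPUs than two 40-CPU nodes physically provide, and --overcommit is what lets Slurm accept that request anyway:

```shell
# CPUs the job asks for: 80 MPI ranks x 2 cpus-per-task
REQUESTED=$(( 80 * 2 ))
# CPUs physically available: 2 nodes x 40 CPUs each
AVAILABLE=$(( 2 * 40 ))
echo "requested=$REQUESTED available=$AVAILABLE"   # requested=160 available=80
```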

So here we have a sample sinfo -lN output:

Mon Aug 29 16:48:33 2016
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
node101 1 debug* allocated 40 2:20:1 1 0 1 (null) none 
node102 1 debug* allocated 40 2:20:1 1 0 1 (null) none 
node103 1 debug* mixed 40 2:20:1 1 0 1 (null) none 
node104 1 debug* mixed 40 2:20:1 1 0 1 (null) none 
node105 1 debug* idle 40 2:20:1 1 0 1 (null) none 
node106 1 debug* idle 40 2:20:1 1 0 1 (null) none 
node107 1 debug* mixed 40 2:20:1 1 0 1 (null) none 
node108 1 debug* mixed 40 2:20:1 1 0 1 (null) none

And the corresponding squeue:

JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
 70   debug mpirun.0 pepito R 6:27 4 sbnode[103-104,107-108]
 71   debug Class2D juanito R 2:01 2 sbnode[101-102]

So in this example, user pepito runs on 4 nodes with overcommitment, while user juanito just asked for 80 processes in total. Allocation of resources is crucial: if you don't do your math, you will end up with an MPI "not enough slots available" error.
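Doing that math is a one-liner. A minimal sketch of the slot arithmetic, assuming this cluster's 40 CPUs per node:

```shell
# Minimum nodes needed for a given rank count = ceil(NTASKS / CPUS_PER_NODE)
CPUS_PER_NODE=40
NTASKS=80
NODES=$(( (NTASKS + CPUS_PER_NODE - 1) / CPUS_PER_NODE ))
echo "$NODES"   # 2
```

If you ask mpirun for more ranks than the slots Slurm granted you, that is when the "not enough slots" error appears.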


About bitsanddragons

A traveller, an IT professional and a casual writer
This entry was posted in bits, centos, linux, slurm.

