Wandering

The echelon was not what brought me to the cave. I turned down the volume of my glasses but kept them on, in case some of the meta-info floating around turned out to be really helpful. The place was relatively dark, so the power to call up some holograms could help me find the way out. Where the hell was the Golden Rabbit? I had obviously asked my glasses before, and they answered what you expect when you ask for something not registered, or registered but systematically and carefully removed. I decided to try the opposite wall. There was a kind of roundabout around the echelon, surrounded by a small natural wall (a crater wall?) with exits to my left and to my right, as pointed out by my glasses. My sense of orientation was telling me that to the right, more or less, lay the Lower Lands, where it was easy to get something to drink, no matter what you were looking for. I was thirsty after the walk from the centre, so I chose that direction. Besides, it was logical to think that, if the exit of the cave did connect with the Lower Lands, the chances of finding a pub were higher in that direction than to my left. And in a pub, more people could maybe give me directions. On the other hand, drunk people mean problems, here and everywhere.

I sighed, looking again at the echelon, and started walking down the path.

GPU fever

I have 10 NVIDIA GeForce GTX 1080 8GB cards that I need to distribute around my servers. The more we can fit in one machine, the better. The problem I have now is weird: for some reason, the motherboard of some of my DELL servers is not detecting a fully plugged-in (8-pin connector and all) GPU. This is not at the level of the OS, since I don't see any illuminated logo or rotating fans once the GPU is plugged in. Hell, it doesn't even show up in lspci. Some tuning is needed.
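
For reference, this is the kind of basic check I mean, run on the server itself (a minimal sketch, before worrying about any NVIDIA driver at all):

# list every VGA / 3D controller the kernel can see on the PCI bus
lspci | grep -iE 'vga|3d|nvidia'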

We have an “old” version of the servers to play with. I start by installing a clean desktop version of my OS of choice, CentOS 7, with no GPU inside.
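
Before blaming the OS at all, it can also help to ask the DMI tables whether the firmware even considers the slot populated. Something like this, run as root (the slot names differ between DELL models, so take it as a sketch):

# show the physical PCIe slots and whether the firmware sees them as in use
dmidecode -t slot | grep -E 'Designation|Current Usage'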

The island

The bamboo bridge looked so fragile that I was afraid to step onto it. My boss, a small Japanese man, was walking confidently, with soft bare steps, across his bridge to the right, without looking back. After a moment of hesitation I did the same. The path to the first island was a hundred meters long. Because of the hot fumes and the vegetation close to the shore, I could not tell whether my destination was populated or not. Its size was not clear either. Beyond it, I could make out another island, with one or two palm trees. I looked at the dark water under the bridge. Something was creating waves. Were there carp? On a previous visit to a Japanese garden I had seen the size those animals can reach, and the memory made me feel uneasy about them mixing with the visitors of the facility. But maybe I was being naive. Maybe there was some kind of fence underwater that kept the fish out of the bathing area.

Suddenly I was there. I hesitated. The sand of the shore was slightly warm, as if heated from below. The plants on both sides, some kind of leafy bamboo, were perfectly cut to the height of my waist. Should I get rid of the yukata? There was nobody around to ask, so I voted against it. Doubtful, I crossed the vegetation to reach the heart of the island and looked around. She was in a little clearing to my left, another fifty meters away or so, lying naked.

Connectivity problems

Sometimes life sucks. I went over my mobile data plan, so I'm no longer able to update from the phone. In addition, our internet provider (or is it our router?) is starting to fail like hell. I think it needs a replacement (the internet provider). Unfortunately, these things are easy to say but not easy to achieve when you live in another country. Sometimes they are not easy to achieve even if you live in your own country.

Long story short (I love the expression!), I will probably not write non-IT entries for a week or so, unless my job queue goes drastically down. And that is never going to happen. Let me know if this pisses you off a lot, random reader.

EDIT: It looks like I found a moment of peace each day to write a little non-IT stuff. We'll see how long this situation lasts.

Slurm and relion usage

So everything in our cluster with the SLURM queuing system seems to be up and running. What's the user experience like? In our job script, we need to take care of a few VIP parameters:

#SBATCH --nodes=2
#SBATCH --ntasks=40
#SBATCH --ntasks-per-node=40
#SBATCH --cpus-per-task=2
#SBATCH --overcommit

The parameters are self-explanatory. Maybe the only one that needs some explanation is cpus-per-task. Setting it to 2 lets each task run two threads, allowing it to reach 200% CPU. Our relion command line looks something like this:

mpirun -np 80 relion_refine_mpi --o Refine3D/run_parallel_nice \
 --auto_refine --split_random_halves --i particles.star \
 --particle_diameter 350 --angpix 1.77 \
 --ref Refine3D/run-03-08-14_class001.mrc \
 --ini_high 60 --ctf --ctf_corrected_ref --flatten_solvent \
 --zero_mask --oversampling 1 --healpix_order 2 \
 --auto_local_healpix_order 4 --offset_range 5 \
 --offset_step 2 --sym C1 --low_resol_join_halves 40 \
 --norm --scale --j 2
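
Putting the header and the command together, a complete submission script could look more or less like this. Take it as a sketch: the job name, the file name and the module line are placeholders for whatever your site uses, and the relion line is the same one as above.

#!/bin/bash
# job name is made up; the resource request matches the discussion above
#SBATCH --job-name=refine3d
#SBATCH --nodes=2
#SBATCH --ntasks=80
#SBATCH --ntasks-per-node=40
#SBATCH --cpus-per-task=2
#SBATCH --overcommit

# placeholder: load whatever provides relion and MPI on your cluster
module load relion

mpirun -np 80 relion_refine_mpi --o Refine3D/run_parallel_nice \
 --auto_refine --split_random_halves --i particles.star \
 --particle_diameter 350 --angpix 1.77 \
 --ref Refine3D/run-03-08-14_class001.mrc \
 --ini_high 60 --ctf --ctf_corrected_ref --flatten_solvent \
 --zero_mask --oversampling 1 --healpix_order 2 \
 --auto_local_healpix_order 4 --offset_range 5 \
 --offset_step 2 --sym C1 --low_resol_join_halves 40 \
 --norm --scale --j 2

Then it's just a matter of sbatch refine3d.sh (again, a made-up file name) and keeping an eye on squeue.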

We have 8 nodes, with 40 CPUs each. If we want to launch 80 processes, we need to use at least 2 nodes. We can write it this way

#SBATCH --nodes=2
#SBATCH --ntasks=80

or this way

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40

What's best? Unfortunately, in this case it is not so important. But if you ask for 2 nodes, their state goes to allocated or mixed depending on whether you set

#SBATCH --overcommit

So here we have a sample sinfo -lN output:

Mon Aug 29 16:48:33 2016
NODELIST  NODES  PARTITION  STATE      CPUS  S:C:T   MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
node101   1      debug*     allocated  40    2:20:1  1       0         1       (null)    none
node102   1      debug*     allocated  40    2:20:1  1       0         1       (null)    none
node103   1      debug*     mixed      40    2:20:1  1       0         1       (null)    none
node104   1      debug*     mixed      40    2:20:1  1       0         1       (null)    none
node105   1      debug*     idle       40    2:20:1  1       0         1       (null)    none
node106   1      debug*     idle       40    2:20:1  1       0         1       (null)    none
node107   1      debug*     mixed      40    2:20:1  1       0         1       (null)    none
node108   1      debug*     mixed      40    2:20:1  1       0         1       (null)    none

And the corresponding squeue:

JOBID  PARTITION  NAME      USER     ST  TIME  NODES  NODELIST(REASON)
70     debug      mpirun.0  pepito   R   6:27  4      sbnode[103-104,107-108]
71     debug      Class2D   juanito  R   2:01  2      sbnode[101-102]

So in this example, user pepito runs on 4 nodes, with overcommitment, while user juanito just asked for 80 processes in total. Allocation of resources is crucial: if you don't do the math, you will end up with an MPI "not enough slots available" error.
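
The math itself is nothing fancy. For the numbers above it looks roughly like this (how the slots are counted depends on how your MPI is wired to slurm, so take it as a sketch):

# slots handed to the MPI launcher ~ nodes x tasks per node = 2 x 40 = 80
# mpirun -np 80   -> fits
# mpirun -np 100  -> "There are not enough slots available in the system ..."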

The echelon

I put my glasses down again. "This is the echelon," the voice said. "It was found by the first explorers of the caves, more than a century ago. Sorry, you can't come closer, we don't know yet if it's safe." I walked back along the path. And they would never know just by looking at it passively. "Some people claim it's an alien artifact, but the official definition is 'crystalline meteoric object of unknown origin'. As you can appreciate, it looks as if half of it is still under the cave floor, while the upper tip just touches the ceiling. That's why, from this distance, it resembles a black, conical tower. What astonished the first explorers most, apart from the odd resemblance, is the lack of any entrance hole, as if the whole thing (an estimated 2 km in length) had been teleported there." I looked up at the ceiling. The cusp of the object was well illuminated by powerful lights. The top of our chamber indeed appeared not wounded at all. "So how did it arrive here? The accepted explanation is simpler than you might think: it was there when the cave was formed. Unfortunately we haven't managed to confirm this theory, since the usual dating techniques refuse to work on the echelon."

Slurm node state control

Here’s the thing. You reboot the cluster and everything looks fine, but your nodes are DOWN.  sinfo shows something like this:

root@beta ~ ## > sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 idle bbeta
debug* up infinite 4 down node[105-108]
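
Before touching anything, it's worth asking slurm why it thinks those nodes are down; sinfo -R lists the down and drained nodes together with the recorded reason:

root@beta ~ ## > sinfo -R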

Restarting the controller and the daemons

root@beta ~ ## > systemctl restart slurmctld.service 
root@node108 ~ ## > systemctl restart slurmd.service

doesn't help. The logs also look fine:

root@node108 ~ ## > tail -50 /var/log/SlurmdLogFile.log
[date] Resource spec: some message
[date] slurmd version 16.05.4 started
[date] slurmd started on 'date'
[date] CPUs=40 Boards=1 Sockets=40 Cores=1 Threads=1 Memory=773610 TmpDisk=111509 Uptime=771 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
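
The node side looks healthy, so I also check what the controller thinks about it, something along these lines:

root@beta ~ ## > scontrol show node node108 | grep -iE 'state|reason'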

So what is it? It turns out to be simpler than expected. I found the solution on the slurm developers Google forum. I quote:

The cold-start of slurm means the control daemon has no way to
distinguish a down node from one that is just powered off. You can use
the scontrol command to change node states if desired.
Note that setting a node state to power_down changes the state in
the internal tables, but does not execute the power down script.
scontrol update nodename="cn[01-70]" state=resume
scontrol update nodename="cn[01-70]" state=power_down

So by typing (once per down node, or using the range syntax from the quote above):

scontrol update nodename=node107 state=IDLE

I magically have all my nodes back :-). We can also tune how they behave. If we want them to accept multiple jobs (good for big nodes) we need to configure SelectType=select/cons_res. You can also have exclusivity by selecting SelectType=select/linear. Found here. So the relevant piece of our slurm.conf file looks like this:

# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
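
After editing slurm.conf (it has to be identical everywhere) I restart slurmctld and slurmd as above. A quick way to confirm which plugin the controller actually picked up:

root@beta ~ ## > scontrol show config | grep -i select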

I also want to list the possible node errors (visible in the slurmd log file). I will name this one the missing plugin error:

[date :reference] error: we don't have select plugin type 101
[date :reference] error: select_g_select_jobinfo_unpack: unpack error
[date :reference] error: Malformed RPC of type REQUEST_TERMINATE_JOB(6011) received
[date :reference] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[date :reference] error: service_connection: slurm_receive_msg: Header lengths are longer than data received

This is caused by an outdated slurm.conf on the node. Copying it from the login node and restarting slurmd solves the problem. We also have the missing uid error:

[date:reference] Launching batch job 27 for UID 3201
[date:reference] error: _send_slurmstepd_init: getpwuid_r: No error
[date:reference] error: Unable to init slurmstepd
[date:reference] error: uid 3201 not found on system
[date:reference] error: batch_stepd_step_rec_create() failed: User not found on host
[date:reference] error: _step_setup: no job returned
[date:reference] error: Unable to send "fail" to slurmd
[date:reference] done with job

The error is clear here. Check that the sssd daemon is running and that the keytabs are right and installed, and the problem is gone.
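
To double-check that on the affected node, something like this is enough (the UID being the one from the log above):

root@node108 ~ ## > systemctl status sssd
root@node108 ~ ## > getent passwd 3201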