Here’s the thing. You reboot the cluster and everything looks fine, but your nodes are DOWN. sinfo shows something like this:
root@beta ~ ## > sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1   idle bbeta
debug*       up   infinite      4   down node[105-108]
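Before restarting anything, it is worth asking Slurm why it marked the nodes down. Something like this (using the node names from the sinfo output above) prints the recorded reason and the full node record:

root@beta ~ ## > sinfo -R                     # lists down/drained nodes together with their Reason field
root@beta ~ ## > scontrol show node node105   # full record of one node, including State= and Reason=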
Restarting the controller and the daemons
root@beta ~ ## > systemctl restart slurmctld.service
root@node108 ~ ## > systemctl restart slurmd.service
doesn’t help. The logs look fine too:
root@node108 ~ ## > tail -50 /var/log/SlurmdLogFile.log
[date] Resource spec: some message
[date] slurmd version 16.05.4 started
[date] slurmd started on 'date'
[date] CPUs=40 Boards=1 Sockets=40 Cores=1 Threads=1 Memory=773610 TmpDisk=111509 Uptime=771 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
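As an extra sanity check (not strictly needed when the log is clean), slurmd can print the hardware it actually detects, so you can compare that CPUs/Memory line with the node's definition in slurm.conf:

root@node108 ~ ## > slurmd -C   # prints the node's detected configuration in slurm.conf format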
So what is it? It turns out to be simpler than expected. I found the solution on the Slurm developers Google group. I quote:
The cold-start of slurm means the control daemon has no way to
distinguish a down node from one that is just powered off. You can use the scontrol command to change node states if desired.
Note that setting a node state to power_down changes the state in
the internal tables, but does not execute the power down script.
scontrol update nodename="cn[01-70]" state=resume
scontrol update nodename="cn[01-70]" state=power_down
So by typing:
scontrol update nodename=node107 state=IDLE
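Since all four nodes are in the same state, the command also accepts a hostlist range, so one call (with the node names from the sinfo output above) brings them all back at once:

scontrol update nodename=node[105-108] state=resume   # or state=idle, as above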
I magically have all my nodes back :-). We can also tune how they behave. If we want nodes to accept multiple jobs at once (good for big nodes), we need to configure SelectType=select/cons_res; if we want one job per node (exclusive access), we select SelectType=select/linear instead. Found here. So the relevant piece of our slurm.conf file looks like:
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
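After editing the file, it can be useful to check which plugin the running daemons actually picked up; note that changing SelectType normally needs a restart of slurmctld and the slurmds, not just a reconfigure:

root@beta ~ ## > scontrol show config | grep -i select   # the SelectType/SelectTypeParameters lines should match slurm.conf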
I also want to list the possible node errors (visible in the slurmd log file). I will call this one the missing plugin error:
[date :reference] error: we don't have select plugin type 101
[date :reference] error: select_g_select_jobinfo_unpack: unpack error
[date :reference] error: Malformed RPC of type REQUEST_TERMINATE_JOB(6011) received
[date :reference] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[date :reference] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
This is caused by an outdated slurm.conf on the compute node. Copying it over from the login node and restarting slurmd solves the problem; a minimal sketch is below.
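Assuming the config lives in the usual /etc/slurm/slurm.conf (the path differs between distributions) and that node108 is the complaining node:

root@beta ~ ## > scp /etc/slurm/slurm.conf node108:/etc/slurm/slurm.conf
root@node108 ~ ## > systemctl restart slurmd.service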
We also have the missing uid error:
[date:reference] Launching batch job 27 for UID 3201
[date:reference] error: _send_slurmstepd_init: getpwuid_r: No error
[date:reference] error: Unable to init slurmstepd
[date:reference] error: uid 3201 not found on system
[date:reference] error: batch_stepd_step_rec_create() failed: User not found on host
[date:reference] error: _step_setup: no job returned
[date:reference] error: Unable to send "fail" to slurmd
[date:reference] done with job
The error is clear here. Check that the sssd daemon is running and that the keytabs are correct and installed, and the problem is gone.
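Two quick checks along those lines, run on the node that printed the error (3201 is the uid from the log above):

root@node108 ~ ## > systemctl status sssd.service   # sssd must be running for directory users to resolve
root@node108 ~ ## > getent passwd 3201              # should return the job owner's entry; if it prints nothing, the node cannot see that uid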