Slurm node state control

Here’s the thing. You reboot the cluster and everything looks fine, but your nodes are DOWN. sinfo shows something like this:

root@beta ~ ## > sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 idle bbeta
debug* up infinite 4 down node[105-108]

Restarting the controller and the node daemons

root@beta ~ ## > systemctl restart slurmctld.service 
root@node108 ~ ## > systemctl restart slurmd.service

doesn’t help. The logs also look fine:

root@node108 ~ ## > tail -50 /var/log/SlurmdLogFile.log
[date] Resource spec: some message
[date] slurmd version 16.05.4 started
[date] slurmd started on 'date'
[date] CPUs=40 Boards=1 Sockets=40 Cores=1 Threads=1 Memory=773610 TmpDisk=111509 Uptime=771 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
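Before touching anything, it can also help to ask Slurm why it marked the nodes down. This is just a suggested check, not something from the logs above; both commands are standard Slurm:

sinfo -R                                   # list down/drained nodes together with the Reason field
scontrol show node node108                 # full node record, including Reason when one is set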

So what is it? The answer turns out to be simpler than expected. I found the solution on the Slurm developer Google forum. I quote:

The cold-start of slurm means the control daemon has no way to
distinguish a down node from one that is just powered off. You can
use the scontrol command to change node states if desired.
Note that setting a node state to power_down changes the state in
the internal tables, but does not execute the power down script.
scontrol update nodename="cn[01-70]" state=resume
scontrol update nodename="cn[01-70]" state=power_down

So by typing:

scontrol update nodename=node107 state=IDLE
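or, using the same hostlist syntax as the quoted post, bringing back the whole range from the sinfo output above in one go:

scontrol update nodename=node[105-108] state=resume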

I magically have all my nodes back :-). We can also tune how they behave. If we want them to accept multiple jobs at once (good for big nodes), we configure SelectType=select/cons_res. You can also enforce exclusive node allocation by selecting SelectType=select/linear. So the relevant piece of our slurm.conf looks like:

# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
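After changing slurm.conf the daemons have to pick it up. A minimal sketch, assuming the file is already identical on the controller and on every node:

systemctl restart slurmctld.service        # on the controller
systemctl restart slurmd.service           # on each compute node
scontrol show config | grep -i select      # verify what the controller actually loaded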

I also want to list some possible node errors (visible in the slurmd log file). I will call this one the missing plugin error:

[date :reference] error: we don't have select plugin type 101
[date :reference] error: select_g_select_jobinfo_unpack: unpack error
[date :reference] error: Malformed RPC of type REQUEST_TERMINATE_JOB(6011) received
[date :reference] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[date :reference] error: service_connection: slurm_receive_msg: Header lengths are longer than data received

This is caused by an outdated slurm.conf on the node. Copying it over from the login node and restarting slurmd solves the problem, as in the sketch below.
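A minimal sketch, assuming the login node is beta and the config lives under /etc/slurm (adjust both to your setup):

scp beta:/etc/slurm/slurm.conf /etc/slurm/slurm.conf   # pull the current config from the login node
systemctl restart slurmd.service                       # make the node daemon reload it

We also have the missing uid error: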

[date:reference] Launching batch job 27 for UID 3201
[date:reference] error: _send_slurmstepd_init: getpwuid_r: No error
[date:reference] error: Unable to init slurmstepd
[date:reference] error: uid 3201 not found on system
[date:reference] error: batch_stepd_step_rec_create() failed: User not found on host
[date:reference] error: _step_setup: no job returned
[date:reference] error: Unable to send "fail" to slurmd
[date:reference] done with job

The error is clear here: the node cannot resolve the user. Check that the sssd daemon is running and that the keytabs are correct and installed, and the problem is gone.
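A quick way to check on the node, assuming sssd is what provides the user accounts (uid 3201 is the one from the log above):

systemctl status sssd.service   # the identity daemon should be active
getent passwd 3201              # should print the user; empty output means the node cannot resolve the uid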
