Adding nodes to slurm

If you have ever asked yourself whether your running jobs are stopped when you add a new node to SLURM, I have the answer: NO, your jobs are not. This is how I did it.

  1. Be sure that the new nodes (here node109 and node110) have the same IDs for the munge and slurm users (check with id munge and id slurm)
  2. Be sure you can munge and remunge from the login node to the new nodes (simply do munge -n | ssh node110 unmunge)
  3. Add the nodes to the slurm.conf on the login node (see the example slurm.conf lines after this list)
  4. Copy the updated slurm.conf everywhere (a for i in list_of_nodes loop will do it; see the shell sketch after this list)
  5. Restart the slurm daemon everywhere (for i in list_of_nodes; do ssh $i systemctl restart slurmd.service; done)
  6. Restart the slurm controller daemon on the login node (systemctl restart slurmctld.service)
  7. Check that the nodes are visible to a user (sinfo -lN) and that you can submit a job to them
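
For step 3, this is roughly what the new lines in slurm.conf could look like. A minimal sketch: the CPU counts, memory figures and partition name below are placeholders for my setup, not something to copy verbatim.

```
# slurm.conf on the login node -- hypothetical hardware values, adjust to your nodes
NodeName=node109 CPUs=16 RealMemory=64000 State=UNKNOWN
NodeName=node110 CPUs=16 RealMemory=64000 State=UNKNOWN
PartitionName=batch Nodes=node[101-110] Default=YES MaxTime=INFINITE State=UP
```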
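
And steps 4 to 7 as a shell sketch. The node list and the /etc/slurm/slurm.conf path are assumptions (adjust them to your cluster); the commands themselves are the ones from the list above.

```bash
# Hypothetical node list -- replace with your real list_of_nodes
NODES="node101 node102 node109 node110"

# Step 4: copy the updated slurm.conf to every node
for i in $NODES; do scp /etc/slurm/slurm.conf $i:/etc/slurm/slurm.conf; done

# Step 5: restart slurmd everywhere
for i in $NODES; do ssh $i systemctl restart slurmd.service; done

# Step 6: restart the controller on the login node
systemctl restart slurmctld.service

# Step 7: the new nodes should now show up, and accept a test job
sinfo -lN
srun -w node110 hostname
```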

Happy slurming !!! (???)


4 Responses to Adding nodes to slurm

  1. Rolly Ng says:

    Hi, thank you for the tips on SLURM; I found them very helpful. I would like to know: if I have SLURM jobs running on the cluster, is it safe to restart the slurmd.service on the compute nodes and the slurmctld.service on the login node? Am I going to lose the running jobs? Thank you.


    • My experience is that you don’t lose the running jobs. After the restart, the jobs will reappear in the queue. They both seem to be pretty solid daemons. We do it all the time to update the node list… The procedure does, I think, come with a little risk anyway… maybe it depends on what you run. For example, relion sometimes, for some jobs, does not come back after the restart of the slurm daemons and the controller. Unfortunately I was not able to reproduce the behaviour in a controlled environment… so I can’t tell you whether it was a relion design flaw or a slurm one. Let me know how it goes 🙂


      • Rolly Ng says:

        Hi,
        It is so wonderful to have your prompt reply. We are running Quantum Espresso v5.3.0 on the cluster. I have an issue when I restart slurmd.service on a compute node running pw.x. It may be due to /dev/nvidia0 not being found. I need to run nvidia-smi on the compute node to make the /dev/nvidiaX devices show up; maybe I missed some auto-loading script for the nvidia driver. So the job crashed… But restarting slurmctld.service appears safe. Thank you!
        Besides… may I have your advice on how to auto-load the nvidia driver when booting the compute node? We are running CentOS 7.1. Thanks again


  2. I’m sorry it didn’t work, and thanks for the tip on the problems with the gpu drivers.

    Actually I have had no problem so far with the nvidia drivers; they come back smoothly after a reboot, and also after a slurm daemon restart. It is true that we always try to have the latest version of the drivers and a “homogeneous” system (all GPUs of the same type), and I install them using the NVIDIA-provided executable. It’s called something like *NVIDIA-Linux-x86_64-384.98.run*… you need to run it, and it will compile and install the drivers for you, so they will be loaded on boot. The test that everything is OK, for me, is that I can run *nvidia-smi* after a reboot and see all the cards.
    Please try updating the drivers this way the next time you need to 🙂
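
    In case it helps, this is more or less how that looks in practice. A minimal sketch, assuming you have already downloaded the installer to the compute node; the version in the filename is just the one mentioned above, use whatever is current for your cards.

    ```bash
    # Install the NVIDIA driver with the official .run installer; it builds and installs
    # the kernel module, so the driver is available again after every boot.
    chmod +x NVIDIA-Linux-x86_64-384.98.run
    sudo ./NVIDIA-Linux-x86_64-384.98.run

    # After a reboot, check that all the cards are visible
    # and that the /dev/nvidiaX device files exist.
    nvidia-smi
    ls /dev/nvidia*
    ```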

