Slurm seems to be a hot topic, and a lot of people arrive here to check it out, so I feel obliged to post about it a little bit more. Beginners can save themselves some searching and start with my first post, Slurm on CentOS 7. Now that you know how to install it, and you may even have a basic partition system working, let's discuss some aspects of the slurm.conf file. From now on, please remember to format the file properly, since I'm just showing you my notes, not working solutions.
The hard core of Slurm is resource management. Our resources are the nodes. Let's say we want to add 3 nodes with different features but consecutive names. The obvious approach is to list them, like this:
# COMPUTE NODES
NodeName=node101 NodeAddr=192.168.1.1 CPUs=20 State=UNKNOWN
NodeName=node102 NodeAddr=192.168.1.2 CPUs=40 State=UNKNOWN
NodeName=node103 NodeAddr=192.168.1.3 CPUs=30 State=UNKNOWN
# PARTITION NAMES
PartitionName=test Nodes=node101,node102,node103 Default=YES MaxTime=INFINITE State=UP
We can also declare them in a compact way.
# COMPUTE NODES
NodeName=node[101-103] NodeAddr=192.168.1.[1-3] State=UNKNOWN
# PARTITION NAMES
PartitionName=debug Nodes=node1[01-03] MaxTime=INFINITE State=UP
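By the way, if you are not sure that a compact hostlist expands to the node names you expect, scontrol can do the expansion for you. A small sketch:

```shell
# Expand a compact hostlist into individual node names
# (prints node101, node102 and node103, one per line)
scontrol show hostnames node[101-103]

# And once the controller is running, check the partition sees them
sinfo -p test
```

Both notations from the snippets above (node[101-103] and node1[01-03]) expand to the same three names.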
Both ways leave Slurm to manage the node resources more or less freely. This ends up in a node-locking configuration: if you submit a job, in principle it will go to one node and run there. The node will then be marked as allocated, and no more jobs will be allowed to go to that machine, even if there are free resources available! So what to do instead? I suppose you have some knowledge of your hardware, so the answer is to declare the resources per node. You can still use the compact notation. Let's say we now want to add two more nodes, nodea and nodeb, with the same hardware, which includes two GPUs per server. Our slurm.conf configuration should look very similar to this:
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
DefMemPerCPU=8192
GresTypes=gpu
# COMPUTE NODES
NodeName=nodea,nodeb RealMemory=515000 Sockets=4 CoresPerSocket=8 ThreadsPerCore=1 Feature=gpu,opteron Weight=10 State=UNKNOWN Gres=gpu:TXP:2
# PARTITION NAMES
PartitionName=ALL Nodes=nodea,nodeb,node1[01-03] Default=YES MaxTime=INFINITE State=UP
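If you are not sure about the exact Sockets, CoresPerSocket, ThreadsPerCore or RealMemory values of a machine, slurmd itself can tell you; run this on the compute node and copy the output into slurm.conf:

```shell
# Print this node's detected hardware as a ready-made
# slurm.conf node definition line
slurmd -C
```

This avoids typos in the hardware declaration, which would otherwise leave the node drained with a "low socket*core*thread count" style error.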
What is this gres thing? In order to declare and share GPU resources, we need to add a file to the Slurm configuration in /etc/slurm/. This file is gres.conf, and it looks like this:
Name=gpu Type=TXP File=/dev/nvidia0
Name=gpu Type=TXP File=/dev/nvidia1
As you see, what you declare on the slurm.conf node definition is the "type" you have in gres.conf. In our example, TXP = Titan XP. You can name it the way you like. After making sure the new slurm.conf is on all 5 nodes, and the gres.conf on the two nodes with GPUs, we restart all the daemons and the slurm controller.
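In case you wonder about the restart itself, on a systemd-based system like CentOS 7 it could go like this (assuming the usual service names slurmd and slurmctld):

```shell
# On each of the 5 compute nodes: restart the node daemon
systemctl restart slurmd

# On the controller: restart the control daemon
systemctl restart slurmctld

# Then check that all nodes came back with the declared resources
sinfo -N -l
```

If a node shows up as drained after the restart, check that its slurm.conf hardware line matches reality.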
Now we have a clear resource sharing, since slurm doesn’t have to figure out how to give you the resources. The problem is, if you launch a job, you need to specifically ask for the resources you want. For example, like this:
srun -n 40 --gres=gpu:TXP:2 --constraint=gpu --pty bash -l
The above command will give you a bash login on nodea and nodeb (the ones with Feature=gpu, which we put as a constraint), where you will have 40 cores and 2 GPUs. If another job is launched in this way, it should go to the pooled resources that are free at that moment, that is, the free CPUs out of the 64 in total (4 sockets x 8 cores x 2 nodes = 64, therefore only 24 left) and the free GPUs. Note that, as pointed out in my Slurm user cheatsheet, you can also override the DefMemPerCPU from slurm.conf. In the example above, it is set to 8 GB of RAM per CPU.
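The same request works from a batch script as well. A minimal sketch (the job name and the nvidia-smi call are just placeholders for your own workload):

```shell
#!/bin/bash
#SBATCH --job-name=gpu-test     # made-up name, pick your own
#SBATCH -n 40                   # 40 tasks, as in the srun example above
#SBATCH --gres=gpu:TXP:2        # 2 GPUs of type TXP (per node)
#SBATCH --constraint=gpu        # only nodes carrying Feature=gpu

# Your actual workload goes here
nvidia-smi
```

Submit it with sbatch and the scheduler will place it on the gpu-featured nodes, just like the interactive srun did. Keep in mind that --gres counts per node, so a job spanning both GPU nodes gets 2 GPUs on each.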
What else do I need to say about Slurm now? I'll tell you: how about making a safer install, where two servers run the controller? Yes, that would be a good one. And how about the job accounting features? It would be nice to know how to manage that also. So see you soon!