If you have several Linux clients and no way of monitoring what's going on, munin is your friend. I have a couple of non-standard munin plugins that I've scraped together, so I was feeling brave enough to write a plugin myself. There are several blogs that tell you how to do that. Here you have the entry for python munin scripts. And here is the official one for python-munin. There's also an official one for a shell munin plugin. This one shows how to write a bash munin script, and so does this one, with a better example that I will take as a template. Do not forget there's an official testing procedure as well, with a corresponding troubleshooting guide that is a bit short for my taste. I also found this link on monitoring mongodb with munin interesting, as well as this article from the Linux magazine.
Let's not digress. What I want to do is monitor the gpu usage across my cluster from the login node. For this I need to prepare two scripts beforehand: one to read the local gpu activity, another one to do the math over ssh. This is what they look like:
#!/bin/bash
# gpu_total
# explanation: get the lines with %
# print the fields corresponding to current use
# format them, add them and print the sum
nvidia-smi | grep "%" | awk '{print $13}' | tr -d '\n' \
  | awk ' BEGIN { FS = "%"}; {for(i=1;i<=NF ;i++){sum+=$i}}; {print sum} '
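The field number ($13) depends on how nvidia-smi lays out its table, which may differ between driver versions. As a hedged alternative, nvidia-smi also has a query mode that prints one utilization number per GPU. The sketch below assumes that mode works on your driver; sum_lines is a helper name made up for this example.

```bash
#!/bin/bash
# gpu_total, alternative sketch (not the script used in this post)

# sum_lines is a hypothetical helper: add up one number per input line
sum_lines() {
    awk '{ sum += $1 } END { print sum + 0 }'
}

# Only query the GPUs when nvidia-smi is actually installed on this node
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | sum_lines
fi
```

With nounits the output has no % sign, so there is nothing to strip before adding the numbers up.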
The script above (gpu_total) assumes you can run nvidia-smi on the node. If that is not the case, maybe you can cook up something similar. Now, on the login node, I ssh to each gpu node and run the script there. Something like this:
#!/bin/bash
# gpu_sum
## do ssh to a client, perform there gpu_total
sum=0
for i in a b c d e ; do
    local=`ssh $i gpu_total`
    sum=`expr $sum + $local`
done
echo $sum
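If one node is down, expr receives an empty value and the whole sum breaks. A slightly more defensive variant, as a sketch only: it takes the host names as arguments and skips any host whose answer is not a plain integer. The fetch wrapper and the timeout value are assumptions made for this example.

```bash
#!/bin/bash
# gpu_sum, defensive sketch: hosts are passed as arguments

# fetch is a hypothetical wrapper so the ssh call lives in one place
fetch() {
    # BatchMode makes ssh fail instead of waiting for a password prompt
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" gpu_total
}

gpu_sum() {
    sum=0
    for host in "$@"; do
        value=$(fetch "$host" 2>/dev/null)
        case $value in
            ''|*[!0-9]*) continue ;;   # no answer or not an integer: skip this host
        esac
        sum=$((sum + value))
    done
    echo "$sum"
}

# Run only when hosts are given on the command line
if [ "$#" -gt 0 ]; then
    gpu_sum "$@"
fi
```

Saved as gpu_sum, it would be called as gpu_sum a b c d e. Skipping a dead node makes the total too low rather than wrong in an obvious way, so you may prefer to log the skipped hosts as well.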
My GPU clients are named a b c d e here. Yours will be called something else, so do not forget to change that. I test the scripts and they work on my network, giving me one number at the end, like this: 12345. Now, how do I hook this info up to munin? This is my munin script, very similar to the second bash example I gave:
#! /bin/bash
case $1 in
    config)
        cat << EOF
graph_category Slurm
graph_title GPU usage cumulative
graph_vlabel GPU percent
gpuuse.label GPU usage
EOF
        exit 0
esac
echo "gpuuse.value" `read_gpu_total`
Be aware of the formatting! If you don’t format it right, the configuration will not be loaded. If you want to know about munin data types, here’s the link.
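The value line needs the same care: if read_gpu_total prints nothing, the plugin emits a bare "gpuuse.value" with no number. A small check can catch that before munin does. This is a sketch under my own assumptions: check_values is a made-up helper, and the pattern it uses (a field name of letters, digits and underscores, followed by a plain number or U) is deliberately stricter than what munin itself may accept.

```bash
# check_values is a hypothetical helper: it reads plugin output on stdin and
# succeeds only if every line looks like "<field>.value <number or U>"
check_values() {
    awk '
        $0 !~ /^[A-Za-z_][A-Za-z0-9_]*\.value (U|-?[0-9]+(\.[0-9]+)?)$/ { bad = 1 }
        END { exit bad + 0 }
    '
}
```

On the node you could pipe the plugin through it, for example munin-run slurm_gpus | check_values && echo ok, where slurm_gpus is the name I gave my plugin.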
Now the two-days-of-work question: what does read_gpu_total give? If you try to run the gpu_sum script directly, you will get an ssh error, since munin, in my case, does not have the rights to do passwordless ssh to all our GPU servers. Also, the ssh and the calculation take around 30 seconds in my case (I have a lot of gpus), so I will run a cron job that writes the number into a file, let's say /tmp/gpus. Now I want the plugin to read the file and draw the output. We read it like this:
cat /tmp/gpus | tail -1 | bc
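A somewhat more defensive reader, as a sketch: it checks that the file is readable by whoever runs the plugin, and prints U (munin's marker for an unknown value) when the file is missing or does not contain an integer. The CACHE variable and the wording of the message are assumptions made for this example.

```bash
# read_gpu_total, defensive sketch
# CACHE is an assumed variable; it defaults to the file used in this post
CACHE=${CACHE:-/tmp/gpus}

read_gpu_total() {
    if [ -r "$1" ]; then
        value=$(tail -n 1 "$1")
    else
        # stderr ends up in the munin-node log, so say who could not read what
        echo "cannot read $1 as user $(id -un)" >&2
        value=""
    fi
    case $value in
        ''|*[!0-9]*) echo U ;;          # missing or malformed: report "unknown"
        *)           echo "$value" ;;
    esac
}

echo "gpuuse.value $(read_gpu_total "$CACHE")"
```

A gap in the graph plus a line in the log naming the user is easier to debug than a plugin that prints no number at all.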
Note that you could put that cat pipeline directly in the plugin script. I load the script and wait 5 minutes, only to get the error in the title. It looks like I was not the first one with the issue, and here you have another report on sourceforge. None of them shed any light for me. What is it? How do I fix it? If you check the node logs, the error is not very clear:
## > tail -50 /var/log/munin-node/munin-node.log
some-date-here [10535] Error output from slurm_gpus:
some-date-here [10535] cat: /tmp/gpus.txt: No such file or directory
Of course the file is there! I can check its content with more, and I can run the command on my root prompt. What the hell is going on? I will tell you. Munin cannot read it! Change the ownership of the output file so that munin can read it. In my case,
-rw-r--r-- 1 nobody munin size Month Day HH:MM gpus
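To keep the file in that state every time cron rewrites it, the writing side can set the mode explicitly. A minimal sketch, assuming the cron job has the number in hand; write_cache is a made-up helper name.

```bash
# write_cache is a hypothetical helper for the cron job:
#   write_cache VALUE FILE
write_cache() {
    tmp=$(mktemp "$2.XXXXXX") || return 1
    echo "$1" > "$tmp"
    chmod 644 "$tmp"      # mktemp creates the file as 600; make it world-readable
    mv "$tmp" "$2"        # rename in the same directory, so readers never see a half-written file
}

# In the cron job this could be called as, for example:
#   write_cache "$(gpu_sum)" /tmp/gpus
```

Writing to a temporary file first also means the plugin never reads an empty file while the 30 seconds of ssh calls are still running.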
Finally, I can tell the boss we don't need more GPUs, since on average we are below 50% of total occupancy. Thanks munin for that, less work installing cards 😀
Bonus track: list of munin open issues.
Stumbled across this old post and have been writing some munin cmd:/// plugins.
The first 3 scripts above could be replaced with something like the one below, and there'd be no extra software to distribute to the nodes. It combines the two summing steps, (grep|awk|tr|awk) and then (bash), into a single awk, and runs the ssh calls in parallel, so it might even be fast enough to be interesting. The only bashism is [[ ]], which is easily removed.
#! /bin/bash
HOSTS="a b c d e"
if [[ "$1" = "config" ]]
then
    echo graph_category Slurm
    echo graph_title GPU usage cumulative
    echo graph_vlabel GPU percent
    echo gpuuse.label GPU usage
else
    # read all GPU hosts' info in parallel, as long as
    # number of hosts << # file descriptors
    for i in $HOSTS
    do
        ssh $i nvidia-smi &
    done |
    awk '/%/ { sum += $13 }
         END { print "gpuuse.value", sum }'
fi