[WARNING] Service returned no data for label gpu use error on munin node

If you have several Linux clients and no way of monitoring what’s going on, munin is your friend. I have a couple of non-standard munin plugins that I’ve been scraping together, so I was feeling brave enough to write a plugin myself. There are several blogs that tell you how to do that. Here you have the entry for python munin scripts. And here is the official one for python-munin. There’s also an official one for a shell munin plugin. This one shows how to write a bash munin script, and so does this one, with a better example that I will take as a template. Do not forget there’s also an official testing procedure, with a corresponding troubleshooting guide that is a bit short for my taste. I also found interesting this link on monitoring mongodb with munin and this article from the Linux magazine.

Let’s not digress. What I want to do is monitor the GPU usage across my cluster from the login node. For this I need to prepare two scripts beforehand: one to read the local GPU activity, another one to do the math over ssh. This is what they look like:

#!/bin/bash
# gpu_total
# explanation: get the lines with % from the nvidia-smi table,
# print field 13, which holds the current use,
# join them into one line, split on "%" and print the sum
# (prints 0 when no line matches)
nvidia-smi | grep "%" | awk '{print $13}' | tr -d '\n' \
| awk ' BEGIN { FS = "%"}; {for(i=1;i<=NF ;i++){sum+=$i}}; \
END {print sum+0} '
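
If counting columns in the human-readable table feels fragile, a possible alternative is to ask nvidia-smi for the utilization field only. This is a sketch under assumptions: `--query-gpu=utilization.gpu` and `--format=csv,noheader,nounits` are regular nvidia-smi options, but the helper name `sum_util` is made up here, and the result has not been compared against the script above on any particular driver version.

```shell
#!/bin/bash
# gpu_total, alternative sketch (sum_util is a hypothetical helper name)

# Add up one number per line from stdin; prints 0 when there is no input.
sum_util() {
    awk '{ sum += $1 } END { print sum + 0 }'
}

# Only query the GPUs when nvidia-smi is actually installed on this node.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | sum_util
fi
```

Keeping the sum in its own function means it can be tried on any machine by piping a few numbers into it, even one without a GPU.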

The script above (gpu_total) assumes you can run nvidia-smi on the node. If that is not the case, maybe you can cook up something similar. Now, on the login node, I ssh to each GPU node and run the script. Something like this:

#!/bin/bash
# gpu_sum
## ssh to each client, run gpu_total there and add up the results
sum=0
for i in a b c d e ; do
 # "reading" instead of "local", which is also the name of a bash builtin
 reading=$(ssh "$i" gpu_total)
 sum=$((sum + reading))
done
echo "$sum"
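
The loop above breaks if one node is unreachable, because the reading for that node is then empty or an error message rather than a number. One way around it, as a sketch only, is to add a reading to the total just when it is a whole number. The helper name `add_reading` is hypothetical.

```shell
#!/bin/bash
# gpu_sum, defensive sketch (add_reading is a hypothetical helper name)

# Print TOTAL + READING, or just TOTAL when READING is not a whole number
# (for example an empty string because ssh to that node failed).
add_reading() {
    local total=$1 reading=$2
    case $reading in
        ''|*[!0-9]*) echo "$total" ;;
        *)           echo $(( total + reading )) ;;
    esac
}

# Possible use inside the loop of gpu_sum:
#   sum=$(add_reading "$sum" "$(ssh "$i" gpu_total)")
```

Silently skipping a node makes the total look lower than it really is, so it may be worth logging the skipped host somewhere as well.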

My graphical clients are named a b c d e here. Yours will surely have other names, so do not forget to change that. I tested the scripts and they work on my network, giving me one number at the end, like this: 12345. Now, how do I hook this info up to munin? This is my munin script, very similar to the second bash example I gave:

#! /bin/bash
case $1 in
 config)
 cat << EOF
graph_category Slurm
graph_title GPU usage cumulative
graph_vlabel GPU percent
gpuuse.label GPU usage
EOF
 exit 0
 ;;
esac
echo "gpuuse.value $(read_gpu_total)"

Be aware of the formatting! If you don’t format it right, the configuration will not be loaded. If you want to know about munin data types, here’s the link.
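
To catch formatting slips before munin does, the value line can be checked with a small pattern. This is a rough sketch: `check_value_line` is a made-up helper, and the pattern is stricter than munin itself (it accepts only plain decimal numbers or `U`, munin’s marker for an unknown value). On the node, `munin-run` followed by the plugin name, with or without `config`, prints what the plugin actually produces.

```shell
#!/bin/bash
# Sketch only: check_value_line is a hypothetical helper name.

# Succeed when LINE looks like "<field>.value <number or U>".
check_value_line() {
    printf '%s\n' "$1" |
        grep -Eq '^[A-Za-z_][A-Za-z0-9_]*\.value (U|-?[0-9]+(\.[0-9]+)?)$'
}

check_value_line "gpuuse.value 42" && echo "looks fine"
check_value_line "gpuuse.value"    || echo "missing value"
```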

Now the two-days-of-work question: what does read_gpu_total give? If you try to run the gpu_sum script directly, you will get an ssh error since munin, in my case, does not have the rights to do passwordless ssh to all our GPU servers. Also, the ssh and the calculation take around 30 seconds in my case (I have a lot of GPUs), so I run a crontab that writes the number into a file, let’s say /tmp/gpus. Now I want the plugin to read the file and draw the output. We read it like this:

cat /tmp/gpus | tail -1 | bc
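
For completeness, here is a rough sketch of both halves of that arrangement: a helper the cron job could use to write the number, and a reader for the plugin. The names `write_reading` and `read_gpu_file`, the script path in the crontab comment and the 5-minute schedule are placeholders, not something munin prescribes. The reader prints `U`, which munin treats as "unknown", instead of failing when the file is missing, unreadable or empty. Unlike the one-liner above, it does not pipe the value through bc.

```shell
#!/bin/bash
# Sketch only: write_reading and read_gpu_file are hypothetical helper names.

# Cron side: write VALUE to TARGET through a temporary file in the same
# directory, so the plugin never reads a half-written file.
write_reading() {
    local target=$1 value=$2 tmp
    tmp=$(mktemp "${target}.XXXXXX") || return 1
    printf '%s\n' "$value" > "$tmp"
    chmod 644 "$tmp"          # readable by whatever user runs the plugin
    mv "$tmp" "$target"
}

# Plugin side: print the last line of FILE, or U when there is nothing to read.
read_gpu_file() {
    local file=$1 value=
    if [ -r "$file" ]; then
        value=$(tail -n 1 "$file")
    fi
    echo "${value:-U}"
}

# Hypothetical crontab entry for a script that calls
#   write_reading /tmp/gpus "$(gpu_sum)"
#   */5 * * * * /path/to/that/script
```

In the plugin, the last line would then become something like `echo "gpuuse.value $(read_gpu_file /tmp/gpus)"`.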

Note that you could put the cat command above directly in the plugin script. I load the plugin and wait 5 minutes, only to get the error in the title. It looks like I was not the first one with the issue, and here you have another report on sourceforge. None of them shed any light for me. What is it? How do I fix it? If you check the node logs, the error is not very clear:

## > tail -50 /var/log/munin-node/munin-node.log
some-date-here [10535] Error output from slurm_gpus:
some-date-here [10535] cat: /tmp/gpus.txt: No such file or directory

Of course the file is there! I can check its content with more, and I can run the command on my root prompt. What the hell is going on? I will tell you: munin cannot read it! Change the ownership of the output file so that munin can read it. In my case,

-rw-r--r-- 1 nobody munin size Month Day HH:MM gpus
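
One way to catch this class of problem before waiting another five minutes is to look at the permission bits from a script. The sketch below only checks the "other" read bit, which is set in the -rw-r--r-- listing above. `world_readable` is a made-up helper name, and the check says nothing about directory permissions or about which user your munin-node runs plugins as.

```shell
#!/bin/bash
# Sketch only: world_readable is a hypothetical helper name.

# Succeed when the "other" read bit is set on FILE.
# In the mode string printed by ls -l (e.g. -rw-r--r--) that is character 8.
world_readable() {
    local bit
    bit=$(ls -ld "$1" 2>/dev/null | cut -c8)
    [ "$bit" = "r" ]
}

if world_readable /tmp/gpus; then
    echo "/tmp/gpus is world-readable"
else
    echo "/tmp/gpus is missing or not world-readable"
fi
```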

Finally, I can tell the boss we don’t need more GPUs, since we are on average below 50% of total occupancy. Thanks munin for that, less work installing cards 😀

Bonus track: list of munin open issues.

About bitsanddragons

A traveller, an IT professional and a casual writer