3 Bandwidth monitoring tools

I managed to build a vnstat web interface to plot the network traffic, but the latency and the look are not the best. Also, it relies on crontabs and ssh connections, and it lacks alarms. Now that I have a working solution, I decided to search a little bit more. Here you have 16 useful bandwidth monitoring tools, and my comments on some of them.

Zabbix seems to be a fully configurable dashboard, with everything you may want. The installation procedure is here. Zabbix can collect different types of data, which are used to create historical graphs and to show performance or load trends of the monitored targets. Suitable for small, private, and homogeneous networks, I believe.

Observium is said to support a wide range of operating systems and hardware platforms, including Linux, Windows, FreeBSD, Cisco, HP, Dell, NetApp and so on. This is quite interesting if you want to monitor everything, even the printers. The installation procedure is here. If you have a look at the demo version, you will see how complicated it can become. My opinion: suitable for big, distributed, heterogeneous networks. This is not my case, so I’m not going to use it.

Cacti is the last one that I considered keeping as my final tool. It says it is used to graph time-series data of metrics such as network bandwidth utilization, CPU load, running processes, disk space, etc. So it is like munin, but fully configurable. Here you have the installation procedure. I’m not sure RRDtool is the best one for my plots, but you may want to give it a try. I’m going to say here that cacti is suitable for small, private and heterogeneous networks.

Of course it’s a lot of work to install and configure all of them, so if you have a chance, don’t waste your time and go for the one you like most. In my case, it’s going to be Zabbix. BTW, does anyone know what the web look of its dashboard is called?


A docker munin server on CentOS 7

The plan is to get rid of the old pizza box servers (around 20 Kg, 1U flat, with two power supplies) that are used, at this moment, only for services such as running munin. I will build a munin docker and dump at least one of the monsters. The other ones I will sort out in a similar way. But let’s examine the munin docker server solutions first:

I will say the Wrender option is overkill. It will work, but it has a lot of unnecessary stuff. I can tell you I have two wrender LAMPs already running in parallel with no problem. But I don’t need php or PhpMyAdmin, so let’s go for something completely different.

The scalingo munin looks full of features, some of them unwanted by us. Anyway, I try it. Downloading and configuring it was easy. Once it seems to run, I open my browser on the desired port (8080 by default if you run the example) just to find the

Munin has not run yet. Please try again in a few moments.

message, which is announced as the one you should find at first. OK, it can be because of the nodes:

 -e NODES="server1:10.0.0.1 server2:10.0.0.2" \

(check the examples again) or because I didn’t configure the docker network properly. I have nodes already running munin clients, so of course I add those. I can ping the docker from the node, and the node from the docker, but still no graphs. Also, it is not very clear to me how to start and stop the munin service. I therefore go ahead to the next option.

The shaf munin is also very easy to download and run. What I get when I point my browser to the expected address looks like a client-version of the munin server: basically we have one munin server per client, so to say. But it runs, and in 5 minutes I get my graphs, so I hook up some of my servers by editing the /etc/munin/munin.conf file and adding them under the default one:

[unRAID]
 address 127.0.0.1
 use_node_name yes

I also need to modify the /etc/munin/munin-node.conf on the clients. I ssh to one client with the running munin client, and add the IP of my shaf munin docker. Let’s say my munin docker has the IP 1.1.1.2, so something like:

allow ^127\.0\.0\.1$
allow ^1\.1\.1\.2$
allow ^::1$

After that, I restart the munin service in the docker and in my client. Do I get the graphs of my client? No, I don’t. The error reads on the client:

root@client ## > tail /var/log/munin-node/munin-node.log 
Binding to TCP port 4949 on host 0.0.0.0 with IPv4
Setting gid to "0 0"
DATE: Server closing!
Process Backgrounded
DATE: Munin::Node::Server (type Net::Server::Fork) starting! pid(839)
Resolved [*]:4949 to [0.0.0.0]:4949, IPv4
Binding to TCP port 4949 on host 0.0.0.0 with IPv4
Setting gid to "0 0"
DATE: CONNECT TCP Peer: "[1.1.1.1]:60494" Local: "[1.1.1.3]:4949"
DATE: [1637] Denying connection from: 1.1.1.1

So what is going on? Easy! The client (with IP 1.1.1.3) is denying a connection coming from 1.1.1.1. What is that server 1.1.1.1? Buggers! It’s my docker server! I mean, the physical server that runs the munin docker instance. Apparently the traffic from the container reaches the client with the address of the host, not with the 1.1.1.2 of the container. I add its IP to the allow list on my client, so now it looks like this:

allow ^127\.0\.0\.1$
allow ^1\.1\.1\.2$
allow ^1\.1\.1\.1$
allow ^::1$
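To avoid guessing which address to allow, the refused peers can be pulled out of the node log and turned into allow lines. This is a minimal sketch, assuming the log format shown in the transcript above; the function names denied_peers and ip_to_allow are made up for illustration and only handle IPv4 addresses.

```shell
#!/bin/bash
# Print the unique peer addresses that munin-node refused.
# Reads munin-node.log lines on stdin.
denied_peers() {
    sed -n 's/.*Denying connection from: *\([^ ]*\).*/\1/p' | sort -u
}

# Turn an IPv4 address into an anchored "allow" regex line,
# escaping the dots as in the list above.
ip_to_allow() {
    printf 'allow ^%s$\n' "$(printf '%s' "$1" | sed 's/\./\\./g')"
}

# Example (log path taken from the transcript above):
#   denied_peers < /var/log/munin-node/munin-node.log |
#       while read -r ip; do ip_to_allow "$ip"; done
```

Review the suggested lines before pasting them into the node configuration: every address listed there will be allowed to query the node.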

After that, I restart the munin service in the docker and in my client.  And in 15 minutes, my graphs start to appear. Victory! Time to dump the old hardware 😀 😀

ssh to docker ‘Permission denied’

I want to ssh to my docker. I know, it’s weird, but I want to. The traditional way to get a bash on my docker

 docker exec -it mydocker /bin/bash

is working, but it’s not the real thing. I access the docker as above, then install the OpenSSH packages. The procedure will vary from system to system (ubuntu with apt-get, centos with yum). I will also generate an ssh key and so on, even if it is not needed, and copy it to the computer ‘client‘. This is what it looks like.

root@mydocker# ssh-keygen -t rsa
Generating public/private rsa key pair.
...some random art here...
root@mydocker:# cat .ssh/id_rsa.pub | 
ssh root@client 'cat >> .ssh/authorized_keys'
root@client's password: 
root@mydocker:~# ssh client
---> OK!
root@client ~ ## > exit

Now, from the bash I got via docker exec, I can ssh to another client. The problem is going from client to mydocker. If mydocker has the local IP 1.1.1.1, it looks like this:

root@client ~ ## > ssh -Y root@1.1.1.1 -p 2222
root@1.1.1.1's password: 
Permission denied, please try again.

Of course, I mapped the port when I created the docker, and as I said, I can ssh to client from mydocker. How to fix this? I will simply rsync the /etc/ssh/ and /root/.ssh/ folders from client to mydocker. Something like:

root@mydocker:~# rsync -av root@client:/etc/ssh/ \
/etc/ssh/ --delete-after --progress
receiving file list ... 
15 files to consider
...here the files are coming...
root@mydocker:~# service ssh restart
@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: UNPROTECTED PRIVATE KEY FILE! @
@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0640 for '/etc/ssh/XXX' are too open.
It is required that your private key files are 
NOT accessible by others.
This private key will be ignored.
key_load_private: bad permissions
Could not load host key: /etc/ssh/XXX
 * Restarting OpenBSD Secure Shell server sshd 
@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: UNPROTECTED PRIVATE KEY FILE! @
@@@@@@@@@@@@@@@@@@@@@@@@@@
...the same as the previous warning...

root@mydocker:~# rsync -av root@client:/root/.ssh/ \
.ssh/ --delete-after --progress
receiving file list ... 
8 files to consider
..here the files are coming ...
root@mydocker:~# service ssh restart
@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: UNPROTECTED PRIVATE KEY FILE! @
@@@@@@@@@@@@@@@@@@@@@@@@@@
...the same as the previous warning...
@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: UNPROTECTED PRIVATE KEY FILE! @
@@@@@@@@@@@@@@@@@@@@@@@@@@
...the same as the previous warning...
root@mydocker:~# exit

I did two rsyncs and two ssh restarts, and I got four warnings. I don’t know if the warnings (Could not load host key) are particular to my settings, but I decided to ignore them. They are warnings, after all. Now I can do password-less ssh from client to mydocker. This is the standard output (mydocker is an ubuntu docker):
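The "Permissions 0640 ... are too open" message means sshd refuses to load that private key, so the warnings can probably be silenced by tightening the modes. A minimal sketch, assuming the files named in the warnings are the usual ssh_host_*_key host keys (the transcript hides the real names behind XXX); fix_host_key_perms is a hypothetical helper name.

```shell
#!/bin/bash
# Tighten permissions on sshd host keys in the given directory
# (default /etc/ssh): private keys 0600, public keys 0644.
fix_host_key_perms() {
    local dir=${1:-/etc/ssh} key
    for key in "$dir"/ssh_host_*_key; do
        [ -e "$key" ] || continue      # the glob matched nothing
        chmod 600 "$key"
        [ -e "$key.pub" ] && chmod 644 "$key.pub"
    done
    return 0
}

# Example, inside the container, before restarting ssh:
#   fix_host_key_perms /etc/ssh
```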

root@client ~ ## > ssh -Y root@1.1.1.1 -p 2222
Welcome to Ubuntu 16.04.4 LTS 
(GNU/Linux 3.10.0-693.21.1.el7.x86_64 x86_64)

 * Documentation: https://help.ubuntu.com
 * Management: https://landscape.canonical.com
 * Support: https://ubuntu.com/advantage

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are 
described in the individual files in 
/usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, 
to the extent permitted by applicable law.

/usr/bin/xauth: file /root/.Xauthority does not exist
root@mydocker:~# exit

This procedure is not clean (look at the warnings) but if it works, I’m OK. Are you?
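Before copying a whole /etc/ssh over, it may be worth checking one common reason for root getting "Permission denied" after a password prompt: the PermitRootLogin setting of sshd. Whether that was the cause here is an assumption, since the container's sshd_config is not shown. The sketch below prints the first uncommented value in a config file, which is the one sshd uses when a keyword appears more than once; permit_root_login is a hypothetical helper name, and Match blocks are ignored.

```shell
#!/bin/bash
# Print the first uncommented PermitRootLogin value found in an
# sshd_config file (default /etc/ssh/sshd_config). Prints nothing
# when the keyword is not set. Match blocks are not evaluated.
permit_root_login() {
    local conf=${1:-/etc/ssh/sshd_config}
    awk 'tolower($1) == "permitrootlogin" { print $2; exit }' "$conf"
}

# Example, inside the container:
#   permit_root_login
# A value of "prohibit-password" or "no" would explain a refused
# root password even when the password itself is correct.
```

Running sshd -T as root prints the effective configuration and is the more complete check when it is available.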

[WARNING] Service returned no data for label gpu use error on munin node

If you have several Linux clients and no way of monitoring what’s going on, munin is your friend. I have a couple of non-standard munin plugins that I’ve scraped together, so I was feeling brave enough to write a plugin myself. There are several blogs that tell you how to do that. Here you have the entry for python munin scripts. And here is the official one for python-munin. There’s also an official one for a shell munin plugin. This one shows how to write a bash munin script, and so does this one, with a better example that I will take as a template. Do not forget there’s an official testing procedure also, with a corresponding troubleshooting guide that is short, for my taste. I also found interesting this link to monitor mongodb with munin and this article from the Linux magazine.

Let’s not digress. What I want to do is monitor the gpu usage across my cluster from the login node. For this I need to prepare two scripts beforehand, one to read the local gpu activity, another one to do the math over ssh. This is what they look like:

#!/bin/bash
# gpu_total 
# explanation: get the lines with % 
# print the fields corresponding to current use
# format them, add them and print the sum
nvidia-smi | grep "%" | awk '{print $13}' | tr -d '\n' \
| awk ' BEGIN { FS = "%"}; {for(i=1;i<=NF ;i++){sum+=$i}}; \
{print sum} '
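The column number in gpu_total depends on the layout of the nvidia-smi table. A variant that asks nvidia-smi for the utilization directly could look like the sketch below. It swaps the table parsing for the --query-gpu=utilization.gpu and --format=csv,noheader,nounits options, and assumes the installed driver supports them; sum_lines is a helper name made up for illustration.

```shell
#!/bin/bash
# Sum one number per line from stdin; prints 0 for empty input.
sum_lines() {
    awk '{ sum += $1 } END { print sum + 0 }'
}

# Total utilization of all GPUs on this node, in percent points
# (two GPUs at 50% each give 100).
gpu_total() {
    nvidia-smi --query-gpu=utilization.gpu \
               --format=csv,noheader,nounits | sum_lines
}

# Example on a GPU node:
#   gpu_total
```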

The one above (gpu_total) assumes you can run nvidia-smi on the node. If that is not the case, maybe you can cook up something similar. Now on the login node, I ssh to each gpu node and run the script. Something like this:

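#!/bin/bash
# gpu_sum
# ssh to each GPU node, run gpu_total there, and add up the results
sum=0
for node in a b c d e; do
    # a node that does not answer counts as 0
    node_total=$(ssh "$node" gpu_total) || node_total=0
    sum=$(( sum + ${node_total:-0} ))
done
echo "$sum"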

My graphical clients are named here a b c d e. Yours will have other names, so do not forget to change that. I test the scripts and they work on my network, giving me one number at the end, like this: 12345. Now, how do I hook up this info onto munin? This is my munin script, very similar to the second bash example I gave:

#! /bin/bash
case $1 in
 config)
 cat << EOF
graph_category Slurm
graph_title GPU usage cumulative
graph_vlabel GPU percent
gpuuse.label GPU usage
EOF
 exit 0
esac
echo "gpuuse.value" `read_gpu_total`

Be aware of the formatting! If you don’t format it right, the configuration will not be loaded. If you want to know about munin data types, here’s the link.

Now the two-days-of-work question: what does read_gpu_total give? If you try to run the gpu_sum script directly, you will get an ssh error, since munin, in my case, does not have the rights to do passwordless ssh to all our GPU servers. Also, the ssh and the calculation take around 30 seconds in my case (I have a lot of gpus), so I will run a crontab that writes the number into a file. Let’s say in /tmp/gpus. Now I want the plugin to read the file and draw the output. We read it like this:

cat /tmp/gpus | tail -1 | bc

Note that you could put that directly in the script. I load the script and wait 5 minutes, only to get the error in the title. It looks like I was not the first one with the issue, and here you have another on sourceforge. None of them shed any light for me. What is it? How to fix it? If you check the node logs, the error is not very clear:

## > tail -50 /var/log/munin-node/munin-node.log
some-date-here [10535] Error output from slurm_gpus:
some-date-here [10535] cat: /tmp/gpus.txt: No such file or directory

Of course the file is there! I can check its content with more, and I can run the command on my root prompt. What the hell is going on? I will tell you. Munin cannot read it! Change the ownership of the output file so that munin can read it. In my case,

-rw-r--r-- 1 nobody munin size Month Day HH:MM gpus
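The two halves of this setup can be sketched as a pair of functions: one for the cron job, which writes the number and leaves the file readable, and one for the plugin, which reports U (munin's marker for an unknown value) instead of failing when the file cannot be read. This is a minimal sketch under those assumptions; write_cache and read_cache are hypothetical names, and gpuuse is the field name from the plugin above.

```shell
#!/bin/bash
# Cron side: write a value to the cache file atomically and leave it
# world-readable so the user running the plugin can open it.
write_cache() {
    local value=$1 file=$2 tmp
    tmp=$(mktemp "$file.XXXXXX") || return 1
    printf '%s\n' "$value" > "$tmp"
    chmod 644 "$tmp"
    mv "$tmp" "$file"
}

# Plugin side: print the munin value line, or "U" (unknown) when the
# cache file is missing, unreadable, or does not hold a whole number.
read_cache() {
    local file=$1 value
    value=$(tail -n 1 "$file" 2>/dev/null) || value=''
    case $value in
        ''|*[!0-9]*) value=U ;;
    esac
    echo "gpuuse.value $value"
}

# Example:
#   write_cache "$(gpu_sum)" /tmp/gpus    # in the cron job
#   read_cache /tmp/gpus                  # in the plugin
```

Writing to a temporary file and renaming it means the plugin never sees a half-written number, whatever moment the node polls it.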

Finally, I can tell the boss we don’t need more GPUs, since we are on average below 50% of total occupancy. Thanks munin for that, less work installing cards 😀

Bonus track: list of munin open issues.