CryoSPARC 2 slurm cluster worker update error

This is about CryoSPARC again. Previously we installed it on CentOS and updated it, but on a master + node configuration, not on a cluster configuration. If it’s a new install on your slurm cluster, you should follow the master installation guide, which tells you to do a master install on the login node and then, on the same login node, install the worker:

module load cuda-XX
cd cryosparc2_worker
./install.sh --license $LICENSE_ID --cudapath <path-to-cuda>

The situation is that we updated the master node, but the Lane default (cluster) didn’t get the update, and jobs crash because of it. First we remove the worker target using one of the management tools, like this:

cryosparcm cli 'remove_scheduler_target_node("cluster")'

Then we run cryosparcm stop and we move the old worker software folder

mv cryosparc2_worker cryosparc2_worker_old

and get a new copy of the worker software with curl.

curl -L https://get.cryosparc.com/download/worker-latest/$LICENSE_ID \
 > cryosparc2_worker.tar.gz

We run cryosparcm start, then untar, cd, and install. Don’t forget to add your LICENSE_ID and to load the cuda module, or be sure you have a default cuda. This is an edited extract of my worker install:

******* CRYOSPARC SYSTEM: WORKER INSTALLER ***********************

Installation Settings:
License ID :  XXXX
Root Directory : /XXX/Software/Cryosparc/cryosparc2_worker
Standalone Installation : false
Version : v2.5.0


CUDA check..
Found nvidia-smi at /usr/bin/nvidia-smi

CUDA Path was provided as /XXX/cuda/9.1.85
Checking CUDA installation...
Found nvcc at /XXX/cuda/9.1.85/bin/nvcc
The above cuda installation will be used but can be changed later.


Setting up hard-coded environment variables


Installing all dependencies.

Checking dependencies... 
Dependencies for python have changed - reinstalling...
Installing anaconda python...
installing: python-2.7.14-h1571d57_29 ...

...anaconda being installed...
installation finished.
anaconda python installation successful.
Preparing to install all conda packages...
conda packages installation successful.
Preparing to install all pip packages...
Processing ./XXX/pip_packages/Flask-JSONRPC-0.3.1.tar.gz

Running install for pluggy ... done
Successfully installed Flask-JSONRPC-0.3.1 
Flask-PyMongo-0.5.1 libtiff-0.4.2 pluggy-0.6.0 
pycuda-2018.1.1 scikit-cuda-0.5.2
You are using pip version 9.0.1, 
however version 19.1.1 is available.
You should consider upgrading via the
 'pip install --upgrade pip' command.
pip packages installation successful.
Main dependency installation completed. Continuing...
Currently checking hash for ctffind
Dependencies for ctffind have changed - reinstalling...
ctffind 4.1.10 installation successful.
Currently checking hash for gctf
Dependencies for gctf have changed - reinstalling...
Gctf v1.06 installation successful.
Completed dependency check.


In order to run processing jobs, you will need to connect this
worker to a cryoSPARC master.


We are re-adding a worker that was previously there, so I don’t do anything else. If I check the web interface, Lane default (cluster) is back. Extra tip: the forum entries about a wrong default, and about the Slurm settings for cryoSPARC v2.

If I need to add something: be aware that the worker install seems to come with its own python, and it does reinstall ctffind and gctf. So be careful if you run python things in addition to cryoSPARC 🙂
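To have it all in one place, here is the whole update sequence as a dry-run sketch: it only prints the commands so you can review them (the download URL and the cluster name are the ones from the steps above; LICENSE_ID must be set in your environment).

```shell
#!/bin/bash
# dry-run sketch of the whole worker update: with DRYRUN set (the default here)
# it only echoes the commands; unset DRYRUN to execute them for real
DRYRUN=1
run() { if [ -n "$DRYRUN" ]; then echo "$@"; else "$@"; fi; }

run cryosparcm cli 'remove_scheduler_target_node("cluster")'
run cryosparcm stop
run mv cryosparc2_worker cryosparc2_worker_old
run curl -L "https://get.cryosparc.com/download/worker-latest/$LICENSE_ID" \
    -o cryosparc2_worker.tar.gz
run tar -xzf cryosparc2_worker.tar.gz
run cryosparcm start
# then: cd cryosparc2_worker and run the installer as shown at the top of the post
```

Review the printed commands, then unset DRYRUN and run it again.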


Install and use of R in CentOS 7.6

I want to create plots from my slurm cluster. And I’ve decided to do it in a modern way, with R. Let’s go through the install of R on CentOS 7 first, then the install of the R packages, and then let’s generate some plots.

 yum install R -y

After a lot of packages, I end up with my R prompt. It looks like this:

## > R

R version 3.5.3 (2019-03-11) -- "Great Truth"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

Now we need to install the data.table package. This is done by simply typing on the R prompt. This is what happens:

> install.packages("data.table")
Installing package into ‘/usr/lib64/R/library’
(as ‘lib’ is unspecified)
--- Please select a CRAN mirror for use in this session ---

And I get a pop-up window with a list of servers. Nice! After the selection, the package is downloaded, compiled and installed. At the end, it looks like this:

* DONE (data.table)
Making 'packages.html' ... done

The downloaded source packages are in
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done

I quit by typing q() and I save the workspace image. All of this I do, of course, on the login node where the slurmdbd daemon is running. Now I get the slurm-stats scripts.

 git clone

And I do some tests in the folder where the R scripts lie. I generate a data file like this:

sacct --format User,Partition,Submit,Start > sisu

It is a variation of the minimal setup; I removed the parameters whose purpose I don’t know. I run the R script… which fails miserably.

Error in eval(bysub, parent.frame(), parent.frame()) : 
object 'Partition' not found
Calls: [ -> [.data.table -> eval -> eval
Execution halted

Let’s try with a little bit of care, that is, adding a start date.

sacct --format User,Partition,Submit,Start -P -a -S 04/19 > sisu
R --no-save --args "sisu" < sacct_stats_queue_dist.R

The output tells me this at the end.

> write.csv(out,paste(filename,"_out.csv",sep=""),
row.names=FALSE, na="")

And I have my CSV (Comma Separated Values) file, generated from the sacct output. A CSV that I can produce via a script in a relatively easy way. Now it’s time to tune this up. And plot it. And… gosh, it’s late, I’m tired, and I think I should leave it here and check it out next time. So see you around!
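By the way, since sacct -P output is just pipe-separated text, you can also get a quick CSV without R at all. A minimal awk sketch (the sample lines below are invented, standing in for the real sisu file):

```shell
#!/bin/bash
# fake two-line sacct -P dump, standing in for the real `sisu` file
cat > /tmp/sisu <<'EOF'
User|Partition|Submit|Start
verowlet|gpu|2019-04-19T10:00:00|2019-04-19T10:05:00
EOF
# swap the pipe separators for commas, field by field
awk 'BEGIN { FS="|"; OFS="," } { $1=$1; print }' /tmp/sisu > /tmp/sisu_out.csv
cat /tmp/sisu_out.csv
```

Handy as a sanity check before feeding the same dump to the R script.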


Slurm 18.08 database usage

Previously I told you about the problems you may have configuring mariadb for the slurm daemon. Now I’m going to close this chapter with a little cheatsheet on accounting, based on the sreport documentation and the sacctmgr documentation. I will call my cluster super. We can see the cluster name, control host and so on this way:

## > sacctmgr show cluster
Cluster  ControlHost  ControlPort  RPC   Share  GrpJobs  GrpTRES  GrpSubmit  MaxJobs  MaxTRES  MaxSubmit  MaxWall  QOS     Def QOS
-------  -----------  -----------  ----  -----  -------  -------  ---------  -------  -------  ---------  -------  ------  -------
super    123.456.789  6817         8448  1                                                                         normal

The IP of my ControlHost is obviously not 123.456.789, but you get the idea. We can check the utilization like this:

## > sreport cluster utilization -t percent
Cluster Utilization 20XX 00:00:00 - 20XX 23:59:59
Usage reported in Percentage of Total
Cluster   Allocate  Down     PLND Dow Idle     Reserved Reported
--------- --------  -------- -------- -------- -------- --------
super      20.79%   0.00%    0.00%    79.10%   0.11%    100.00%

The command shows the super cluster usage for one day. We can specify the time period using the keywords “start” and “end”, like start=2/16/09 end=2/23/09. Or ask for a specific resource, for example the GPUs. Like this:

sreport cluster utilization start=XXX end=YYY --tres=gres/gpu:GPUTYPE -t percent
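In raw mode sreport prints plain time numbers; the percent column is simply allocated divided by reported. A quick sanity check of that arithmetic with invented raw numbers:

```shell
#!/bin/bash
# invented raw numbers; percent = 100 * allocated / reported
allocated=179600
reported=864000
awk -v a="$allocated" -v r="$reported" 'BEGIN { printf "%.2f%%\n", 100*a/r }'
```

This prints 20.79%, the same kind of figure as the Allocate column above.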

I’m going to show everything in percentages where possible; it is neutral and clear. If you want the “raw” numbers, just remove the “-t percent” from my command lines. One thing you must define in your cluster is an account and users, otherwise the stats will show up empty. We will now create an account called users, list the accounts, then create a user. All the users in this example were generated using a random name generator.

## > sacctmgr create account name=users
Adding Account(s)
Description = Account Name
Organization = Parent/Account Name
A = users C = super
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
## > sacctmgr list account
Account Descr Org
---------- -------------------- ---------
root     default root account root
users    users                users
## > sacctmgr create user name=verowlet account=users cluster=super
Adding User(s)
Associations =
U = verowlet A = users C = super
Non Default Settings
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
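Adding them one by one gets old. A dry-run loop prints one sacctmgr command per user (the names are invented); review the output and then pipe it to sh. The -i flag answers the commit question automatically:

```shell
#!/bin/bash
# print one create-user command per user; pipe the output to sh to execute
for u in verowlet gaalbus cibaugh; do
  echo sacctmgr -i create user name="$u" account=users cluster=super
done
```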

We go on adding users to the “users” account until we have them all. If you do it in a for loop over the list, you may add the same user twice by mistake. The system will not complain about it; a nice Nothing new added message will be displayed on your shell. If you want to remove a user dummytom from the database:

## > sacctmgr remove user dummytom
Deleting users...
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y

What if the user has a “strange” name? We may need to format the output to see it. This is done with the format keyword. For example:

## > sacctmgr show user format=user%20s,defaultaccount%30s
User                  Def Acct
-----------------    ---------------
dumdumb:+*+:dumb:**  users
verowlet             users

Now we can remove the user that is not a user but a typo or a non-filtered element of my user list, dumdumb:+*+:dumb:**. BTW, I have a lot of users, in a UNIX group. They are AD users, and the integration was flawless. So the database works with AD users. Shall we check our top 10 users?

## > sreport user top -t percent
Top 10 Users XXX 00:00:00 - XXX 23:59:59 (86400 secs)
Usage reported in Percentage of Total
Cluster  Login     Proper Name        Account  Used    Energy
-------  --------  -----------------  -------  ------  ------
super    verowle   Veronica Rowlette  user     11.37%  0.00%
super    gaalbus   Gaylene Albus      user     3.18%   0.00%
super    cibaugh   Cicely Baugher     user     1.77%   0.00%
super    jokrein   Jon Krein          user     1.38%   0.00%
super    doleyden  Douglas Leyden     user     0.82%   0.00%

If we want to check the running jobs for a specific user, we can do it with sacct -u username --state=R. I’m not going to paste the output.

To end, you can check this Luxembourgish page with plenty of slurm command examples, and this one with niflheim slurm accounting examples. The same as this, but condensed 😀

Slurm 18.08 with QOS mariadb problems on CentOS 7

I already told you how to install Slurm on CentOS 7, so I’m not going to repeat it for a modern slurm package. I’m going to comment on the new issues I had using that procedure. Problem one: making rpms.

rpmbuild -ta slurm-15.08.9.tar.bz2

The build failed because of missing dependencies, which I solved with a variation of this solution. I just did it as root.

yum install 'perl(ExtUtils::Embed)' 'rubygem(minitest)'

You could also configure, make and make install from the source code. Once done, I run a script that copies my slurm rpms (or my slurm source code) to the local machine, cleans up the previous installation (deleting the packages and the munge and slurm users and folders), and installs everything (munge + slurm).

Problem two: the slurm database configuration. I’m going to start from a working installation of 18.08; that means you can submit jobs, they run, and so on. The first time I modified it I screwed up the queuing system: all jobs got stuck with status CG. The solution to stuck CG jobs is scancel, followed by:

scontrol update NodeName=$node State=down Reason=hung_proc
scontrol update NodeName=$node State=resume

Of course it is normal to make mistakes if you play around. On Sloppy Linux Notes they have a very short guide on how to install mariadb with slurm. Please try the above method before reading on; this one is a sad story 😦

So I had it already installed on my database client, but I was not using it. Instead of removing all the little bits and pieces, I tried to reset the mariadb root password. Note that you may want to recover the mysqld password instead. In any case, this is the error:

root@node ~ ## > mysql -u root -p
Enter password: 
ERROR 1045 (28000): Access denied for user 'root'@'localhost' 
(using password: YES)

Even with the right password. Depending on your install, skip-grant-tables may work; in my case, I get this:

MariaDB [(none)]> ALTER USER 'root'@'localhost' 
ERROR 1064 (42000): You have an error in your SQL syntax; 
check the manual that corresponds to your MariaDB server version 
for the right syntax to use near 'USER 'root'@'localhost' 

I check the documentation as suggested, but I still don’t manage, not even with some posts about the problem on a mac. I tried generating a password hash… but without luck. This works:

MariaDB [(none)]> SET PASSWORD FOR 'root'@'localhost' 
= PASSWORD('NewPass'); 
Query OK, 0 rows affected (0.00 sec)

But I can’t log in as root after flushing the privileges and removing skip-grant-tables from my.cnf. On DigitalOcean they advise to alter the user also, but instead of modifying my.cnf, they suggest starting the database skipping the grant tables:

mysqld_safe --skip-grant-tables --skip-networking &

My mariadb version is 5.5

root@node > rpm -qa | grep mariadb 


With the server started that way, the same SET PASSWORD command from above finally sticks:

MariaDB [(none)]> SET PASSWORD FOR 'root'@'localhost' 
= PASSWORD('NewPass');

Now I can log in as root with my new password. What’s next? Yes, we need to set up the mariadb slurm user and the slurm tables.

MariaDB [(none)]> CREATE USER 'slurm'@'node';
Query OK, 0 rows affected (0.00 sec)
MariaDB [(none)]> create database slurm_acct_db;
Query OK, 1 row affected (0.00 sec)
MariaDB [(none)]> grant all on `slurm_acct_db`.* TO 'slurm'@'node'
    -> identified by 'SLURMPW' with grant option;
Query OK, 0 rows affected (0.00 sec)
MariaDB [(none)]> flush privileges;
Query OK, 0 rows affected (0.00 sec)

Here you have how to add a user to mariadb with all privileges, in case you need more info. And the documentation on GRANT. And all in a nutshell with a script. If you have problems with the database (for example it is corrupted):

root@node> more /var/log/mariadb/mariadb.log
XXX [ERROR] Native table 'performance_schema'.'rwlock_instances' 
has the wrong structure

you may want to DROP it or rebuild all the databases.

root@node ~ ## > mysql_upgrade -uroot -p --force
Enter password: 
MySQL upgrade detected
Phase 1/4: Fixing views from mysql
Phase 2/4: Fixing table and database names
Phase 3/4: Checking and upgrading tables
Processing databases

After such an action, it may be interesting to get a list of mariadb users and rights, or show your grants:

MariaDB [(none)]> show grants

But let’s not look back; let’s go ahead. If after all these troubles you didn’t give up and you have a mariadb running, it’s time to configure the slurmdbd daemon. Our slurmdbd.conf should look like this:
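A minimal sketch, assuming munge authentication and the mariadb user and database created above (the hostname, password and paths are placeholders you must adapt):

```ini
# minimal slurmdbd.conf sketch; adapt hostname, password and paths
AuthType=auth/munge
DbdHost=node
SlurmUser=slurm
DebugLevel=4
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=node
StorageUser=slurm
StoragePass=SLURMPW
StorageLoc=slurm_acct_db
```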


We can start the daemon now…and here comes the section for slurmdbd errors.

Error:  ConditionPathExists=/etc/slurm/slurmdbd.conf was not met
Solution: Check that the file exists, has that name, and is accessible by ‘node’.

Error:  This host (‘node’) is not a valid controller
Solution: Check your slurm.conf, where the controller is defined in ‘ControlMachine’.

Error:  mysql_real_connect failed: 2003 Can’t connect to MySQL server on ‘node’
Solution: Check StorageHost=XXX in your slurmdbd.conf and AccountingStorageHost=XXX in slurm.conf. Change it to an IP instead of a name.

Error:  mysql_real_connect failed: 1045 Access denied for user ‘slurm’@’node’ (using password: YES)
Solution: Check that you can log in as ‘slurm’ with SLURMPW on mysql. If not, you need to create a user that is able to do that.

Error:  Couldn’t load specified plugin name for accounting_storage/mysql: Plugin init() callback failed
Solution: Check that your mariadb is up and running, and that slurm was built with mysql support (the mariadb development files must be present at build time). You may need to recompile everything…

Error:  It looks like the storage has gone away trying to reconnect
Solution: Check that the cluster is seen by the accounting system. If not, you need to add it using an account manager command

root@node ## > sacctmgr add cluster MYCLUSTER

We need to set QOS also. To do so, maybe we need to use the consumable resource allocation plugin select/cons_res, that is, tell slurm to manage the CPUs, RAM, and GPUs. Add to your slurm.conf something like this:
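A sketch of the relevant lines; the key names are real slurm.conf parameters, the values are examples to adapt (the accounting entries assume the slurmdbd setup from above):

```ini
# consumable resources: schedule CPUs and memory instead of whole nodes
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
# send accounting to slurmdbd and enforce limits/QOS
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=node
AccountingStorageEnforce=limits,qos
```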


There are a lot of examples in the slurm documentation on cons_res. Be aware that there is a cons_res bug on hardened systems if you compile slurm with hardening. Let’s define some QOS as in the documentation.

sacctmgr add qos zebra

And see how they look:

sacctmgr show qos format=name,priority
Name       Priority 
---------- ---------- 
normal     0 
zebra      0 
elephant   0

Now everything should be fine. We check:

root@node ## > slurmctld -Dvvv

If you need it, here you have the QOS at the biocluster, and the official documentation on slurm accounting. And I’m pretty tired of fixing things, distributing files, and looking at logs. I hope you didn’t need any of this. To finish this collection of troubles, here is the slurm problems page from SCSC. Happy slurming…

[WARNING] Service returned no data for label gpu use error on munin node

If you have several Linux clients and no way to monitor what’s going on, munin is your friend. I have a couple of non-standard munin plugins that I’ve been scraping together, so I was feeling brave enough to write a plugin myself. There are several blogs that tell you how to do that. Here you have the entry for python munin scripts, and here is the official one for python-munin. There’s also an official one for a shell munin plugin. This one shows how to do a bash munin script, but this one has a better example that I will take as a template. Do not forget there’s an official testing procedure also, with a corresponding troubleshooting guide (too short for my taste). I found also interesting this link to monitor mongodb with munin and this article from the Linux magazine.

Let’s not digress. What I want to do is monitor the GPU usage across my cluster from the login node. For this I need to prepare two scripts beforehand: one to read the local GPU activity, another to do the math over ssh. This is how they look:

#!/bin/bash
# gpu_total
# explanation: get the lines with %,
# print the fields corresponding to current use,
# format them, add them up and print the sum
nvidia-smi | grep "%" | awk '{print $13}' | tr -d '\n' \
 | awk 'BEGIN { FS = "%" }; {for(i=1;i<=NF;i++){sum+=$i}}; \
{print sum}'
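You can check the awk part without touching a GPU box by feeding it fake nvidia-smi lines. The numbers below are invented, but the column layout mimics the real table output, with the utilization in field 13:

```shell
#!/bin/bash
# two fake GPU rows in nvidia-smi layout; utilization is the 13th field
fake_smi() {
  echo "| 31%   45C    P2    85W / 250W |  1200MiB / 11178MiB |     40%      Default |"
  echo "| 29%   40C    P2    80W / 250W |   800MiB / 11178MiB |     60%      Default |"
}
# same pipeline as gpu_total, with fake_smi instead of nvidia-smi
fake_smi | grep "%" | awk '{print $13}' | tr -d '\n' \
 | awk 'BEGIN { FS = "%" }; { for(i=1;i<=NF;i++){sum+=$i} }; { print sum }'
```

Here it prints 100 (40 + 60), the summed utilization.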

The one above (gpu_total) is assuming you can run nvidia-smi on the node. If that is not the case, maybe you can cook up something similar. Now on the login node, I ssh to the gpu node and run the script. Something like this:

#!/bin/bash
# gpu_sum
# ssh to each client, run gpu_total there, and add up the results
sum=0
for i in a b c d e ; do
 remote=`ssh $i gpu_total`
 sum=`expr $sum + $remote`
done
echo $sum

My graphical clients are named here a b c d e. I guess yours are named some other way, so do not forget to change that. I test the scripts and they work on my network, giving me one number at the end, like this: 12345. Now, how do I hook this info up to munin? This is my munin script, very similar to the second bash example I gave:

#!/bin/bash
case $1 in
 config)
  cat << EOF
graph_category Slurm
graph_title GPU usage cumulative
graph_vlabel GPU percent
gpuuse.label GPU usage
EOF
  exit 0;;
esac
echo "gpuuse.value" `read_gpu_total`

Be aware of the formatting! If you don’t format it right, the configuration will not be loaded. If you want to know about munin data types, here’s the link.

Now the two-days-of-work question: what does read_gpu_total give? If you try to run the gpu_sum script directly, you will get an ssh error, since munin, in my case, does not have rights to do passwordless ssh to all our GPU servers. Also, the ssh and the calculation take around 30 seconds in my case (I have a lot of GPUs), so I will run a crontab that writes the number into a file, let’s say /tmp/gpus. Now I want the plugin to read the file and draw the output. We read it like this:

cat /tmp/gpus | tail -1 | bc

Note that you could put that directly in the script. I load the script and wait 5 minutes, only to get the error in the title. It looks like I was not the first one with the issue, and here you have another one on sourceforge. None of them shed any light for me. What is it? How do we fix it? If you check the node logs, the error is not very clear:

## > tail -50 /var/log/munin-node/munin-node.log
some-date-here [10535] Error output from slurm_gpus:
some-date-here [10535] cat: /tmp/gpus.txt: 
No such file or directory

Of course the file is there! I can check its content with more, and I can run the command at my root prompt. What the hell is going on? I will tell you: munin cannot read it! Change the ownership of the output file so that munin can read it. In my case,

-rw-r--r-- 1 nobody munin size Month Day HH:MM gpus
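So the whole chain, from cron to plugin value, condenses to this sketch (the crontab path, interval and the number are invented; I drop the bc here since the value is a plain integer):

```shell
#!/bin/bash
# what the cron job writes, e.g. with a crontab entry like:
#   */5 * * * * munin /usr/local/bin/gpu_sum > /tmp/gpus
echo 12345 > /tmp/gpus
# what the munin plugin then reads and reports
echo "gpuuse.value $(tail -1 /tmp/gpus)"
```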

Finally, I can tell the boss we don’t need more GPUs, since on average we are below 50% of total occupancy. Thanks munin for that, less work installing cards 😀

Bonus track: list of munin open issues.

Slurm update on CentOS 7

We have a big downtime window (finally!) and everybody is happily drinking coffee and discussing what to do during the holidays, while I’m alone running around updating hardware and software. I have updated the NVIDIA drivers, the kernel, the GPFS system, and now slurm. I download the latest version from here. Now, instead of only compiling it, I’m going to make rpm packages on the node that runs the controller and distribute them around. Like this:

rpmbuild -ta slurm-17.11.5.tar.bz2

My rpms appear in /root/rpmbuild/RPMS/x86_64/, since I built them as root. I copy them to my network location and simply install them using my installation script, which is a “one liner” with the packages in the right order:

yum --nogpgcheck localinstall \
slurm-17.11.5-1.el7.centos.x86_64.rpm \
slurm-devel-17.11.5-1.el7.centos.x86_64.rpm \
slurm-munge-17.11.5-1.el7.centos.x86_64.rpm \
slurm-openlava-17.11.5-1.el7.centos.x86_64.rpm \
slurm-pam_slurm-17.11.5-1.el7.centos.x86_64.rpm \
slurm-perlapi-17.11.5-1.el7.centos.x86_64.rpm \
slurm-plugins-17.11.5-1.el7.centos.x86_64.rpm \
slurm-seff-17.11.5-1.el7.centos.x86_64.rpm \
slurm-sjobexit-17.11.5-1.el7.centos.x86_64.rpm \
slurm-sjstat-17.11.5-1.el7.centos.x86_64.rpm \
slurm-slurmdbd-17.11.5-1.el7.centos.x86_64.rpm \
slurm-slurmdb-direct-17.11.5-1.el7.centos.x86_64.rpm \
slurm-sql-17.11.5-1.el7.centos.x86_64.rpm

Note that I made the rpms by compiling the tarball (rpmbuild), so if you managed to run that command, the rpms should work on your systems. We will see… now I need to wait until “the others” have “the other pieces” ready to be assembled… my job can’t go beyond this point. This is how it is, and this is how I like it 🙂
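For the distributing part, I like a dry-run loop that prints the scp/ssh command pairs (the node names are invented); review the output, then pipe it to sh:

```shell
#!/bin/bash
# print the copy+install command pair for each node; pipe to sh to execute
RPMS=/root/rpmbuild/RPMS/x86_64
for n in node01 node02 login01; do
  echo "scp $RPMS/slurm-*.rpm $n:/tmp/"
  echo "ssh $n yum --nogpgcheck -y localinstall '/tmp/slurm-*.rpm'"
done
```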

EDIT: in addition to the yum command above, I was forced to do an additional yum install slurm-slurmctld-17.11.5-1.el7.centos.x86_64.rpm (for the login node) and yum install slurm-slurmd-17.11.5-1.el7.centos.x86_64.rpm (for the computing nodes). Otherwise the services (slurmctld and slurmd) don’t seem to be installed.

A MATLAB Runtime module

First we download the chosen package from the MATLAB site. Then we unzip it in a test folder and run the provided install script. The procedure is quite easy; no programming knowledge is needed. We choose the path to our runtime software so we can easily reach it from everywhere: a network folder. In this example, we install it in /net/local/MCR_R2017b/.

The installation takes around 4 minutes in my case. After it, we have a sub-folder in MCR_R2017b called v93. That will be our “topdir”. The module that calls this software is as follows.

#%Module1.0
## modules MCR_R2017b
proc ModulesHelp { } {
 global version modroot
 puts stderr "MCR_R2017b -
  sets the Environment for using MCR_R2017b"
}

module-whatis "Sets the environment for using MCR_R2017b"

# for Tcl script use only
set topdir /net/local/MCR_R2017b/v93
set version R2017b
set sys linux86

prepend-path PATH $topdir/bin

# Location in the distribution expected to have an MCR
setenv MCR_ROOT $topdir

# the MCR shared libraries; more dirs (e.g. bin/glnxa64) may be needed
prepend-path LD_LIBRARY_PATH $topdir/runtime/glnxa64

Now we test it. We’re going to compile a helloworld.m. The file content is below.

function helloworld
 fprintf('\nHello, World!\n')

We go to a client with a MATLAB license to compile it. We use the simplest compilation.

me@matlab-client $ > mcc -m helloworld.m
me@matlab-client $ > ls
helloworld* helloworld.m 
mccExcludedFiles.log readme.txt 

As you see, it works, but a lot of other files are generated. To test that our executable is independent of MATLAB, we go to another client (normal-client), load the MCR module and run the compiled code.

me@normal-client $ > module load MCR_R2017b 
me@normal-client $ > ./ 
./ <deployedMCRroot> args
me@normal-client $ > ./helloworld

Hello, World!

So it seems to be OK. If you want to check the basics of MATLAB compilation, you can go to this link. Next time we will see the real, big stuff…