CryoSPARC 2 slurm cluster worker update error

This is about CryoSPARC again. Previously we installed it on CentOS and updated it, but on a master + node configuration, not on a cluster configuration. If it’s a new install on your slurm cluster, you should follow the master installation guide, which tells you to do a master install on the login node and then, on the same login node, install the worker:

module load cuda-XX
cd cryosparc2_worker
./install.sh --license $LICENSE_ID --cudapath <path-to-your-cuda>

The situation is that we updated the master node, but the Lane default (cluster) didn’t get the update and jobs crash because of it. First we uninstall the worker using one of the management tools, like this:

cryosparcm cli 'remove_scheduler_target_node("cluster")'

Then we run cryosparcm stop and move the old worker software folder

mv cryosparc2_worker cryosparc2_worker_old

and get a new copy of the worker software with curl.

curl -L https://get.cryosparc.com/download/worker-latest/$LICENSE_ID \
  > cryosparc2_worker.tar.gz
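
Then we cryosparcm start again, untar, cd, and install. Put together, the sequence looks roughly like this (cuda-XX and the CUDA path are placeholders for whatever your site provides):

cryosparcm start
tar -xzf cryosparc2_worker.tar.gz
cd cryosparc2_worker
module load cuda-XX
./install.sh --license $LICENSE_ID --cudapath <path-to-your-cuda>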

Don’t forget to add your LICENSE_ID and to load the cuda module, or make sure you have a default cuda available. This is an edited extract of my worker install:

******* CRYOSPARC SYSTEM: WORKER INSTALLER ***********************

Installation Settings:
License ID :  XXXX
Root Directory : /XXX/Software/Cryosparc/cryosparc2_worker
Standalone Installation : false
Version : v2.5.0

******************************************************************

CUDA check..
Found nvidia-smi at /usr/bin/nvidia-smi

CUDA Path was provided as /XXX/cuda/9.1.85
Checking CUDA installation...
Found nvcc at /XXX/cuda/9.1.85/bin/nvcc
The above cuda installation will be used but can be changed later.

***********************************************************

Setting up hard-coded config.sh environment variables

***********************************************************

Installing all dependencies.

Checking dependencies... 
Dependencies for python have changed - reinstalling...
---------------------------------------------------------
Installing anaconda python...
----------------------------------------------------------
PREFIX=/XXX/Software/Cryosparc/cryosparc2_worker/deps/anaconda
installing: python-2.7.14-h1571d57_29 ...

...anaconda being installed...
installation finished.
---------------------------------------------------------
Done.
anaconda python installation successful.
---------------------------------------------------------
Preparing to install all conda packages...
-----------------------------------------------------------
----------------------------------------------------------
Done.
conda packages installation successful.
------------------------------------------------------
Preparing to install all pip packages...
----------------------------------------------------------
Processing ./XXX/pip_packages/Flask-JSONRPC-0.3.1.tar.gz

Running setup.py install for pluggy ... done
Successfully installed Flask-JSONRPC-0.3.1 
Flask-PyMongo-0.5.1 libtiff-0.4.2 pluggy-0.6.0 
pycuda-2018.1.1 scikit-cuda-0.5.2
You are using pip version 9.0.1, 
however version 19.1.1 is available.
You should consider upgrading via the
 'pip install --upgrade pip' command.
-------------------------------------------------------
Done.
pip packages installation successful.
-------------------------------------------------------
Main dependency installation completed. Continuing...
-------------------------------------------------------
Completed.
Currently checking hash for ctffind
Dependencies for ctffind have changed - reinstalling...
--------------------------------------------------------
ctffind 4.1.10 installation successful.
--------------------------------------------------------
Completed.
Currently checking hash for gctf
Dependencies for gctf have changed - reinstalling...
-------------------------------------------------------
Gctf v1.06 installation successful.
-----------------------------------------------------------
Completed.
Completed dependency check.

******* CRYOSPARC WORKER INSTALLATION COMPLETE *****************

In order to run processing jobs, you will need to connect this
worker to a cryoSPARC master.

****************************************************************

We are re-adding a worker that was already there before, so I don’t do anything else. If I check the web interface, Lane default (cluster) is back. Extra tip: check the forum entries about a wrong default cluster_script.sh and about the Slurm settings for cryoSPARC v2.
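
If the lane does not reappear by itself, re-registering the cluster lane from the master should just be the standard cluster connect step, run from a directory that still holds your cluster_info.json and cluster_script.sh (I did not need it here, so take it as a sketch):

cd /path/to/your/cluster_config
cryosparcm cluster connect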

If I need to add something: be aware that the worker install seems to come with its own python, and it does reinstall ctffind and gctf. So be careful if you run python things in addition to cryoSPARC 🙂

Slurm 18.08 database usage

Previously I told you about the problems you may have configuring mariadb for the slurm daemon. Now I’m going to close this chapter with a little cheatsheet on accounting, based on the sreport documentation and the sacctmgr documentation. I will call my cluster super. We can see the cluster name, control host and so on this way:

## > sacctmgr show cluster
Cluster ControlHost ControlPort RPC Share GrpJobs GrpTRES GrpSubmit MaxJobs MaxTRES MaxSubmit MaxWall QOS Def QOS
------- ---------
super 123.456.789 6817 8448 1 normal

The IP of my ControlHost is obviously not 123.456.789, but you get the idea. We can check the utilization like this:

## > sreport cluster utilization -t percent
--------------------------------------------------------
Cluster Utilization 20XX 00:00:00 - 20XX 23:59:59
Usage reported in Percentage of Total
----------------------------------------------------
Cluster   Allocate  Down     PLND Dow Idle     Reserved Reported
--------- --------  -------- -------- -------- -------- --------
super      20.79%   0.00%    0.00%    79.10%   0.11%    100.00%

The command shows the super cluster usage for a day. We can specify the time period using the keywords “start” and “end”, like start=2/16/09 end=2/23/09, or restrict it to a specific resource, for example the GPUs. Like this:

sreport cluster utilization start=XXX end=YYY --tres=gres/gpu:GPUTYPE -t percent

I’m going to show everything in percentages if possible; it is neutral and clear. If you want the “raw” numbers, just remove the “-t percent” from my command lines. One thing you must define in your cluster is an account and users, otherwise the stats will show up empty. We will now create an account called users, list the accounts, then create a user. All the users in this example were generated with a random name generator.

## > sacctmgr create account name=users
Adding Account(s)
users
Settings
Description = Account Name
Organization = Parent/Account Name
Associations
A = users C = super
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
## > sacctmgr list account
Account Descr Org
---------- -------------------- ---------
root     default root account root
users    users                users
## > sacctmgr create user name=verowlet account=users cluster=super
Adding User(s)
verowlet
Associations =
U = verowlet A = users C = super
Non Default Settings
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
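
Creating users one at a time gets old quickly if you have many of them. A minimal loop sketch, assuming a hypothetical userlist.txt with one username per line (the -i flag makes sacctmgr commit without asking):

while read u; do
    sacctmgr -i create user name="$u" account=users cluster=super
done < userlist.txt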

We go on adding users to the “users” account until we have them all. If you do it in a loop like the one above, you may add a user twice by mistake. The system will not complain about it; a nice Nothing new added message will be displayed on your shell. If you want to remove a user dummytom from the database:

## > sacctmgr remove user dummytom
Deleting users...
dummytom
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y

What if the user has a “strange” name? We may need to format the output to see it. This is done with the format keyword. For example:

## > sacctmgr show user format=user%20s,defaultaccount%30s
User                  Def Acct
-----------------    ---------------
dumdumb:+*+:dumb:**  users
verowlet             users

Now we can remove the user that is not a user but a typo or a non-filtered element of my user list, dumdumb:+*+:dumb:**. BTW, I have a lot of users, in a UNIX group. They are AD users, and the integration was flawless. So the database works with AD users. Shall we check our top 10 users?

## > sreport user top -t percent
---------------------------------------------------
Top 10 Users XXX 00:00:00 -   XXX 23:59:59 (86400 secs)
Usage reported in Percentage of Total
-----------------------------------------------------------
Cluster  Login     Proper Name         Account  Used    Energy
-------- --------- ------------------- -------- ------- -------
super    verowle   Veronica Rowlette   user     11.37%  0.00%
super    gaalbus   Gaylene Albus       user      3.18%  0.00%
super    cibaugh   Cicely Baugher      user      1.77%  0.00%
super    jokrein   Jon Krein           user      1.38%  0.00%
super    doleyden  Douglas Leyden      user      0.82%  0.00%

If we want to check the running jobs of a specific user we can do it with sacct -u username --state=R. I’m not going to put the output.
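
In case you want to try it anyway, a slightly more useful variant with explicit columns looks like this (the format fields are standard sacct column names):

sacct -u verowlet --state=R --format=JobID,JobName,Partition,State,Elapsed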

To wrap up, you can check this Luxembourgian page with plenty of slurm command examples and this one with Niflheim slurm accounting examples. The same as this, but condensed 😀

CryoSPARC 2 management notes

You know how to install CryoSPARC 2 on CentOS 7. There are some simple commands that I keep repeating, and some complicated ones that I need to run from time to time. I keep adding and deleting nodes.

To add a worker cryo01 with ssd on /data to master cryomaster
cryosparcw connect --update --worker cryo01
--master cryomaster --ssdpath /data
To add a worker cryo01 to a new lane fastlane
bin/cryosparcw connect --worker cryo01
--master cryomaster --ssdpath /data
--lane fastlane --newlane
Here (chapter 2.7) you can learn how to select a lane.
To remove the lane fastlane (see post here)
cryosparcm cli "remove_scheduler_lane('fastlane')"
To remove the worker
cryosparcm cli 'remove_scheduler_target_node("sparc0")
To create an user
cryosparcm createuser
--email user@domain.edu --password CLEARTEXT-PASSWORD
--name "John Doe"
To remove a user (see post here)
cryosparcm icli
db['users'].delete_one({'user.0.domain': 'user@domain.edu'})
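To list the currently registered scheduler targets before touching anything (if I remember the cli call correctly)
cryosparcm cli 'get_scheduler_targets()'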

So user and worker management is kind of tricky. I look forward to the version that will let us manage resources and workload directly. I’ll keep you posted 🙂

Slurm 18.08 with QOS mariadb problems on CentOS 7

I already told you how to install Slurm on CentOS 7, so I’m not going to repeat it for a modern slurm package. I’m going to comment on the new issues I had using the procedure. Problem one: building the rpms.

rpmbuild -ta slurm-15.08.9.tar.bz2

I solved this with a variation of this solution. I just did it as root.

yum install 'perl(ExtUtils::Embed)' 'rubygem(minitest)'

You could also configure, make and make install the source code. Once done, I run a script that copies my slurm rpms or my slurm source code to the local machine, cleans up the previous installation (deleting the packages and the munge and slurm users and folders) and installs everything (munge + slurm).
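
My script is nothing fancy. A stripped-down sketch of it could look like this (the build host, RPM paths and cleanup list are assumptions about my setup, adapt them before running):

#!/bin/bash
# fetch the freshly built RPMs (build-host and path are placeholders)
mkdir -p /tmp/slurm-rpms
scp build-host:/root/rpmbuild/RPMS/x86_64/slurm-*.rpm /tmp/slurm-rpms/

# clean up the previous installation: packages, users, folders
yum -y remove 'slurm*' munge munge-libs
userdel -r slurm 2>/dev/null
userdel -r munge 2>/dev/null
rm -rf /etc/slurm /var/spool/slurm* /var/log/slurm* /etc/munge /var/log/munge

# reinstall munge (this recreates the munge user) and the new slurm RPMs
yum -y install munge munge-libs
useradd -r -s /sbin/nologin slurm
yum -y localinstall /tmp/slurm-rpms/slurm-*.rpm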

Problem two: the slurm database configuration. I’m going to start from a working installation of 18.08, meaning you can submit jobs, they run and so on. The first time I made a modification to it I screwed up the queuing system: all jobs got stuck with status CG. The solution to stuck CG jobs is scancel followed by:

scontrol update NodeName=$node State=down Reason=hung_proc
scontrol update NodeName=$node State=resume

Of course it is normal to make mistakes if you play around. On Sloppy Linux Notes they have a very short guide about how to install a mariadb with slurm. Please try out the above method before reading on, this one is a sad story 😦

So I already had it installed on my database client, but I was not using it. Instead of removing all the little bits and pieces, I tried to reset the mariadb root password. Note that you may want to recover the mysqld password instead. In any case, this is the error:

root@node ~ ## > mysql -u root -p
Enter password: 
ERROR 1045 (28000): Access denied for user 'root'@'localhost' 
(using password: YES)

Even with the right password. Depending on your install, starting mariadb with skip-grant-tables may work; in my case, I get this:

MariaDB [(none)]> ALTER USER 'root'@'localhost' 
IDENTIFIED BY 'NewPass';
ERROR 1064 (42000): You have an error in your SQL syntax; 
check the manual that corresponds to your MariaDB server version 
for the right syntax to use near 'USER 'root'@'localhost' 
IDENTIFIED BY 'NewPass'

I checked the documentation as suggested, but I still didn’t manage, even with some posts about the problem on a Mac. I tried generating a password hash…but without luck. This works:

MariaDB [(none)]> SET PASSWORD FOR 'root'@'localhost' 
= PASSWORD('NewPass'); 
Query OK, 0 rows affected (0.00 sec)

But I can’t log in as root after flushing the privileges and removing skip-grant-tables from my.cnf. On DigitalOcean they advise using ALTER USER as well, but instead of modifying my.cnf, they suggest starting the database skipping the grant tables:

mysqld_safe --skip-grant-tables --skip-networking &

My mariadb version is 5.5

root@node > rpm -qa | grep mariadb 
mariadb-server-5.5.60-1.el7_5.x86_64
mariadb-devel-5.5.60-1.el7_5.x86_64
mariadb-libs-5.5.60-1.el7_5.x86_64
mariadb-5.5.60-1.el7_5.x86_64

So:

MariaDB[(none)]> SET PASSWORD FOR 'root'@'localhost' 
= PASSWORD('NewPass');

Now I can log in as root with my new password. What’s next? Yes, we need to set up the mariadb slurm user and the slurm tables.

MariaDB [(none)]> CREATE USER 'slurm'@'node' 
IDENTIFIED BY 'SLURMPW';
Query OK, 0 rows affected (0.00 sec)
MariaDB [(none)]> create database slurm_acct_db;
Query OK, 1 row affected (0.00 sec)
MariaDB [(none)]> GRANT ALL PRIVILEGES ON 
`slurm_acct_db`.* TO 'slurm'@'node' with grant option;
Query OK, 0 rows affected (0.00 sec)
MariaDB [(none)]> grant all on slurm_acct_db.* TO 'slurm'@'node'
-> identified by 'SLURMPW' with grant option;
Query OK, 0 rows affected (0.00 sec)
MariaDB [(none)]> flush privileges;
Query OK, 0 rows affected (0.00 sec)
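
Before going on, it is worth checking that the new user can actually log in and see the database (the same thing the 1045 error further down complains about):

mysql -u slurm -p -h node slurm_acct_db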

Here is how to add a user to mariadb with all privileges, in case you need more info. And the documentation on GRANT. And all in a nutshell with a script. If you have problems with the database (for example it is corrupted):

root@node> more /var/log/mariadb/mariadb.log
XXX [ERROR] Native table 'performance_schema'.'rwlock_instances' 
has the wrong structure

you may want to DROP it or rebuild all the databases.

root@node ~ ## > mysql_upgrade -uroot -p --force
Enter password: 
MySQL upgrade detected
Phase 1/4: Fixing views from mysql
Phase 2/4: Fixing table and database names
Phase 3/4: Checking and upgrading tables
Processing databases

After such an action, it may be interesting to get a list of mariadb users and rights. Or show your grants:

MariaDB [(none)]> show grants
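
And to see which users and hosts exist at all, the mysql.user table holds that information:

MariaDB [(none)]> SELECT User, Host FROM mysql.user;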

But let’s not look back and go ahead. If after all these troubles you didn’t give up and you have a mariadb running, it’s time to configure the slurmdbd daemon. Our slurmdbd.conf should look like this:

/etc/slurm/slurmdbd.conf
AuthType=auth/munge
DbdAddr=localhost
DbdHost=localhost
SlurmUser=slurm
DebugLevel=4
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=node
StoragePass=SLURMPW
StorageUser=slurm
StorageLoc=slurm_acct_db

We can start the daemon now.
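With systemd that should just be (slurmdbd is the unit name the packaged version ships):

systemctl enable slurmdbd
systemctl start slurmdbd
systemctl status slurmdbd -l

…and here comes the section for slurmdbd errors.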

Error:  ConditionPathExists=/etc/slurm/slurmdbd.conf was not met
Solution: Check that the file exists, has exactly that name, and is accessible on ‘node’.

Error:  This host (‘node’) is not a valid controller
Solution: Check your slurm.conf, where the controller is defined in ‘ControlMachine’.

Error:  mysql_real_connect failed: 2003 Can’t connect to MySQL server on ‘node’
Solution: Check StorageHost=XXX in your slurmdbd.conf and AccountingStorageHost=XXX in slurm.conf. Try an IP instead of a name.

Error:  mysql_real_connect failed: 1045 Access denied for user ‘slurm’@’node’ (using password: YES)
Solution: Check that you can log in as ‘slurm’ with SLURMPW on mysql. If not, you need to create a user that is able to do that.

Error:  Couldn’t load specified plugin name for accounting_storage/mysql: Plugin init() callback failed
Solution: Check that your mariadb is up and running. Check that you have the accounting_storage.so. You may need to recompile everything…

Error:  It looks like the storage has gone away trying to reconnect
Solution: Check that the cluster is seen by the accounting system. If not, you need to add it with an account manager command:

root@node ## > sacctmgr add cluster MYCLUSTER

We also need to set up QOS. To do so, we may need to use the consumable resource allocation plugin select/cons_res, that is to say, tell slurm to manage the CPUs, RAM, and GPUs. Add something like this to your slurm.conf:

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

There are a lot of examples in the slurm documentation on cons_res. Be aware that there is a cons_res bug on hardened systems if you compile slurm with hardening. Let’s define some QOS as in the documentation.

sacctmgr add qos zebra

And see how they look:

sacctmgr show qos format=name,priority
Name       Priority 
---------- ---------- 
normal     0 
zebra      0 
elephant   0
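
A QOS does nothing until it is attached to an association and slurm is told to enforce it. A minimal sketch, reusing the account and user from the examples above (the qos+= syntax and the slurm.conf parameter are standard, but check them against your version):

# attach the QOS to the account and to one user
sacctmgr modify account where name=users set qos+=zebra
sacctmgr modify user where name=verowlet set qos=zebra
# and in slurm.conf, tell slurm to actually enforce associations and QOS
# AccountingStorageEnforce=associations,qos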

Now everything should be fine. We check:

root@node ## > slurmctld -Dvvv

If you need it, here you have the QOS at the biocluster. And the official documentation on slurm accounting. And I’m pretty tired of fixing things, distributing files, and looking at logs. I hope you didn’t need it at all. Finally, just to finish this collection of troubles, the slurm problems page from SCSC. Happy slurming…

GDM not starting on CentOS 7.6

We did a big update, as I mentioned previously, and after the update I found out that some of the graphics cards were no longer supported by the newest NVIDIA drivers for CentOS x64, at this moment NVIDIA-Linux-x86_64-418.43.run. This makes some sense, since the Quadro 4000 was released in November 2010, but on the other hand it is a perfectly fine graphics card, able to do 3D and hook up to 3 monitors. In principle I don’t like to throw away working hardware…unless requested to do so 🙂

I experienced all of these symptoms, depending on what I did:

  • the drivers seem to be running but there is no output from nvidia-smi
  • you get an output that tells you there is no compatible device
  • there is an output from nvidia-smi but GDM crashes

A sample GDM crash on my client “tiny” looks like this:

systemctl status gdm
● gdm.service - GNOME Display Manager
Loaded: loaded (/usr/lib/systemd/system/gdm.service; 
enabled; vendor preset: enabled)
Active: active (running) since XXX; 52s ago
Process: 24578 ExecStartPost=/bin/bash -c TERM=linux 
/usr/bin/clear > /dev/tty1 (code=exited, status=0/SUCCESS)
Main PID: 24575 (gdm)
CGroup: /system.slice/gdm.service
└─24575 /usr/sbin/gdm

XXX tiny systemd[1]: Starting GNOME Display Manager...
XXX tiny systemd[1]: Started GNOME Display Manager.
XXX tiny gdm[24575]: GdmDisplay: display lasted 0.093784 seconds
XXX tiny gdm[24575]: GdmDisplay: display lasted 0.031349 seconds
XXX tiny gdm[24575]: GdmDisplay: display lasted 0.017635 seconds
XXX tiny gdm[24575]: GdmDisplay: display lasted 0.016253 seconds
XXX tiny gdm[24575]: GdmDisplay: display lasted 0.016001 seconds
XXX tiny gdm[24575]: GdmDisplay: display lasted 0.017770 seconds
XXX tiny gdm[24575]: GdmLocalDisplayFactory: 
maximum number of X display failures reached: 
check X server log

Above, XXX corresponds to the date. We check the X server log as suggested. It reads:

root@tiny ~ ## > tail /var/log/Xorg.0.log
[ 344.316] ==== WARNING WARNING WARNING WARNING ================
[ 344.316] This server has a video driver 
ABI version of 24.0 that this
driver does not officially support. Please check
http://www.nvidia.com/ for driver updates or downgrade to an X
server with a supported driver ABI.
[ 344.316] =====================================================
[ 344.316] (EE) NVIDIA: Use the -ignoreABI option to 
override this check.
[ 344.316] (II) UnloadModule: "nvidia"
[ 344.316] (II) Unloading nvidia
[ 344.316] (EE) Failed to load module "nvidia" (unknown error, 0)
[ 344.316] (EE) No drivers available.

To get GDM and a desktop environment back for 418.43 and the Quadro 4000, I tried uninstalling and reinstalling the 418.43 drivers, and installing and using lightdm instead of gdm. Neither solution worked. Installing the previous drivers on the new kernel, I ended up with the message Unable to load the kernel module nvidia.ko, obviously because of the new kernel.

What next? Maybe downgrade to avoid the Xorg crash? From NVIDIA, I downloaded and installed the latest legacy drivers, NVIDIA-Linux-x86_64-390.87.run, and I got my desktop back. Yeah, you can say: “why didn’t you do that to start with?“. The answer is simple: I want homogeneous installations, not one machine with drivers version 390.87 and another with 418.43. But I need to live with the fact that we are not all the same, unfortunately 😦
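
For the record, the downgrade itself is just a matter of dropping out of the graphical target, removing the new driver and running the legacy installer (the .run installers accept --uninstall; the file names are whatever you downloaded):

systemctl isolate multi-user.target
sh ./NVIDIA-Linux-x86_64-418.43.run --uninstall
sh ./NVIDIA-Linux-x86_64-390.87.run
systemctl isolate graphical.target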

GPU performance – an overview

Since Christmas is coming, it’s time to think about what to upgrade and why. Maybe we want to get better GPUs, but better for what? It is still not clear to me what our people use them for. It is clear that they use them, just not how.

Let’s have a look at the previous and current generation NVIDIA high-end GPUs:

                   TITAN X (New)  GTX 1080     GTX 980 Ti   TITAN X (Old)  GTX 980
NVIDIA GPU         GP102          GP104        GM200        GM200          GM204
Architecture       Pascal         Pascal       Maxwell      Maxwell        Maxwell
# Cores            3584           2560         2816         3072           2048
Core Clock         1417 MHz       1607 MHz     1000 MHz     1000 MHz       1126 MHz
Memory             12GB           8GB          6GB          12GB           4GB
Memory Clock       10000 MHz      10000 MHz    7000 MHz     7000 MHz       7000 MHz
Memory Interface   384-bit G5X    256-bit G5X  384-bit      384-bit        256-bit
Memory Bandwidth   480 GB/s       320 GB/s     336 GB/s     336 GB/s       224 GB/s
TDP                250 watts      180 watts    250 watts    250 watts      165 watts
Peak Compute       11.0 TFLOPS    8.2 TFLOPS   5.63 TFLOPS  6.14 TFLOPS    4.61 TFLOPS
Transistors        12B            7.2B         8.0B         8.0B           5.2B
Process Tech       16nm FinFET    16nm FinFET  28nm         28nm           28nm

What can you get from our table (original source here)?

There are a lot of parameters. I have some TITAN X cards, the new (Pascal) and the old (Maxwell). We also have 1080s. Pascal is the codename for the GPU architecture, and GP102 is the specific GPU model.

Let’s say that the new TITAN X Pascal has a GP102 GPU built with 16 nm technology, holding 3584 cores with 12 GB of RAM. The old TITAN X Maxwell has a GM200 GPU built with 28 nm technology, 3072 cores and (also) 12 GB of RAM. Can we feel the difference? Memory clock and core clock are also better on the Pascal version. According to this LinkedIn article here, the performance factors are like this:

  • 25 times more performance from Fermi to Kepler GPU
  • 35 times more performance from Kepler to Maxwell GPU
  • 10 times more performance from Maxwell to Pascal GPU

Is it true in our case? We are not gaming, nor doing deep learning, so I’m going to say no, it is not true. If you want a hierarchy of GPUs for gaming, click here. What we do is use the GPU cores as “CPU cores”. As far as I know, our analysis code has been translated to work over GPU cores and take advantage of the number of processing units and the native image processing capabilities. It was always mind-boggling for me to think that with one of these cards we can use more than 3000 GPU “processing units” on a single machine, versus only 24, 48, or 64 CPU cores or so.

Unfortunately I don’t know, and I don’t measure, whether we use the maximum number of GPU cores available at a given moment, so I’m going to say that in our case the Peak Compute is the main parameter to pay attention to.
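
If we ever want to measure it, nvidia-smi can log utilization over time; something along these lines would do (the query fields and flags are standard nvidia-smi options):

nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used \
  --format=csv -l 10 >> gpu_usage.csv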

Therefore, for scientific calculations (no deep learning) on a Linux system, a Titan X Pascal should be almost TWICE as fast as a Titan X Maxwell (11.0 vs 6.14 TFLOPS, a factor of about 1.8). Not 10 times faster, or maybe not in our case 😦

I hope this article helps. For more details you can check this Comparison of NVIDIA Tesla/Quadro and NVIDIA GeForce GPUs or even better, benchmark it yourself 🙂

CryoSPARC 2 update on CentOS 7

We previously installed CryoSPARC 2, but since I’m kind of conservative about it, I didn’t update it until I got a feature request from a user. The update is in principle quite easy: you just run cryosparcm update on the master. We have cryoSPARC installed under the user “user”, with one master mymaster that also has computing power, and one node myslave. This is my output, as usual formatted so that I don’t give you sensitive information:

cryosparcm update
CryoSPARC current version v2.0.27
update starting on XXX CEST 2018
No version specified - updating to latest version.
=============================
Updating to version v2.3.2.
=============================
CryoSPARC is not already running.
If you would like to restart, use cryosparcm restart
Downloading master update...
% Total % Received % Xferd Average Speed Time Current
...stuff you see when downloading ...
100 460M 100 408:00 0:08:00 --:--:-- 4497k
Done.
Update will now be applied to the master installation, 
followed by worker installations on other node.
Deleting old files...
Extracting...
Done.
Updating dependencies...
Checking dependencies... 
Legacy hash (v2.2 & older) detected. 
No new python dependencies need to be installed.
Currently checking hash for mongodb
Dependencies for mongodb have changed - reinstalling...
Completed.
Completed dependency check. 
Successfully updated master from
version v2.0.27 to version v2.3.2.
Starting cryoSPARC System master process..
CryoSPARC is not already running.
database: started
command_core: started
cryosparc command core startup complete.
command_vis: started
command_proxy: started
webapp: started
--------------------------------------
CryoSPARC master started. 
From this machine, access the webapp at
http://localhost:39000
From other machines on the network, access at
http://mymaster.com:39000
Startup can take several minutes. 
Point your browser to the address
and refresh until you see the cryoSPARC web interface.
CryoSPARC is running.
Stopping cryosparc.
command_proxy: stopped
command_vis: stopped
webapp: stopped
command_core: stopped
database: stopped
Shut down
Starting cryoSPARC System master process..
CryoSPARC is not already running.
database: started
command_core: started
cryosparc command core startup complete.
command_vis: started
command_proxy: ERROR (spawn error)
webapp: started
--------------------------------------
CryoSPARC master started. 
From this machine, access the webapp at
http://localhost:39000
From other machines on the network, access at
http://mymaster.com:39000
Startup can take several minutes. 
Point your browser to the address
and refresh until you see the cryoSPARC web interface.
=======================================
Now updating worker nodes.
All workers: 
mymaster user@mymaster
myslave user@myslave
-------------------------------------------------
Updating worker mymaster
Remote update
scp ./cryosparc2_worker.tar.gz 
user@mymaster:/home/user/cryosparc2/cryosparc2_worker
ssh user@mymaster 
/home/user/cryosparc2/cryosparc2_worker/bin/cryosparcw update
Updating... checking versions
Current version v2.0.27 - New version v2.3.2 
=============================
Updating worker...
=============================
Deleting old files...
Extracting...
Done.
Updating dependencies...
Checking dependencies... 
Legacy hash (v2.2 & older) detected. 
No new python dependencies need to be installed.
Currently checking hash for ctffind
Dependencies for ctffind have changed - reinstalling...
Completed.
Completed dependency check. 
Successfully updated.
-------------------------------------------------
Updating worker myslave
Remote update
scp ./cryosparc2_worker.tar.gz 
user@myslave:/home/user/cryosparc2/cryosparc2_worker
ssh user@myslave 
/home/user/cryosparc2/cryosparc2_worker/bin/cryosparcw update
Updating... checking versions
Current version v2.3.2 - New version v2.3.2 
Already up to date
Done updating all worker nodes.
If any nodes failed to update, you can manually update them.
Cluster worker installations must be manually updated.
To update manually, simply copy the cryosparc2_worker.tar.gz
file into the cryosparc worker installation directory, 
and then run 
$ bin/cryosparcw update 
from inside the worker installation directory.

After all this, the web interface is fine, but the users report errors like this one:

Traceback (most recent call last):
File "cryosparc2_compute/jobs/runcommon.py", 
line 738, in run_with_except_hook run_old(*args, **kw)
File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py",
line 92, in cryosparc2_compute.engine.cuda_core.GPUThread.run
File "cryosparc2_worker/cryosparc2_compute/engine/cuda_core.py", 
line 93, in cryosparc2_compute.engine.cuda_core.GPUThread.run
File "cryosparc2_worker/cryosparc2_compute/engine/engine.py", 
line 987, in cryosparc2_compute.engine.engine.process.work

The forum is the right place to look for errors. It looks like we have a cuda error, due to the fact that we have the cuda drivers on a “network” folder. I run the installer again on the worker so that this time it points to the “real” cuda drivers. For that we need to remove the worker from the scheduler first. Logged in on mymaster, we do it like this:

user@mymaster $ > cryosparcm cli \
'remove_scheduler_target_node("mymaster")'
user@mymaster $ > bin/cryosparcw connect \
--worker mymaster --master mymaster \
--ssdpath /home/user/cryosparc_ssd
-------------------------------------------------------
CRYOSPARC CONNECT --------------------------------------------
---------------------------------------------------------------
Attempting to register worker mymaster to command mymaster:39002
Connecting as unix user user
Will register using ssh string: user@mymaster
If this is incorrect, you should re-run this 
command with the flag --sshstr <ssh string> 
--------------------------------------------
Connected to master.
------------------------------------------
Current connected workers:
mymaster
myslave
--------------------------------------------------------
Autodetecting available GPUs...
Detected X CUDA devices.
id pci-bus name
----------------------------------------------
...list of CUDA devices ...
-----------------------------------------------
All devices will be enabled now. 
This can be changed later using --update
---------------------------------------------
Worker will be registered with SSD cache location 
/home/user/cryosparc_ssd 
--------------------------------------------------
Autodetecting the amount of RAM available...
This machine has XXX RAM .
---------------------------------------------------------------
Registering worker...
Done.
You can now launch jobs on the master node 
and they will be scheduled
on to this worker node if resource requirements are met.
-----------------------------------------------------
Final configuration for mymaster
lane : default
name : mymaster
title : Worker node mymaster
resource_slots : {u'GPU': [], u'RAM': []}
hostname : mymaster
worker_bin_path : 
/home/user/cryosparc2/cryosparc2_worker/bin/cryosparcw
cache_path : /home/user/cryosparc_ssd
cache_quota_mb : None
resource_fixed : {u'SSD': True}
cache_reserve_mb : 10000
type : node
ssh_str : user@mymaster
desc : None
--------------------------------------------------------

As usual, I removed the sensitive info. After reconnecting the workers, the cuda error is gone. We’ll see if it comes back; I hope not!