GPFS: force deletion of a node

In the spirit of this being my notebook ❤️ I want to write down my experience forcing the deletion of a GPFS node. The disk of the node ‘deadnode’ died without warning (or was it that I wasn’t looking?), so I didn’t have a chance to properly stop GPFS and so on. Let’s go and force the deletion.

quorum ~ ## > mmdelnode -N deadnode
Verifying GPFS is stopped on all affected nodes ...
The authenticity of host 'deadnode ('
can't be established.
Are you sure you want to continue connecting (yes/no)? yes
deadnode: bash: /bin/ksh: No such file or directory
mmdelnode: Unable to confirm that GPFS is stopped
on all of the affected nodes.
Nodes should not be removed from the cluster
if GPFS is still running.
Make sure GPFS is down on all affected
nodes before continuing.
If not, this may cause a cluster outage.
mmdelnode: If the affected nodes are permanently down,
they can be deleted with the --force option.
mmdelnode: Command failed.
Examine previous error messages to determine cause.
quorum ~ ## > mmdelnode -N --force deadnode
mmdelnode: Incorrect extra argument: deadnode
mmdelnode {-a | -N {Node[,Node...] | NodeFile | NodeClass}}
quorum ~ ## > mmdelnode -N deadnode --force
Verifying GPFS is stopped on all affected nodes ...
deadnode: bash: /bin/ksh: No such file or directory
mmdelnode: Unable to confirm that GPFS is stopped
on all of the affected nodes.
Nodes should not be removed from the cluster
if GPFS is still running.
Make sure GPFS is down on all affected nodes before continuing.
If not, this may cause a cluster outage.
Do you want to continue? (yes/no) yes
mmdelnode: Command successfully completed
mmdelnode: Propagating the cluster configuration data
to all affected nodes.
This is an asynchronous process.
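For reference, this is the orderly removal path when the node is still alive (a sketch, assuming a reachable node and working GPFS tools; ‘somenode’ is a placeholder). The guard makes the script a harmless no-op on machines without GPFS:

```shell
# Orderly GPFS node removal sketch - stop GPFS on the node BEFORE deleting it.
if command -v mmshutdown >/dev/null 2>&1; then
  mmshutdown -N somenode   # stop GPFS on the node first
  mmdelnode -N somenode    # then delete it cleanly, no --force needed
  mmlscluster              # confirm the node is gone
else
  echo "GPFS commands not found - nothing to do"
fi
```

With the node already dead, as in my case, only the `--force` route above remains.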

If you don’t know what a GPFS filesystem is, just search on this very blog. There’s plenty of information above. Maybe too much information. 😁😁😁

kernel:NMI watchdog: BUG: soft lockup on CentOS 7.X

I saw the soft lockup bug before, but I didn’t write down the fix. On my AMD servers, I sometimes see the shell flooded with messages like the following:

May 25 07:23:59 XXXXXXX kernel: [13445315.881356] BUG: soft lockup - CPU#16 stuck for 23s! [yyyyyyy:81602]

The date, the CPU, and the duration it’s stuck vary. Your analysis will still run and the machine is still usable, but the messages can become so annoying (especially if you have a lot of CPUs) that they may not let you work. I found a post for SUSE describing something really similar to what happens in CentOS, so I gave it a try. First I check that watchdog_thresh is defined:

hostname:~ # more /proc/sys/kernel/watchdog_thresh
10

Then I do as suggested:

hostname:~ # echo 20 > /proc/sys/kernel/watchdog_thresh

And so far the message has not come back. Problem solved! 😊😊
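Since values echoed into /proc do not survive a reboot, the same setting can be persisted through a sysctl drop-in (the filename below is my choice; any *.conf under /etc/sysctl.d works):

```
# /etc/sysctl.d/99-watchdog.conf  (assumed filename)
# Raise the soft-lockup detection threshold from 10s to 20s
kernel.watchdog_thresh = 20
```

Apply it with sysctl --system, or just reboot.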

ERROR: MediaWiki not showing up after a reboot, skin missing

The missing skin message

I like the motto if something works, don’t touch it. Unfortunately, in the era of CI/CD we need to touch it, and sometimes we even need to reboot it. This is what happened to me. There’s this nice MediaWiki 1.31 install that has been running for years in my team, not as a docker or a fancy VM but as a real web server, with its httpd and mariadb services. I run weekly backups but I don’t update it, precisely because I was afraid of it. For a reason. After a stupid power cut, when I managed to get the web server back online, I found out that the httpd and mariadb services were running but MediaWiki was showing a Blank Page. As written on the link before, I go to my /var/www/html/ to edit the LocalSettings.php. Right at the top I copy:

error_reporting( E_ALL );
ini_set( 'display_errors', 1 );

and do systemctl restart httpd. The error instead of the Blank Page reads like this:

Fatal error: Uncaught Exception: /var/www/html/skins/MonoBook/skin.json 
does not exist! in /var/www/html/

and similar. I do check, and for some weird reason (an update? the other sysadmin?) the skins folder is gone. I then try to disable the skins by modifying LocalSettings.php and restarting httpd. It works, somehow. I get the wiki content unskinned, with an error on top like this:

The wrong skin message

What to do now? I went for downloading the whole wiki zip again (the same version), copying the skins folder back, uncommenting LocalSettings.php, and restarting httpd. And the wiki is back. What to do next? An invisible migration to a more reliable setup, maybe a kubernetes pod. We’ll see if I have time. Or mood. Or actually, both 😉.
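For the record, the skin wiring that gets commented out and back in lives in LocalSettings.php; in 1.31 it looks roughly like this (a sketch — MonoBook assumed, since that is the skin named in the fatal error):

```php
// LocalSettings.php (sketch). wfLoadSkin() is what throws the fatal error
// when skins/MonoBook is missing; commenting these lines "disables" the skin.
wfLoadSkin( 'MonoBook' );
$wgDefaultSkin = 'monobook';
```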

Portainer on Kubernetes for CentOS 7.X

I already installed Portainer as a docker and I’m very happy about it. Is it the same sensation on a Kubernetes cluster? I need to deploy it to see 😉. The official How-To is here. This is my experience with it.

First I stumble over the pre-req note. Since I have a bare-metal Kubernetes cluster (that is, something I installed myself) I forgot to define the kubernetes storage classes. Here you have the kubernetes storage classes documentation. As you can expect, since it is a kubernetes feature, you can connect your cluster to a wide variety of storages, cloud-based or not. I don’t want to speak about what I don’t know, so I’ll simply go ahead and add from my Kubernetes Dashboard the yaml files for a storage class sc and for a persistent volume pv. Just press the “+” on the web and add the following yaml:

apiVersion:
kind: StorageClass
metadata:
  name: local-storage
volumeBindingMode: WaitForFirstConsumer

This should create the sc called “local-storage”. I didn’t change a comma from the documentation. We can see it by typing kubectl get sc. This is my output:

root@kube ## > kubectl get sc
local-storage (default) Delete WaitForFirstConsumer false 13m

Sorry, but there’s no way to format this right 😁😁. We will also need a local persistent volume. This is my yaml:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: kube-local-pv
spec:
  capacity:
    storage: 500Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /my/storage/
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key:
          operator: In
          values:
          - kube

I didn’t change much with respect to the kubernetes local persistent volume documentation. I also add it through the web, although you should have already figured out how to do it from the command line. Now that we have our storage, we can go ahead and deploy. I first try through helm, but somehow it doesn’t seem to work. Anyway, here’s my output:

root@kube ~ ## > helm install -n portainer portainer portainer/portainer --set persistence.storageClass=kube-local-pv
NAME: portainer
LAST DEPLOYED: Thu Aug 26 14:16:11 2021
NAMESPACE: portainer
STATUS: deployed
Get the application URL by running these commands:
export NODE_PORT=$(kubectl get --namespace portainer -o jsonpath="{.spec.ports[0].nodePort}" services portainer)
export NODE_IP=$(kubectl get nodes --namespace portainer -o jsonpath="{.items[0].status.addresses[0].address}")
echo http://$NODE_IP:$NODE_PORT

Again, sorry for the editing. The above command is not like in the tutorial due to the changes in helm v3. I guess the kubernetes world is evolving pretty quickly, so to say. I don’t pass the namespace flag; instead I create the namespace through the command line:

kubectl create -f portainer.yaml

My portainer.yaml looks like this:

apiVersion: v1
kind: Namespace
metadata:
  name: portainer

Of course it will also work if you add that through the dashboard. Anyhow. It didn’t work with helm (the pod was always pending resources), so I went for a YAML manifest install. First we undo the helm deployment by deleting the namespace and the cluster role binding. We can do that from the dashboard or via the command line. The command line cleanup looks like this:

root@kube ~ ## > kubectl delete namespace portainer
namespace "portainer" deleted
root@kube ~ ## > kubectl delete clusterrolebinding portainer "portainer" deleted

Then we deploy as NodePort. This is my output:

root@kube ~ ## > kubectl apply -n portainer -f
namespace/portainer created
serviceaccount/portainer-sa-clusteradmin created
persistentvolumeclaim/portainer created
service/portainer created
deployment.apps/portainer created

After a few seconds, I can access my portainer GUI through the address kube:30777. Everything is pretty similar to the docker version, so ❤️❤️ success!

Rootless docker error on CentOS 7: failed to mount overlay: operation not permitted storage-driver=overlay2

While trying rootless docker on my servers, I found a lot of issues. They recommend using an Ubuntu kernel, but I use CentOS 7.X, so I need to stick with it. The prerequisites are fine – I have newuidmap and newgidmap and enough subordinate IDs. This is how it looks when I run the rootless setup script as the user user.

user@server ~ $ > install
[ERROR] Missing system requirements.
[ERROR] Run the following commands to
[ERROR] install the requirements and run this tool again.

########## BEGIN ##########
sudo sh -eux <<EOF
# Set user.max_user_namespaces
cat <<EOT > /etc/sysctl.d/51-rootless.conf
user.max_user_namespaces = 28633
EOT
sysctl --system
# Add subuid entry for user
echo "user:100000:65536" >> /etc/subuid
# Add subgid entry for user
echo "user:100000:65536" >> /etc/subgid
EOF
########## END ##########

We go as root and cut-and-paste the thing above between the #. This is the output, edited:

root@server ~ ## > cut-and-copy-of-the-thing-above
+ cat
+ sysctl --system
* Applying /usr/lib/sysctl.d/00-system.conf ...
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0
* Applying /usr/lib/sysctl.d/10-default-yama-scope.conf ...
kernel.yama.ptrace_scope = 0
* Applying /usr/lib/sysctl.d/50-default.conf ...
kernel.sysrq = 16
kernel.core_uses_pid = 1
kernel.kptr_restrict = 1
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.conf.all.accept_source_route = 0
net.ipv4.conf.default.promote_secondaries = 1
net.ipv4.conf.all.promote_secondaries = 1
fs.protected_hardlinks = 1
fs.protected_symlinks = 1
* Applying /etc/sysctl.d/51-rootless.conf ...
user.max_user_namespaces = 28633
* Applying /etc/sysctl.d/99-sysctl.conf ...
* Applying /etc/sysctl.conf ...
+ echo user:100000:65536
+ echo user:100000:65536
root@server ~ ## > ########## END ##########

Time to try again. The result gives no error, but it’s not like in the tutorial. Here you have it:

user@server ~ $ > install
[INFO] systemd not detected,
needs to be started manually:
[INFO] Creating CLI context "rootless"
Successfully created context "rootless"
[INFO] Make sure the following environment variables
are set (or add them to ~/.bashrc):
export PATH=/usr/bin:$PATH
export DOCKER_HOST=unix:///run/user/3201/docker.sock
user@server ~ $ >

So what does it mean to start it manually? After reading this bug report, I decide to try running it with the experimental flag and specifying the storage driver. This is my output, as usual edited for proper reading. Important messages in blue, comments in italics, errors in red.

user@server ~ $ > \
--experimental --storage-driver overlay2
+ case "$1" in
+ '[' -w /run/user/USERID ']'
+ '[' -w /home/user ']'
--> some user-dependent messages...
+ exec dockerd --experimental --storage-driver overlay2
INFO[] Starting up
WARN[] Running experimental build
WARN[] Running in rootless mode.
This mode has feature limitations.
INFO[] Running with RootlessKit integration
...more messages here, loading plugins...
INFO[] skip loading plugin "io.containerd.snapshotter.v1.aufs"...
error="aufs is not supported: skip plugin"
INFO[] loading plugin "io.containerd.snapshotter.v1.devmapper"...
WARN[] failed to load plugin
error="devmapper not configured"

INFO[] loading plugins..
INFO[] skip loading plugin "io.containerd.snapshotter.v1.zfs"...
error="path must be a zfs : skip plugin"
WARN[] could not use snapshotter devmapper
in metadata plugin
error="devmapper not configured"

INFO[] metadata content store policy set policy=shared
INFO[] loading a lot of plugins successfully...
...more messages here, loading plugins...
INFO[] serving... address=/run/user/USERID/docker/containerd/sockets
INFO[] serving...
INFO[] containerd successfully booted in 0.033234s
WARN[] Could not set may_detach_mounts kernel parameter
error="error opening may_detach_mounts kernel config file:
open /proc/sys/fs/may_detach_mounts: permission denied"

INFO[] parsed scheme: "unix" module=grpc
...more messages here...
INFO[] ClientConn switching balancer to "pick_first" module=grpc
ERRO[] failed to mount overlay:
operation not permitted storage-driver=overlay2
INFO[] stopping event stream following graceful shutdown
error="context canceled"
module=libcontainerd namespace=plugins.moby
failed to start daemon:
error initializing graphdriver: driver not supported

[rootlesskit:child ] error:
command [/usr/bin/
--experimental --storage-driver overlay2]
exited: exit status 1

[rootlesskit:parent] error: child exited: exit status 1

What do I get from the above run? There are warnings on zfs, aufs, and finally overlay2, so it looks like there’s some kind of problem with the storage driver. You can also get a dark failed to register layer message or an error creating overlay mount. It makes sense, since I’m coming from a fully working root install. I try once more without the storage driver option, and this is the (interesting part of the) output:

ERRO[] failed to mount overlay: 
operation not permitted storage-driver=overlay2
ERRO[] AUFS cannot be used in non-init user namespace
ERRO[] failed to mount overlay: operation not permitted
INFO[] Attempting next endpoint for pull after error:
failed to register layer:
ApplyLayer exit status 1 stdout:
stderr: open /root/.bash_logout: permission denied

So if you don’t give a storage option, it tries them all. Mystery solved, I guess. You can have a look at the available overlayfs documentation (covering overlay and overlay2). But in short, the docker daemon running under user doesn’t manage to access the storage drivers. Let’s have a look at the docker storage options. Some documentation first. We know how to change the directory used to store containers and images in Docker. In my CentOS 7.X, I see my daemon runs overlay2, and indeed the images, once downloaded, are stored on /var/lib/docker/overlay2. I can change the docker image installation directory by editing /etc/docker/daemon.json. I add something like this:

{
  "data-root": "/extrahd/docker",
  "storage-driver": "overlay2"
}

I then clean up with docker system prune -a and restart my docker daemon, still as root, to be sure the newly downloaded images end up on /extrahd/docker. As expected 😉. In principle the given location cannot be a GPFS or a CIFS mounted folder, or I end up getting all the driver errors.
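A quick way to check that a candidate data-root sits on a local filesystem (overlay2 will not work on top of network mounts) — /extrahd is the example path from above, so substitute your own:

```shell
# Show the filesystem type backing the docker data-root candidate.
# overlay2 needs a local filesystem (e.g. xfs with ftype=1, or ext4),
# not NFS/CIFS/GPFS. Falls back to / if the path does not exist here.
df -T /extrahd 2>/dev/null || df -T /    # look at the 'Type' column
```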

Will this work in rootless mode? In my case, it was not possible until I did the same trick as for root. So one needs to configure the user docker daemon. For my user user, it should be located on


Remember, of course, that the storage must be writable by the user user 😉😉. I hope now you can run rootless docker on CentOS 7.X as I can! 🤘😊.
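For rootless mode the per-user daemon config normally lives under ~/.config/docker/ (that is the documented default location; the data-root path below is just an illustration — point it at storage your user can actually write):

```shell
# Create the per-user daemon config for rootless docker.
# Paths are illustrative: /extrahd/docker-rootless is a made-up example.
mkdir -p "$HOME/.config/docker"
cat > "$HOME/.config/docker/daemon.json" <<'EOF'
{
  "data-root": "/extrahd/docker-rootless",
  "storage-driver": "overlay2"
}
EOF
```

Then restart the user daemon (e.g. systemctl --user restart docker, if you run it under systemd) and check with docker info that the new data-root is active.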

Bonus: a docker storage using a bind mount (not tested), how to control docker with systemd, and the full docker daemon configuration file documentation.

Portainer, a docker GUI in a docker, on CentOS 7.X

The Portainer web. As taken from this tutorial.

It’s been a while since I started with dockers and kubernetes, but so far I didn’t show you a docker management solution, only a docker usage cheat sheet. Well, forget about that: now I give you Portainer, a web app that will allow you to check your images, volumes, and containers. Of course, running as a docker.

The image above I’ve taken from this portainer installation tutorial for Ubuntu. It includes a docker installation, but I expect you already have dockers running. Anyway, once with dockers, the install is pretty simple. Just like this:

docker run -d -p 9000:9000 -v /var/run/docker.sock:/var/run/docker.sock portainer/portainer

After that, you should be able to access the web (machinename:9000) and finish up by creating the initial admin password. So far I’m very satisfied with the experience, but if you are not, you can check this comparison of docker GUIs. What’s next? Isolation, security, and kubernetes integration. I’ll keep you posted…

A browser in a browser: a firefox docker

Different logos. Choose yours. Image taken from here.

I don’t trust flight companies. I have the impression that if I look for a flight and I find a good price but I don’t buy it on the spot, next time I come the price is gone. This may be because of the cookies, or it can be that the server actually saves the searches from the IP you are using, so that “for a better service” they can remember what you searched before. Or it can be because good prices fly away 😁😁😁.

To minimize this issue I have taken a radical approach: I’m using a firefox docker to search for the flights, a docker that I destroy after my search. If you have followed my blog you definitely have a user able to run a docker on your OS, whatever OS it is. I’ve tried several solutions, but the best so far is this one:

docker run -d --name=firefox -p 5800:5800 -v /docker/appdata/firefox:/config:rw --shm-size 2g jlesage/firefox

After this is done, open your browser (chrome, konqueror, safari, it doesn’t matter) and write your IP or machine name followed by :5800 to reach the firefox docker instance. You will see a browser in a browser, but it delivers. I feel safer now than running in incognito or similar. It’s fast, and it’s dirty. Original firefox docker image here.

If you want a full desktop in your browser, you can also have it. I wrote about it a long time ago. Unfortunately, a docker-as-app solution, when run as a user, throws me a display error. Like this:

(firefox:1): Gtk-WARNING **: 11:35:33.496: 
Locale not supported by C library.
Using the fallback 'C' locale.
Unable to init server:
Broadway display type not supported:
Error: cannot open display:

As root, of course, it runs fine, but we are not always root, are we? 😉

EDIT: If you want to completely delete the firefox history you will need to delete the appdata as well. To play completely on the safe side, do:

docker system prune -a;

/docker/appdata ## > rm -rf firefox/

Careful: system prune will remove ALL your docker stuff 😉

Deepmind and Alphafold for dummies (CentOS 7.X install)

It’s about time: I’m asked to install Alphafold. We discussed it last year in our group meetings and a lot of our scientists were quite skeptical about the idea. I was not, but I don’t really count as a scientist. Time has helped make the installation quite easy, but there were some edges I found following the installation guide that I’d like to log here. We start by installing docker and nvidia-docker2 on the target machine. As root:

##> yum remove docker\* # remove any old docker install
##> yum-config-manager --add-repo \
##> yum install docker-ce docker-ce-cli
##> distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L$distribution/nvidia-docker.repo | \
sudo tee /etc/yum.repos.d/nvidia-docker.repo
# ---> outputs the current repositories
##> DIST=$(sed -n 's/releasever=//p' /etc/yum.conf)
##> DIST=${DIST:-$(. /etc/os-release; echo $VERSION_ID)}
##> sudo yum makecache
# ---> nvidia runtime routines added
##> yum install nvidia-docker2
##> systemctl enable docker; systemctl start docker

We test that the given nvidia docker works…

##> docker run --runtime=nvidia --rm nvidia/cuda:11.4.0-devel-centos8 nvidia-smi

Now we should add our user to the docker group

##>  usermod -aG docker $USER

and build the default alphafold image.

user ##> docker build -f docker/Dockerfile -t alphafold .
user ##> pip3 install -r docker/requirements.txt

Time to run. We need a FASTA file. We can get it from here. Search for a protein, like methanocaldococcus, click on one entry, then click on FASTA under the name. You should know this better than me. Copy the alphanumeric string 😉 into a new text file and name it test.fasta. We run over that file. Like this:

python3 docker/ --fasta_paths=test.fasta --max_template_date=2020-05-14

You will then get the docker logs dumped on the current shell.
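If you just want a file to test the plumbing before hunting down a real entry, you can create one by hand — the header and sequence below are placeholders, not a real UniProt record:

```shell
# Minimal FASTA file for a plumbing test. The sequence is a placeholder,
# not real data: one '>' header line, then amino-acid letters.
cat > test.fasta <<'EOF'
>query placeholder sequence for a smoke test
MSTNPKPQRKTKRNTNRRPQDVKFPGG
EOF
```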

Problems: in principle it runs, and it produces meaningful output if you know how to read it 😉. So don’t expect rotating structures like the ones above popping up 😁😁😁. Unfortunately, with a real fasta I quite frequently get some python tensorflow errors. I have tried fiddling with the docker definition in docker/Dockerfile, but I’ve lost one day to end up nowhere. If you are in my position, here you have possible modifications of the Dockerfile:

FROM nvidia/cuda:${CUDA}-cudnn8-runtime-ubuntu18.04

One can pick another supported CUDA tag. But be aware that tags like 11.4 may NOT end up with a successfully tagged image. The message you should get after the docker build is: Successfully tagged alphafold:latest. If you don’t get it, something (cuda, usually) is wrong or missing for the chosen set of parameters. My solution so far was to update the CUDA drivers on the host. Your solution can be better… let me know, please 😉.

Install and configure a VNC server on CentOS 7.X

A random image of a VNC viewer. Here the source.

I don’t need to give explanations to myself. Or do I? Let me remind myself that I want to set up a VNC server as a first step towards a slurm virtualization solution. It means I want to achieve a remote desktop solution that I can call as a user from a queuing system. I don’t know how to do that yet, but I do need a clear path to get a remote desktop on a CentOS 7.X server. Which is what you will find in this linuxize tutorial. Actually, the tutorial is so good that I will not add anything to it. Just follow it. Then I will refer to it in my next steps. If any. 😉

Amira Flexnet server reinstall

Previously I installed a flexnet server on a VM. Now that the server has crashed (or something) I’ve decided to install everything again. This is my log.

root@amira:~# su amira
amira@amira:/root$ cd
amira@amira:~$ cd /usr/local/FlexNetLmadminInstaller/
amira@amira:/usr/local/FlexNetLmadminInstaller$ ls
amira@amira:/usr/local/FlexNetLmadminInstaller$ sudo ./lmadmin-i86_lsb-11_14_0_0.bin
[sudo] password for amira:
Preparing to install...
Extracting the JRE from the installer archive...
Unpacking the JRE...
Extracting the installation resources from the installer archive...
Configuring the installer for this system's environment...

Launching installer...

Graphical installers are not supported by the VM.
The console mode will be used instead...
FlexNet Publisher License Server Manager
(created with InstallAnywhere)
Preparing CONSOLE Mode Installation...
InstallAnywhere will guide you through the installation
of FlexNet Publisher License Server Manager (lmadmin)
It is strongly recommended that you quit
all programs before continuing with this installation.
Respond to each prompt to proceed to the next step in the installation.
If you want to change something
on a previous step, type 'back'.
You may cancel this installation at any time by typing 'quit'.
Choose Install Folder
Only ASCII Characters Are Allowed In Installation Directory.
Where would you like to install?
Default Install Folder: /opt/FNPLicenseServerManager
There is already an installation of FlexNet
Publisher License Server Manager
(lmadmin). Do you want to overwrite?
(By default overwritten)
->1- YES
2- NO
Get User Input
Do you want to import files from previous installation?
(By default importing is disabled)
1- Import Installation files
Pre-Installation Summary
Please Review the Following Before Continuing:
Product Name:
FlexNet Publisher License Server Manager
Install Folder:
Disk Space Information (for Installation Target):
Required: 172,718,540 Bytes
Available: XXX Bytes
1- Yes to All
2- Yes
3- No
4- No to All
Do you want to overwrite the existing file?: 1
Launch Configuration
Configure the HTTP port number at which
the License Server Management Interface
can be accessed using a web browser.
Enter the HTTP Port Number (Default: 8090):
Launch Configuration
Configure the TCP/IP port number at which licensing application
will communicate with the License Server Manager
Enter the License Server Port Number (Default: 0):
Launch Configuration
Configure the cache timeout value for the updation of
count of activated licenses to be displayed on
web UI's dashboard.
Enter the cache timeout value (Default: 600):
Launch Configuration
Configure AlertTimeInterval value for updation of
alerts on web UI's dashboard.
Enter the alertTimeInterval value (Default: 600):
Launch Configuration
Configure the restart retry value for the maximum number
of times vendor daemon is attempted to be restarted if it goes down.
Enter the number of vendor daemon restart retries (Default: 10):
Installation Complete
FlexNet Publisher License Server Manager has been
successfully installed to:

I pressed ENTER each time an option was given, and I overwrote existing files when possible. The idea is to have the installation in the same place as before. Now we chown and start the server as written in the support document.

amira@amira:$ sudo chown -R amira /opt/FNPLicenseServerManager
amira@amira:$ /opt/FNPLicenseServerManager/lmadmin

The webpage http://amira:8090/systeminfo works. We get an image similar to the one above (taken also from the support document). We create the license.dat, change the admin password, and try to start the mcslmd. Since it doesn’t work, we copy the necessary files.

amira@amira:$ cp FlexNetLicenseServerTools/mcslmd /opt/FNPLicenseServerManager/
13:15:05 (mcslmd) ERROR: vendor binary mcslmd not found - request to start ignored.

amira@amira:$ cp FlexNetLicenseServerTools/ /opt/FNPLicenseServerManager/
13:17:48 (mcslmd) ERROR: vendor binary mcslmd not found - request to start ignored.

Sadly, that seems NOT to be enough. For some reason, mcslmd doesn’t run. As root I do

root@amira:~# chmod 777 /opt/FNPLicenseServerManager/mcslmd

and try reimporting the license from the web interface. With success! 😊😊😊. All the daemons run, I can open my previous amira install, and my users are happy 😊 😆 😀.

EXTRA: since we rely on a standard FLEXlm server, here you have the FLEXlm error codes.