[ERROR CRI]: container runtime is not running while kubeadm init on CentOS 7.X

More on Kubernetes. When I try to initialize kubeadm, I get the following error:

## > kubeadm init
[init] Using Kubernetes version: v1.25.3
[preflight] Running pre-flight checks
error execution phase preflight:
[preflight] Some fatal errors occurred:
[ERROR CRI]: container runtime is not running:
output: E1019 15:49:03.837827 19294 remote_runtime.go:948]
"Status from runtime service failed"
err="rpc error: code = Unimplemented
desc = unknown service runtime.v1alpha2.RuntimeService"
time="XXXX" level=fatal
msg="getting status of runtime: rpc error:
code = Unimplemented desc = unknown service
runtime.v1alpha2.RuntimeService", error: exit status 1
[preflight] If you know what you are doing,
you can make a check non-fatal with
`--ignore-preflight-errors=...`
To see the stack trace of this error execute
with --v=5 or higher

The solution I found, after a few starts and stops of services and the deletion of a few files, seems to be this – successful output included:

## > rm /etc/containerd/config.toml
## > systemctl restart containerd
## > kubeadm init
[init] Using Kubernetes version: v1.25.3
[preflight] Running pre-flight checks
[preflight] Pulling images required for setting up
a Kubernetes cluster
[preflight] This might take a minute or two,
depending on the speed of your internet connection
[preflight] You can also perform this action in
beforehand using 'kubeadm config images pull'
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "ca" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [
kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local MASTERNODENAME]
and IPs [ONE_IP MY_IP]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "etcd/ca" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for
DNS names [localhost MASTERNODENAME] and
IPs [MY_IP ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for
DNS names [localhost MASTERNODENAME] and
IPs [MY_IP ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[apiclient] All control plane components are healthy after 8.003061 seconds
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[kubelet] Creating a ConfigMap "kubelet-config" in namespace kube-system with the configuration for the kubelets in the cluster
[upload-certs] Skipping phase. Please see --upload-certs
[mark-control-plane] Marking the node MASTER as control-plane by adding the labels: [node-role.kubernetes.io/control-plane node.kubernetes.io/exclude-from-external-load-balancers]
[mark-control-plane] Marking the node MASTER as control-plane by adding the taints [node-role.kubernetes.io/control-plane:NoSchedule]
[bootstrap-token] Using token: TOKEN
[bootstrap-token] Configuring bootstrap tokens, cluster-info ConfigMap, RBAC Roles
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to get nodes
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstrap-token] Configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstrap-token] Configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[bootstrap-token] Creating the "cluster-info" ConfigMap in the "kube-public" namespace
[kubelet-finalize] Updating "/etc/kubernetes/kubelet.conf" to point to a rotatable kubelet client certificate and key
[addons] Applied essential addon: CoreDNS
[addons] Applied essential addon: kube-proxy
Your Kubernetes control-plane has initialized successfully!
To start using your cluster,
you need to run the following as a regular user:

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

Alternatively, if you are the root user, you can run:

export KUBECONFIG=/etc/kubernetes/admin.conf

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join MY-IP:6443 \
--token TOKEN \
--discovery-token-ca-cert-hash HASH

That’s it. So, another post with the solution to an issue. Let’s get going with my Kubernetes…
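For the record, the likely cause: the packaged containerd on CentOS ships an /etc/containerd/config.toml that explicitly disables the CRI plugin, which is exactly the service kubeadm is trying to reach. That’s why deleting the file (falling back to the built-in defaults) fixes it. A sketch of the relevant line, if you prefer to keep a config file rather than delete it:

```toml
# /etc/containerd/config.toml – the packaged file ships with:
#   disabled_plugins = ["cri"]
# Emptying the list (or deleting the whole file, as above) re-enables
# the CRI service that kubeadm talks to. Restart containerd afterwards.
disabled_plugins = []
```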


slurm_gpustat: a command-line tool to check GPU usage on a SLURM cluster.

I’m still looking for the best (and easiest) tool to find out if we need to buy more GPUs, and of which type. One of the tools I found, gpuview, I have already described in detail. Unfortunately, when the number of GPUs grows, the web dashboard becomes very crowded, and there’s no (obvious) way to customize it. Let’s see what this tool does.

The program is Python-based, so the installation is not complicated. In my case, I just had to install Pillow beforehand (pip3 install Pillow). Here you have my output, as usual edited to remove the meaningless information.

## > pip3 install slurm_gpustat
WARNING: Running pip install with root privileges
is generally not a good idea.
Try `pip3 install --user` instead.
Collecting slurm_gpustat
Using cached ..slurm_gpustat-0.0.12-py3-none-any.whl
Requirement already satisfied: beartype (from slurm_gpustat)
... package recollection here ...
Installing collected packages: colored, kiwisolver,
cycler, pyparsing, matplotlib, seaborn, typing-extensions,
zipp, importlib-metadata, humanize, slurm-gpustat
Running setup.py install for colored ... done
Successfully installed colored-1.4.3 cycler-0.11.0
humanize-3.14.0 importlib-metadata-4.8.3
kiwisolver-1.3.1 matplotlib-3.3.4 pyparsing-3.0.9
seaborn-0.11.2 slurm-gpustat-0.0.12
typing-extensions-4.1.1 zipp-3.6.0

No problem here. I open a new shell and run it right away. The output is satisfactory, similar to the one on the GitHub page. I’ve mocked it up a little:

Under SLURM management
There are a total of 52 gpus [up]
4 A100 gpus
6 rtxtitan gpus
8 A40 gpus
34 gtx1080 gpus
There are a total of 50 gpus [accessible]
4 A100 gpus
6 rtxtitan gpus
8 A40 gpus
32 gtx1080 gpus
Usage by user:
There are 50 gpus available:
gtx1080: 32 available
rtxtitan: 6 available
A40: 8 available
A100: 4 available

This is very good, but it raises some questions. What’s the meaning of up vs accessible vs available? I understand available means not busy, or not reserved. We can confirm that by running the command in verbose mode. I’m not going to copy my full output, but a part of it that took me a little while to understand. The command slurm_gpustat --verbose sometimes prints a mysterious message:

There are 50 gpus available:
Missing information for gpunode10: AllocTRES, skipping....
Missing information for gpunode11: AllocTRES, skipping....
Missing information for gpunode12: AllocTRES, skipping....
-> gpunode09: 2 gtx1080
[cpu: 15/40, gres/gpu: 2/2,
gres/gpu:gtx1080: 6/2, mem: 50 GB/500 GB] []
-> gpunode10: 2 gtx1080 [] []
-> gpunode11: 2 gtx1080 [] []
-> gpunode12: 2 gtx1080 [] []

Why is there missing AllocTRES information? Very simple, in principle: the node resources are not booked. If you book the node (with srun, for example) the message will disappear. So don’t worry about it.
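If you want to check what AllocTRES contains on a booked node, you can grep it out of scontrol show node. A minimal sketch of pulling the allocated GPU count out of such a string – the sample value here is made up; on a live cluster you would feed in the real scontrol output:

```shell
# Hypothetical AllocTRES value, as printed by: scontrol show node gpunode09
alloc="AllocTRES=cpu=15,mem=50G,gres/gpu=2"
# Extract the allocated GPU count (empty when the node is not booked,
# which is exactly the "missing AllocTRES" case above).
gpus=$(printf '%s\n' "$alloc" | sed -n 's|.*gres/gpu=\([0-9]*\).*|\1|p')
echo "$gpus"
```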

HOWTO: SLURM + GRAFANA + PROMETHEUS deployment with ansible

I have posted some Ansible tips before. Now we are going to use Ansible to deploy SLURM together with Grafana and Prometheus instances. I’m following this NVIDIA deepops tutorial.

  1. Get the code: git clone https://github.com/NVIDIA/deepops.git
  2. Run the installer: ./scripts/setup.sh
  3. Make an ansible inventory (in config/inventory)
  4. Install it: ansible-playbook -l slurm-cluster playbooks/slurm-cluster.yml
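For step 3, the inventory is a plain Ansible INI file. A minimal sketch with made-up host names – deepops ships an example under its config directory that you should really start from:

```ini
# config/inventory (sketch; hostnames are placeholders)
[slurm-master]
mgmt01

[slurm-node]
gpu01
gpu02

[slurm-cluster:children]
slurm-master
slurm-node
```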

That’s it. You see? Pretty easy… provided you have the knowledge, which I didn’t have before: I installed SLURM with a traditional script instead – ssh to the node, run the script, check, and so on. Oh, technological changes are so amazing sometimes…

SLURM HOWTO submit a process as another user

I don’t need to do it, since being root I could su username and become the user. But let’s say I want to script it. My root script will periodically (each day) submit a job to the SLURM cluster as a specific AD user, which I will call slurmuser. We don’t want to use root because the script doesn’t need root privileges. For this example, we are just going to get the kernel info on a node with one GPU. The command should look like this:

root@login ~ ##> su slurmuser -c 'srun -N 1
--partition=mine -n 1 --gres=gpu:1
--qos=normal --pty uname -r'

After that we should get something like 3.10.0-1160.45.1.el7.x86_64. Why is this useful? Well, at this moment it may not be, but I have plans. I have plans. 😁
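To make it periodic, the one-liner fits nicely in a cron entry. A sketch with made-up paths – note that sbatch is friendlier for cron than srun, since it returns immediately instead of blocking until the job ends:

```
# /etc/cron.d/slurm-daily (hypothetical): every day at 06:00, as root,
# submit a batch job on behalf of slurmuser.
0 6 * * * root su slurmuser -c 'sbatch /opt/scripts/daily-job.sh'
```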

SLURM update to 21.08 on CentOS 7.X

It’s been a while since I did this, but my guidelines are still working. Anyhow I think it’s important to log my experience.

First I tried updating my current install with yum – it works, but you don’t get to the version we want. Therefore I go and download the latest stable version. From the tarball, without unzipping or untarring it, I build my rpm packages, as I did previously.

Oddly enough, the command rpmbuild was not present. Maybe I cleaned it by mistake; it doesn’t matter. I get it through yum: yum install rpm-build. After that, as before, I type rpmbuild -ta slurm*.tar.bz2 to compile and generate the rpm packages. It takes some time, and the packages end up in a special folder – read the output to find out, but in my case it was /root/rpmbuild/RPMS/x86_64/. I copy the generated rpms onto a network folder, to have them available for all my nodes.

Time to install the rpms. Since I need to do the same thing a lot of times, I prepare a small one-liner script with all the packages. It looks like this:

rpm --install slurm-21.08.8-2.el7.x86_64.rpm \
slurm-contribs-21.08.8-2.el7.x86_64.rpm \
slurm-devel-21.08.8-2.el7.x86_64.rpm \
slurm-libpmi-21.08.8-2.el7.x86_64.rpm \
slurm-openlava-21.08.8-2.el7.x86_64.rpm \
slurm-pam_slurm-21.08.8-2.el7.x86_64.rpm \
slurm-perlapi-21.08.8-2.el7.x86_64.rpm \
slurm-slurmctld-21.08.8-2.el7.x86_64.rpm \
slurm-slurmd-21.08.8-2.el7.x86_64.rpm \
slurm-slurmdbd-21.08.8-2.el7.x86_64.rpm

As I wrote, we already have SLURM installed on the nodes. I stop the daemons, then I chase down the old packages and remove them until the above script runs without conflicts. Just in case, back up your slurm.conf, slurmdbd.conf and gres.conf for every node.
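Distributing the three config files to every node is easily scripted too. A dry-run sketch with made-up node names – it only echoes the scp commands; remove the echo to actually copy:

```shell
# Dry run: print the copy commands instead of executing them.
for node in node01 node02 node03; do
  for f in slurm.conf slurmdbd.conf gres.conf; do
    echo scp "/etc/slurm/$f" "root@$node:/etc/slurm/$f"
  done
done
```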

Once done, I try to start the new daemons. Unfortunately it doesn’t work out of the box, and with the worst error possible: a database error. Is the cluster not registered? sacctmgr show cluster shows an error similar to this:

sacctmgr: error: 
failed to open persistent connection to host:DATABASEHOST:6819:
Connection refused
sacctmgr: error: Sending PersistInit msg:
Connection refused

What is going on? I restart mariadb (systemctl restart mariadb), then I try again. Here you have the slurm-user thread about a similar error. Since the past is not important to me, I decide to go ahead and rebuild the accounting configuration. You need to create a cluster, then an account to manage it, then a user linked to that account. I have heard of SLURM AD integration, but let’s leave that for next time, OK? Here you have the documentation page of sacctmgr, with examples of how to do everything needed.
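The rebuild itself boils down to three sacctmgr calls. A dry-run sketch (the echoes just print the commands; the names are made up, and -i skips the confirmation prompts when you drop the echo and run it for real):

```shell
# Dry run of the accounting rebuild; drop the `echo` to execute.
echo sacctmgr -i add cluster mycluster
echo sacctmgr -i add account myaccount cluster=mycluster
echo sacctmgr -i add user slurmuser account=myaccount
```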

I try to submit a job as a user, but the database daemon is still complaining:

Table 'slurm_acct_db.local_job_table' doesn't exist

We can check the table structure, and fix it if needed. After we are happy, we try again.

error: mysql_real_connect failed: 
1045 Access denied for user 'slurm'@'localhost'
(using password: YES)

So, no way. But is this really a database permission issue? After quite some time messing with it, I decide to check the filesystem permissions themselves. I go to /var/spool/, perform chown slurm: slurm/ so that the folder is owned by slurm, and restart all the daemons everywhere (slurmctld, slurmdbd, slurmd). And I can submit jobs again. I managed! 🤘🤘🤘

Failed to start slurmcltd.service: Unit not found while moving the SLURM daemon from one host to another

Things get broken, things get old. It’s time to move the SLURM controller from one server to another. If you don’t know about SLURM, check this out, for example. My new server has, of course, the same kernel, the same SLURM user id, and it can munge/unmunge as it should. First I drain all the nodes. Like this:

for node in node{101..190} node20{1..9}; do
    echo $node
    scontrol update nodename=$node state=DRAIN reason="move"
done

Then I go to the new master node, update the slurm configuration file (slurm.conf) parameters ControlMachine and BackupController (if you have one) so that they point to the new name(s), and try to start only the SLURM controller on that master node. Since I didn’t install the master before, the output reads:

## > systemctl start slurmcltd
Failed to start slurmcltd.service: Unit not found.

Let’s install as much SLURM as possible with yum (yum install slurm*) and try again. Unfortunately, the daemon unit is still not found. The error message is the same. Is it there?

## > ls /usr/sbin/slurm*
## > ls /usr/lib/systemd/system/slurm*

So the binary and the service are in place. What’s going on? We start it with the “full” (and this time correctly spelled: slurmctld, not slurmcltd) name.

## > systemctl start slurmctld.service
## > systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/usr/lib/systemd/system/slurmctld.service;
disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since XXX
Process: 13001
ExecStart=/usr/sbin/slurmctld -D $SLURMCTLD_OPTIONS
(code=exited, status=1/FAILURE)
Main PID: 13001 (code=exited, status=1/FAILURE)
XXX systemd[1]: Started Slurm controller daemon.
XXX systemd[1]: slurmctld.service:
main process exited, code=exited, status=1/FAILURE
XXX systemd[1]: Unit slurmctld.service entered failed state.
XXX systemd[1]: slurmctld.service failed.

Time to look at the logs. Or should I have done that from the beginning? I have the slurm controller daemon log defined in slurm.conf. It points to this one:

/var/log ## > more SlurmctldLogFile.log
XXX debug: Log file re-opened
XXX debug: sched: slurmctld starting
XXX fatal: mkdir(/var/spool/slurm/save): Permission denied

We go to /var/spool/slurm/ and check that the folder save is there. It isn’t! I make it, assign ownership, and try again.

/var/spool/slurm ## > mkdir save
/var/spool/slurm ## > chown slurm: save/

Now it runs. And we can submit and so on. And I can happily switch off the old ControlMachine and BackupController. I hope 🤞.
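Once satisfied, don’t forget the mirror of the drain loop from the beginning, to put the nodes back to work. A dry-run sketch (echo only; remove the echo to apply it for real):

```shell
# Dry run: print the resume commands for the previously drained nodes.
for node in node{101..190} node20{1..9}; do
  echo scontrol update nodename=$node state=RESUME
done
```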

SLURM job listed on squeue as running by “nobody”

Back to work. Unfortunately I can’t live (yet?) only in my dreams. Also, I need to wake up to go to the office, despite the pandemic. Which is becoming endemic. Whatever. So I have this new AD user (alias newuser) who so far didn’t perform any computing, but is starting to play with our SLURM cluster. To see how a new user is performing, I tend to go to the accounting node and check their status. To my surprise, it looks similar to this:

root@accounting ~ ## > squeue

....
84534 one JobA aduser R 13:19:31 4 n1,n2,n3
84538 one JobB nobody R  6:22:58 1 n4

Well, definitely I don’t have a nobody user. And I’m not alone with the issue. It seems to work anyway, so I’m tempted to leave it, but I’m curious. Where’s the problem? Is it in the account management? At the very beginning I was adding users by hand, one by one; later on, I’ve used AD integration. Anyway, I’m not installing SLURM all the time, so I don’t remember how I did the user integration. I then look for a user-updating tool, just to find this, which gives me the hint. I know the AD name. If the user is “visible” on the UNIX system, id should report it. Note the following:

root@accounting ~ ## > id nobody
uid=99(nobody) gid=99(nobody) groups=99(nobody)
root@accounting ~ ## > id newuser
id: newuser: no such user

So there’s a nobody but not a newuser. But I know newuser is nobody. What kind of dark magic is this? When you don’t know someone, the best thing is… to google them 😁😁. But when a computer doesn’t know someone, it has a problem with the keytab or with sssd. First I check /etc/sssd/sssd.conf on accounting, to be sure it has no errors (it had), then I create a new kerberos key (or copy the one in backup) and restart the sssd service. After the last action, nobody becomes the newuser. Or something like that. You know what I mean 😉. Have a nice Monday, everyone!
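For reference, the part of /etc/sssd/sssd.conf that must be right for AD lookups is the domain section. A minimal sketch with a placeholder realm – validate yours with sssctl config-check, and clear stale negative cache entries with sss_cache -E after fixing it:

```ini
# /etc/sssd/sssd.conf (sketch; example.com is a placeholder realm)
[sssd]
services = nss, pam
domains = example.com

[domain/example.com]
id_provider = ad
access_provider = ad
```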

sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host

Yeah, this is the kind of title that everyone loves. I was trying to get the slurm accounting information as a user when I encountered this error. I didn’t perform the last update, so I didn’t check that everything was working. I only checked that people were using the queue and that there was a certain server load. Here you have the full command and the error:

user@lognode $ > sacct -S 2020-02-01 --

sacct: error: slurm_persist_conn_open_without_init:
failed to open persistent connection to host:XXX:6819:
Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused

As root, I go to where the SLURM database daemon is supposed to be running. systemctl status slurmdbd shows that it failed. Is the database daemon running? Yes it is: systemctl status mariadb.service shows active (running). What are the logs saying? Actually, there’s no log entry from slurmdbd since the slurm update. I remove/rename my database log and try to restart the database daemon slurmdbd, but the log doesn’t register any new activity. What is going on? Let’s run the daemon manually, like this:

/usr/sbin/slurmdbd -Dvv

This seems to be more meaningful. The message says the slurm database configuration file slurmdbd.conf is not owned by slurm (root != slurm is the colourful message) and doesn’t have the right permissions. We do as suggested:

chown slurm:slurm slurmdbd.conf
chmod 600 slurmdbd.conf
systemctl start slurmdbd

After that, my slurmdbd daemon works just fine. And I get my accounting info. And my database logs. And I can check my previous jobs. This is a sample output, formatted, and blurred of course, so that I don’t reveal relevant information 🙂

User JobID JobName Partition NodeList State     Start End
user 1234  test.sh p.main    node001  COMPLETED date1 date2
user 1233  test.sh p.main    node001  CANCELLED date1 date2
user 1232  test.sh p.main    node001  FAILED    date1 date2
user 1231  test.sh p.main    node001  CANCELLED date1 date2
user 1230  test.sh p.main    node001  COMPLETED date1 date2

And so on and so on and so on. It’s hard to fix something if you don’t know if it works 🙂

Error: Unable to register: Zero Bytes were transmitted or received on Slurm 20.02.4 CentOS 7.9

This is my second post after touching the slurm configuration to add a few new nodes. I have already prepared scripts that do most of the install. Precisely because of that, I ended up with the error in the title. After a new install as described in my previous post, the slurmd daemon runs on the node, but the login node is unable to communicate with it.

I checked it all: passwordless ssh, configuration file, library versions, ntp service (time stamps), and rebooting both the login node and the troublesome node, but the error still appears in my node daemon log. Careful monitoring of the daemon log using tailf gives this result:

root@organic ## > tailf /var/log/MySlurmDaemonLogFile.log
[2021-02-03T11:06:49.117] error: Unable to register: 
Zero Bytes were transmitted or received
[2021-02-03T11:06:49.392] error: 
Munge decode failed: Invalid credential
[2021-02-03T11:06:49.392] ENCODED: Wed Dec 31 19:00:00 1969
[2021-02-03T11:06:49.392] DECODED: Wed Dec 31 19:00:00 1969
[2021-02-03T11:06:49.392] error: 
REQUEST_NODE_REGISTRATION_STATUS has authentication error: 
Invalid authentication credential
[2021-02-03T11:06:49.392] error: 
Protocol authentication error
[2021-02-03T11:06:49.403] error: 
service_connection: slurm_receive_msg: 
Protocol authentication error

This time I didn’t edit the date, since I believe it makes no sense to hide it. It looks like a munge problem. I perform a munge/unmunge test from the node, and indeed the unmunge of a remote credential fails. How to fix it? Don’t rush to uninstall things! Very simple.
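The munge/unmunge test I mean is the usual round trip. A dry-run sketch (echo only, with a made-up host name; drop the echoes to run it for real – the cross-host leg fails exactly when the munge keys differ):

```shell
# Dry run: the two credential tests, printed instead of executed.
echo 'munge -n | unmunge'            # local round trip
echo 'munge -n | ssh login unmunge'  # cross-host: fails on key mismatch
```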

root@organic ## > rsync -av root@login:/etc/munge/ 
/etc/munge/ --delete-after --progress
receiving file list ... 
2 files to consider
          1,024 100% 1000.00kB/s    0:00:00 (xfr#1, to-chk=0/2)
sent 57 bytes  received 1,156 bytes  808.67 bytes/sec
total size is 1,024  speedup is 0.84
root@organic ## > systemctl start munge

I believe you don’t need more explanations: after that, the new organic node will report itself as idle to the queue. No more reboots, no more service restarts. Problem solved. I hope!

Slurm 20.02.4 install and installing errors on CentOS 7.9

If you don’t know SLURM (Simple Linux Utility for Resource Management), you should. The 20.02 of the title is not a recent version (the latest slurm at this moment is 20.11.13), but I need to install it on my newly added slurm nodes. Why a post? I have saved the RPMs from the previous installation, but the install on the new guys doesn’t finish, due to an infiniband library error I believe, which I don’t want to solve at this point, since the infiniband fabric is not yet in place. Or maybe the error comes from the fact that I built the packages for CentOS 7.7, I don’t know. I decide then to build the RPMs again. First we need to install the missing packages:

yum install python3 readline-devel 
perl pam-devel perl-ExtUtils\* 
mariadb-server mariadb-devel

Now we just follow the instructions.

  1. rpmbuild -ta slurm*.tar.bz2
  2. rpm --install <the rpm files>
  3. systemctl enable slurmd
  4. systemctl start slurmd

I don’t want to insult your intelligence by extending the tutorial. Run the rpm install in the place where the rpm packages end up. The official documentation is here. The non-official one is here. Some errors (for the record) and how to fix them:

Processing files: slurm-slurmdbd-20.02.4-1.el7.x86_64
error: File not found: /root/rpmbuild/BUILDROOT/slurm-20.02.4-1.el7.x86_64/usr/lib64/slurm/accounting_storage_mysql.so

RPM build errors:
    File not found: /root/rpmbuild/BUILDROOT/slurm-20.02.4-1.el7.x86_64/usr/lib64/slurm/accounting_storage_mysql.so

FIX: install mariadb-server mariadb-devel. Maybe you forgot to do it in the previous step?

error: Failed build dependencies:
	perl(ExtUtils::MakeMaker) is needed by slurm-20.02.4-1.el7.x86_64

FIX: install perl-ExtUtils\*. Again, read above.

Once more, just a reminder: this is my log in addition to a blog, and I like to leave references to my struggles. I’m happy if it helps you, but that’s not the point 🙂