So it’s time to go for the cluster configuration on the M630 blades. I’m going to follow the guide as found on the slothparadise blog. After a clean kickstart, I do only this on my nodes:
[root@node104 ~]# vi /etc/hosts   --> add IP and node name (only of the current one)
[root@node104 ~]# systemctl restart NetworkManager.service
[root@node104 ~]# hostnamectl status
   Static hostname: node104.mydomain.edu
         Icon name: computer-server
           Chassis: server
        Machine ID: e0f48273519949d49232fd5678703314
           Boot ID: 895f584a936a4e3aadb4444d48a6ddfc
  Operating System: CentOS Linux 7 (Core)
       CPE OS Name: cpe:/o:centos:centos:7
            Kernel: Linux 3.10.0-327.el7.x86_64
      Architecture: x86-64
[root@node104 ~]# more /etc/resolv.conf
# Generated by NetworkManager
search mydomain.edu
nameserver 10.XXX.XXX.XXX
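For reference, the line added to /etc/hosts looks something like this (the address below is a placeholder in the same redacted style as above, not a real one):

# excerpt of /etc/hosts on node104 -- replace with the node's real IP and name
10.XXX.XXX.104   node104.mydomain.edu   node104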
Now we go for the login node configuration. First I update all machines and disable SELinux. Our login node (beta) will also host the Slurm database (slurmdbd), so we install MariaDB:
yum install mariadb-server mariadb-devel -y
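The post does not show it here, but MariaDB has to be running before slurmdbd can use it. A minimal sketch, where the database name, user and password are my assumptions and must match whatever you later put in slurmdbd.conf:

systemctl enable mariadb
systemctl start mariadb
# hypothetical accounting database and grant for the slurm user
mysql -u root -e "CREATE DATABASE slurm_acct_db;"
mysql -u root -e "GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost' IDENTIFIED BY 'some_password';"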
For all the nodes, before installing Slurm or Munge, we need the same users with the same UIDs. Here we use UIDs that differ from the guide because one of them is already taken (see the grep below):
[root@node105 ~]# export MUNGEUSER=991
[root@node105 ~]# groupadd -g $MUNGEUSER munge
[root@node105 ~]# useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
[root@node105 ~]# grep '992' /etc/passwd
chrony:x:994:992::/var/lib/chrony:/sbin/nologin
[root@node105 ~]# grep '990' /etc/passwd
[root@node105 ~]# export SLURMUSER=990
[root@node105 ~]# groupadd -g $SLURMUSER slurm
[root@node105 ~]# useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm
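Since every node must end up with identical users and UIDs, this can also be scripted over ssh. A sketch, with a hypothetical node list and the comments dropped for brevity:

# create the same munge (991) and slurm (990) users on every node
for host in node104 node105; do
  ssh root@$host groupadd -g 991 munge
  ssh root@$host useradd -m -d /var/lib/munge -u 991 -g munge -s /sbin/nologin munge
  ssh root@$host groupadd -g 990 slurm
  ssh root@$host useradd -m -d /var/lib/slurm -u 990 -g slurm -s /bin/bash slurm
done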
On the login node and on the compute nodes we install munge. On the login node we create a munge key and copy it to the nodes:
yum install epel-release
yum install munge munge-libs munge-devel -y
yum install rng-tools -y
yum install rng-tools -y
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
 * base: mirror.23media.de
 * epel: mirror.23media.de
 * extras: mirror.23media.de
 * updates: mirror.23media.de
Package rng-tools-5-7.el7.x86_64 already installed and latest version
Nothing to do
rngd -r /dev/urandom
[root@beta ~]# /usr/sbin/create-munge-key -r
Please type on the keyboard, echo move your mouse, utilize the disks.
This gives the random number generator a better chance to gain enough entropy.
Generating a pseudo-random key using /dev/random completed.
[root@beta ~]# dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
1024+0 records in
1024+0 records out
1024 bytes (1.0 kB) copied, 0.00668166 s, 153 kB/s
[root@beta ~]# chown munge: /etc/munge/munge.key
[root@beta ~]# chmod 400 /etc/munge/munge.key
--> now install munge on the nodes
[root@beta ~]# scp /etc/munge/munge.key root@node105:/etc/munge/munge.key
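With more than one node, copying the key and fixing its ownership can be looped. A sketch, again with a hypothetical node list:

for host in node104 node105; do
  scp /etc/munge/munge.key root@$host:/etc/munge/munge.key
  ssh root@$host "chown munge: /etc/munge/munge.key && chmod 400 /etc/munge/munge.key"
done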
Now we enable and start it (on the nodes and on the login node) and test it from the login node:
[root@beta ~]# chown -R munge: /etc/munge/ /var/log/munge/
[root@beta ~]# chmod 0700 /etc/munge/ /var/log/munge/
[root@beta ~]# systemctl enable munge
Created symlink from /etc/systemd/system/multi-user.target.wants/munge.service to /usr/lib/systemd/system/munge.service.
[root@beta ~]# systemctl start munge
[root@beta ~]# munge -n
MUNGE:here-some-random-chain-of-numbers-and-letters
[root@beta ~]# munge -n | unmunge
STATUS:           Success (0)
ENCODE_HOST:      beta (10.XXX.XX.XX)
ENCODE_TIME:      2016-08-22 14:46:42 +0200 (1471870002)
DECODE_TIME:      2016-08-22 14:46:42 +0200 (1471870002)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha1 (3)
ZIP:              none (0)
UID:              root (0)
GID:              root (0)
LENGTH:           0
[root@beta ~]# munge -n | ssh node105 unmunge
root@node105's password:
STATUS:           Success (0)
ENCODE_HOST:      beta.mydomain.edu (10.XXX.XXX.XXX)
ENCODE_TIME:      2016-08-22 14:47:00 +0200 (1471870020)
DECODE_TIME:      2016-08-22 14:47:09 +0200 (1471870029)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha1 (3)
ZIP:              none (0)
UID:              root (0)
GID:              root (0)
LENGTH:           0
[root@beta ~]# remunge
2016-08-22 14:47:17 Spawning 1 thread for encoding
2016-08-22 14:47:17 Processing credentials for 1 second
2016-08-22 14:47:18 Processed 5193 credentials in 1.000s (5191 creds/sec)
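The same credential test can be run against every node in one go. A small sketch (node names are placeholders):

for host in node104 node105; do
  echo "== $host =="
  munge -n | ssh root@$host unmunge | grep STATUS
done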
So munge looks fine. We go for the installation of slurm itself on the login node:
yum install openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel man2html libibmad libibumad -y
First I need to build the RPMs:
[root@beta downloads]# rpmbuild -ta slurm-16.05.4.tar.bz2
error: Failed build dependencies:
        perl(ExtUtils::MakeMaker) is needed by slurm-16.05.4-1.el7.centos.x86_64
To correct that, I install cpanm:
[root@beta downloads]# yum install cpanm*
... some yum output here
Transaction Summary
======================================================
Install  1 Package (+59 Dependent packages)
... and some more yum output here
[root@beta downloads]# rpmbuild -ta slurm-16.05.4.tar.bz2
Executing(%prep): /bin/sh -e /var/tmp/rpm-tmp.xJn4oe
+ umask 022
+ cd /root/rpmbuild/BUILD
+ cd /root/rpmbuild/BUILD
+ rm -rf slurm-16.05.4
+ /usr/bin/bzip2 -dc /root/downloads/slurm-16.05.4.tar.bz
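The build takes some time. If it succeeds, rpmbuild leaves the packages under its standard output tree; a sketch of collecting them in one place (the directory name is my choice, matching the prompt in the install command below):

ls /root/rpmbuild/RPMS/x86_64/
mkdir -p /root/slurm-rpms
cp /root/rpmbuild/RPMS/x86_64/slurm-*.rpm /root/slurm-rpms/
cd /root/slurm-rpms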
Then I install the RPMs from there. Note that dependency resolution fails if you use wildcards during the installation, so every package has to be listed explicitly:
[root@beta slurm-rpms]# yum --nogpgcheck localinstall \
    slurm-16.05.4-1.el7.centos.x86_64.rpm \
    slurm-devel-16.05.4-1.el7.centos.x86_64.rpm \
    slurm-munge-16.05.4-1.el7.centos.x86_64.rpm \
    slurm-openlava-16.05.4-1.el7.centos.x86_64.rpm \
    slurm-pam_slurm-16.05.4-1.el7.centos.x86_64.rpm \
    slurm-perlapi-16.05.4-1.el7.centos.x86_64.rpm \
    slurm-plugins-16.05.4-1.el7.centos.x86_64.rpm \
    slurm-seff-16.05.4-1.el7.centos.x86_64.rpm \
    slurm-sjobexit-16.05.4-1.el7.centos.x86_64.rpm \
    slurm-sjstat-16.05.4-1.el7.centos.x86_64.rpm \
    slurm-slurmdbd-16.05.4-1.el7.centos.x86_64.rpm \
    slurm-slurmdb-direct-16.05.4-1.el7.centos.x86_64.rpm \
    slurm-sql-16.05.4-1.el7.centos.x86_64.rpm \
    slurm-torque-16.05.4-1.el7.centos.x86_64.rpm
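The compute nodes also need the Slurm packages before slurmd can run on them. The post does not show that step here, so this is only a hedged sketch: node names are placeholders and the subset of packages you actually need depends on your setup.

for host in node104 node105; do
  ssh root@$host mkdir -p /root/slurm-rpms
  scp /root/slurm-rpms/*.rpm root@$host:/root/slurm-rpms/
  # as noted above, list the rpm files explicitly rather than using a wildcard in localinstall
  ssh root@$host "cd /root/slurm-rpms && yum --nogpgcheck localinstall -y slurm-16.05.4-1.el7.centos.x86_64.rpm slurm-munge-16.05.4-1.el7.centos.x86_64.rpm slurm-plugins-16.05.4-1.el7.centos.x86_64.rpm"
done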
And we’re close to the end. Now we generate the configuration file using the online tool provided here. We need to store the resulting slurm.conf in /etc/slurm/, copy it around, and test the configuration.
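To give an idea of what comes out of the configurator, here is a stripped-down sketch of the relevant lines. The node names and the partition are placeholders, and the hardware figures should be taken from running slurmd -C on the nodes themselves (the slurmd -C output for beta is shown below):

ControlMachine=beta
ClusterName=sbbeta
SlurmUser=slurm
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdLogFile=/var/log/slurmd.log
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
NodeName=node[104-105] CPUs=8 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=23934 State=UNKNOWN
PartitionName=debug Nodes=node[104-105] Default=YES MaxTime=INFINITE State=UP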
[root@beta log]# touch slurm.log
[root@beta log]# touch SlurmctldLogFile.log
[root@beta log]# touch SlurmdLogFile.log
[root@beta slurm]# scp slurm.conf root@sbnode105:/etc/slurm/
--> and so on for all the nodes
[root@beta ~]# mkdir /var/spool/slurmctld
[root@beta ~]# chown slurm: /var/spool/slurmctld
[root@beta ~]# chmod 755 /var/spool/slurmctld
[root@beta ~]# touch /var/log/slurmctld.log
[root@beta ~]# chown slurm: /var/log/slurmctld.log
--> do that also for the nodes
[root@beta ~]# touch /var/log/slurm_jobacct.log /var/log/slurm_jobcomp.log
[root@beta ~]# chown slurm: /var/log/slurm_jobacct.log /var/log/slurm_jobcomp.log
[root@beta ~]# slurmd -C
ClusterName=sbbeta NodeName=beta CPUs=8 Boards=1 SocketsPerBoard=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=23934 TmpDisk=150101
So it looks fine. We stop the firewall everywhere:
[root@beta ~]# systemctl stop firewalld.service
[root@beta ~]# systemctl disable firewalld.service
And start the Slurm controller:
root@beta /var/run ## > systemctl start slurmctld.service
On any of the machines we can get one of these errors:
Job for slurmctld.service failed because a configured resource limit was exceeded. See "systemctl status slurmctld.service" and "journalctl -xe" for details.
Or it can start but fail immediately afterwards:
systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since XXX CEST; 2s ago
  Process: 3070 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 3072 (code=exited, status=1/FAILURE)
'date' 'machine' systemd[1]: Starting Slurm controller daemon...
'date' 'machine' systemd[1]: Failed to read PID from file /var/run/slurmctld.pid: Invalid argument
'date' 'machine' systemd[1]: Started Slurm controller daemon.
'date' 'machine' systemd[1]: slurmctld.service: main process exited, code=exited, status=1/FAILURE
'date' 'machine' systemd[1]: Unit slurmctld.service entered failed state.
'date' 'machine' systemd[1]: slurmctld.service failed.
The solution I found is to create the PID file manually. The permissions look like this:
-rw-r--r-- 1 slurm root 5 Aug 25 15:41 slurmctld.pid
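A sketch of creating it by hand to match those permissions. Per the comment exchange at the end of the post, the exact number written into the file does not seem to matter, so the value below is just a placeholder:

touch /var/run/slurmctld.pid
chown slurm:root /var/run/slurmctld.pid
chmod 644 /var/run/slurmctld.pid
echo 4390 > /var/run/slurmctld.pid
systemctl start slurmctld.service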
If it works, we start slurmd everywhere:
systemctl enable slurmd.service
systemctl start slurmd.service
systemctl status slurmd.service
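A quick way to check from the login node that the controller and the node daemons see each other; these are standard Slurm client commands, and the output will obviously differ on your cluster:

sinfo
scontrol show nodes
srun -N1 hostname   # hypothetical smoke test: run hostname on one node through Slurm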
There you have it! After some initial tuning, we ran into a problem again. From the log:
root@beta ~ ## > tail /var/log/SlurmctldLogFile.log
[date.eventnumber] fatal: Invalid SelectTypeParameters: NONE (0), You need at least CR_(CPU|CORE|SOCKET)*
So we changed SelectTypeParameters in the /etc/slurm/slurm.conf file and restarted the daemon.
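For reference, a combination that satisfies that check could look like the following; CR_Core is only one of the accepted values and is my assumption here, so pick whatever matches your scheduling policy:

# in /etc/slurm/slurm.conf
SelectType=select/cons_res
SelectTypeParameters=CR_Core

# then restart the controller
systemctl restart slurmctld.service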
What number do you assign to the PID when you create the PID file manually?
Currently it’s 4390, but I don’t think the exact number matters; the important thing is that the file exists. Please also reboot: I don’t remember this issue very well and I’m not sure that was the final solution. And don’t forget to check the ownership and permissions of the PID file. Let me know if it works!