So it’s time to go for the cluster configuration on the M630 blades. I’m going to follow the guide as found on the slothparadise blog. After a clean kickstart, I do only this on my nodes:
[root@node104 ~]# vi /etc/hosts   --> add IP and node name (only of the current one)
[root@node104 ~]# systemctl restart NetworkManager.service
[root@node104 ~]# hostnamectl status
   Static hostname: node104.mydomain.edu
         Icon name: computer-server
           Chassis: server
        Machine ID: e0f48273519949d49232fd5678703314
           Boot ID: 895f584a936a4e3aadb4444d48a6ddfc
  Operating System: CentOS Linux 7 (Core)
       CPE OS Name: cpe:/o:centos:centos:7
            Kernel: Linux 3.10.0-327.el7.x86_64
      Architecture: x86-64
[root@node104 ~]# more /etc/resolv.conf
# Generated by NetworkManager
search mydomain.edu
nameserver 10.XXX.XXX.XXX
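For reference, the line added to /etc/hosts looks something like this (the address below is a placeholder in the same redacted style as above, not a real one):

# excerpt of /etc/hosts on node104 -- replace with the node's real IP and name
10.XXX.XXX.104   node104.mydomain.edu   node104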
Now we go for the login node configuration. First I update all machines and disable SELinux. Our login node (beta) will also host the Slurm database (slurmdbd), so we install MariaDB:
yum install mariadb-server mariadb-devel -y
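The post does not show it here, but MariaDB has to be running before slurmdbd can use it. A minimal sketch, where the database name, user and password are my assumptions and must match whatever you later put in slurmdbd.conf:

systemctl enable mariadb
systemctl start mariadb
# hypothetical accounting database and grant for the slurm user
mysql -u root -e "CREATE DATABASE slurm_acct_db;"
mysql -u root -e "GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost' IDENTIFIED BY 'some_password';"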
For all the nodes, before installing Slurm or Munge, we need the same users with the same UIDs. Here we use UIDs that differ from the guide because one of them is already taken (see the grep below):
[root@node105 ~]# export MUNGEUSER=991
[root@node105 ~]# groupadd -g $MUNGEUSER munge
[root@node105 ~]# useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
[root@node105 ~]# grep '992' /etc/passwd
chrony:x:994:992::/var/lib/chrony:/sbin/nologin
[root@node105 ~]# grep '990' /etc/passwd
[root@node105 ~]# export SLURMUSER=990
[root@node105 ~]# groupadd -g $SLURMUSER slurm
[root@node105 ~]# useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm
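Since every node must end up with identical users and UIDs, this can also be scripted over ssh. A sketch, with a hypothetical node list and the comments dropped for brevity:

# create the same munge (991) and slurm (990) users on every node
for host in node104 node105; do
  ssh root@$host groupadd -g 991 munge
  ssh root@$host useradd -m -d /var/lib/munge -u 991 -g munge -s /sbin/nologin munge
  ssh root@$host groupadd -g 990 slurm
  ssh root@$host useradd -m -d /var/lib/slurm -u 990 -g slurm -s /bin/bash slurm
done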
On the login node and on the compute nodes we install munge. On the login node we create a munge key and copy it to the nodes:
yum install epel-release
yum install munge munge-libs munge-devel -y
yum install rng-tools -y
yum install rng-tools -y
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
 * base: mirror.23media.de
 * epel: mirror.23media.de
 * extras: mirror.23media.de
 * updates: mirror.23media.de
Package rng-tools-5-7.el7.x86_64 already installed and latest version
Nothing to do
rngd -r /dev/urandom
[root@beta ~]# /usr/sbin/create-munge-key -r
Please type on the keyboard, echo move your mouse, utilize the disks.
This gives the random number generator a better chance to gain enough entropy.
Generating a pseudo-random key using /dev/random completed.
[root@beta ~]# dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
1024+0 records in
1024+0 records out
1024 bytes (1.0 kB) copied, 0.00668166 s, 153 kB/s
[root@beta ~]# chown munge: /etc/munge/munge.key
[root@beta ~]# chmod 400 /etc/munge/munge.key
--> now install munge on the nodes
[root@beta ~]# scp /etc/munge/munge.key root@node105:/etc/munge/munge.key
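With more than one node, copying the key and fixing its ownership can be looped. A sketch, again with a hypothetical node list:

for host in node104 node105; do
  scp /etc/munge/munge.key root@$host:/etc/munge/munge.key
  ssh root@$host "chown munge: /etc/munge/munge.key && chmod 400 /etc/munge/munge.key"
done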
Now we enable and start it (on the nodes and on the login node) and test it from the login node:
[root@beta ~]# chown -R munge: /etc/munge/ /var/log/munge/
[root@beta ~]# chmod 0700 /etc/munge/ /var/log/munge/
[root@beta ~]# systemctl enable munge
Created symlink from /etc/systemd/system/multi-user.target.wants/munge.service to /usr/lib/systemd/system/munge.service.
[root@beta ~]# systemctl start munge
[root@beta ~]# munge -n
MUNGE:here-some-random-chain-of-numbers-and-letters
[root@beta ~]# munge -n | unmunge
STATUS:           Success (0)
ENCODE_HOST:      beta (10.XXX.XX.XX)
ENCODE_TIME:      2016-08-22 14:46:42 +0200 (1471870002)
DECODE_TIME:      2016-08-22 14:46:42 +0200 (1471870002)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha1 (3)
ZIP:              none (0)
UID:              root (0)
GID:              root (0)
LENGTH:           0
[root@beta ~]# munge -n | ssh node105 unmunge
root@node105's password:
STATUS:           Success (0)
ENCODE_HOST:      beta.mydomain.edu (10.XXX.XXX.XXX)
ENCODE_TIME:      2016-08-22 14:47:00 +0200 (1471870020)
DECODE_TIME:      2016-08-22 14:47:09 +0200 (1471870029)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha1 (3)
ZIP:              none (0)
UID:              root (0)
GID:              root (0)
LENGTH:           0
[root@beta ~]# remunge
2016-08-22 14:47:17 Spawning 1 thread for encoding
2016-08-22 14:47:17 Processing credentials for 1 second
2016-08-22 14:47:18 Processed 5193 credentials in 1.000s (5191 creds/sec)
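The same credential test can be run against every node in one go. A small sketch (node names are placeholders):

for host in node104 node105; do
  echo "== $host =="
  munge -n | ssh root@$host unmunge | grep STATUS
done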
So munge looks fine. We go for the installation of slurm itself on the login node:
yum install openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel man2html libibmad libibumad -y
First I need to build the RPMs:
[root@beta downloads]# rpmbuild -ta slurm-16.05.4.tar.bz2
error: Failed build dependencies:
        perl(ExtUtils::MakeMaker) is needed by slurm-16.05.4-1.el7.centos.x86_64
To correct that, I install cpanm:
[root@beta downloads]# yum install cpanm*
... some yum output here
Transaction Summary
======================================================
Install  1 Package (+59 Dependent packages)
... and some more yum output here
[root@beta downloads]# rpmbuild -ta slurm-16.05.4.tar.bz2
Executing(%prep): /bin/sh -e /var/tmp/rpm-tmp.xJn4oe
+ umask 022
+ cd /root/rpmbuild/BUILD
+ cd /root/rpmbuild/BUILD
+ rm -rf slurm-16.05.4
+ /usr/bin/bzip2 -dc /root/downloads/slurm-16.05.4.tar.bz
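The build takes some time. If it succeeds, rpmbuild leaves the packages under its standard output tree; a sketch of collecting them in one place (the directory name is my choice, matching the prompt in the install command below):

ls /root/rpmbuild/RPMS/x86_64/
mkdir -p /root/slurm-rpms
cp /root/rpmbuild/RPMS/x86_64/slurm-*.rpm /root/slurm-rpms/
cd /root/slurm-rpms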
Then I install the RPMs from there. Note that dependency resolution fails if you use wildcards during the installation, so every package has to be listed explicitly:
[root@beta slurm-rpms]# yum --nogpgcheck localinstall \
    slurm-16.05.4-1.el7.centos.x86_64.rpm \
    slurm-devel-16.05.4-1.el7.centos.x86_64.rpm \
    slurm-munge-16.05.4-1.el7.centos.x86_64.rpm \
    slurm-openlava-16.05.4-1.el7.centos.x86_64.rpm \
    slurm-pam_slurm-16.05.4-1.el7.centos.x86_64.rpm \
    slurm-perlapi-16.05.4-1.el7.centos.x86_64.rpm \
    slurm-plugins-16.05.4-1.el7.centos.x86_64.rpm \
    slurm-seff-16.05.4-1.el7.centos.x86_64.rpm \
    slurm-sjobexit-16.05.4-1.el7.centos.x86_64.rpm \
    slurm-sjstat-16.05.4-1.el7.centos.x86_64.rpm \
    slurm-slurmdbd-16.05.4-1.el7.centos.x86_64.rpm \
    slurm-slurmdb-direct-16.05.4-1.el7.centos.x86_64.rpm \
    slurm-sql-16.05.4-1.el7.centos.x86_64.rpm \
    slurm-torque-16.05.4-1.el7.centos.x86_64.rpm
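The compute nodes also need the Slurm packages before slurmd can run on them. The post does not show that step here, so this is only a hedged sketch: node names are placeholders and the subset of packages you actually need depends on your setup.

for host in node104 node105; do
  ssh root@$host mkdir -p /root/slurm-rpms
  scp /root/slurm-rpms/*.rpm root@$host:/root/slurm-rpms/
  # as noted above, list the rpm files explicitly rather than using a wildcard in localinstall
  ssh root@$host "cd /root/slurm-rpms && yum --nogpgcheck localinstall -y slurm-16.05.4-1.el7.centos.x86_64.rpm slurm-munge-16.05.4-1.el7.centos.x86_64.rpm slurm-plugins-16.05.4-1.el7.centos.x86_64.rpm"
done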
And we’re close to the end. Now we generate the configuration file using the online tool provided here. We need to store the resulting slurm.conf in /etc/slurm/, copy it around, and test the configuration.
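To give an idea of what comes out of the configurator, here is a stripped-down sketch of the relevant lines. The node names and the partition are placeholders, and the hardware figures should be taken from running slurmd -C on the nodes themselves (the slurmd -C output for beta is shown below):

ControlMachine=beta
ClusterName=sbbeta
SlurmUser=slurm
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdLogFile=/var/log/slurmd.log
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
NodeName=node[104-105] CPUs=8 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=23934 State=UNKNOWN
PartitionName=debug Nodes=node[104-105] Default=YES MaxTime=INFINITE State=UP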
[root@beta log]# touch slurm.log
[root@beta log]# touch SlurmctldLogFile.log
[root@beta log]# touch SlurmdLogFile.log
[root@beta slurm]# scp slurm.conf root@sbnode105:/etc/slurm/
--> and so on for all the nodes
[root@beta ~]# mkdir /var/spool/slurmctld
[root@beta ~]# chown slurm: /var/spool/slurmctld
[root@beta ~]# chmod 755 /var/spool/slurmctld
[root@beta ~]# touch /var/log/slurmctld.log
[root@beta ~]# chown slurm: /var/log/slurmctld.log
--> do that also for the nodes
[root@beta ~]# touch /var/log/slurm_jobacct.log /var/log/slurm_jobcomp.log
[root@beta ~]# chown slurm: /var/log/slurm_jobacct.log /var/log/slurm_jobcomp.log
[root@beta ~]# slurmd -C
ClusterName=sbbeta NodeName=beta CPUs=8 Boards=1 SocketsPerBoard=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=23934 TmpDisk=150101
So it looks fine. We stop the firewall everywhere:
[root@beta ~]# systemctl stop firewalld.service
[root@beta ~]# systemctl disable firewalld.service
And start the Slurm controller:
root@beta /var/run ## > systemctl start slurmctld.service
On any of the machines we can get one of these errors:
Job for slurmctld.service failed because a configured resource limit was exceeded. See "systemctl status slurmctld.service" and "journalctl -xe" for details.
Or it can start but fail immediately afterwards:
systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since XXX CEST; 2s ago
  Process: 3070 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 3072 (code=exited, status=1/FAILURE)
'date' 'machine' systemd[1]: Starting Slurm controller daemon...
'date' 'machine' systemd[1]: Failed to read PID from file /var/run/slurmctld.pid: Invalid argument
'date' 'machine' systemd[1]: Started Slurm controller daemon.
'date' 'machine' systemd[1]: slurmctld.service: main process exited, code=exited, status=1/FAILURE
'date' 'machine' systemd[1]: Unit slurmctld.service entered failed state.
'date' 'machine' systemd[1]: slurmctld.service failed.
The solution I found is to create the PID file manually. The permissions look like this:
-rw-r--r-- 1 slurm root 5 Aug 25 15:41 slurmctld.pid
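A sketch of creating it by hand to match those permissions. Per the comment exchange at the end of the post, the exact number written into the file does not seem to matter, so the value below is just a placeholder:

touch /var/run/slurmctld.pid
chown slurm:root /var/run/slurmctld.pid
chmod 644 /var/run/slurmctld.pid
echo 4390 > /var/run/slurmctld.pid
systemctl start slurmctld.service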
If it works, we start slurmd everywhere:
systemctl enable slurmd.service
systemctl start slurmd.service
systemctl status slurmd.service
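A quick way to check from the login node that the controller and the node daemons see each other; these are standard Slurm client commands, and the output will obviously differ on your cluster:

sinfo
scontrol show nodes
srun -N1 hostname   # hypothetical smoke test: run hostname on one node through Slurm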
There you have it! After some initial tuning, we ran into a problem again. From the log:
root@beta ~ ## > tail /var/log/SlurmctldLogFile.log
[date.eventnumber] fatal: Invalid SelectTypeParameters: NONE (0), You need at least CR_(CPU|CORE|SOCKET)*
So we changed SelectTypeParameters in the /etc/slurm/slurm.conf file and restarted the daemon.
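For reference, a combination that satisfies that check could look like the following; CR_Core is only one of the accepted values and is my assumption here, so pick whatever matches your scheduling policy:

# in /etc/slurm/slurm.conf
SelectType=select/cons_res
SelectTypeParameters=CR_Core

# then restart the controller
systemctl restart slurmctld.service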
What number do you assign to the PID when you create the PID file manually?
Currently it’s 4390, but I don’t think the exact number matters; the important thing is that the file exists. Please also reboot: I don’t remember this issue very well and I’m not sure that was the final solution. And don’t forget to check the ownership and permissions of the PID file. Let me know if it works!