Slurm on CentOS 7

So it’s time to go for the cluster configuration on the M630 blades. I’m going to follow the guide found on the slothparadise blog. After a clean kickstart, I do only this on my nodes:

[root@node104 ~]# vi /etc/hosts
--> add the IP and node name (of the current node only)
[root@node104 ~]# systemctl restart NetworkManager.service 
[root@node104 ~]# hostnamectl status
 Static hostname: node104.mydomain.edu
 Icon name: computer-server
 Chassis: server
 Machine ID: e0f48273519949d49232fd5678703314
 Boot ID: 895f584a936a4e3aadb4444d48a6ddfc
 Operating System: CentOS Linux 7 (Core)
 CPE OS Name: cpe:/o:centos:centos:7
 Kernel: Linux 3.10.0-327.el7.x86_64
 Architecture: x86-64
[root@node104 ~]# more /etc/resolv.conf
# Generated by NetworkManager
search mydomain.edu
nameserver 10.XXX.XXX.XXX
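
For reference, the entry I add to /etc/hosts is just the node’s own IP and names, something like this (the IP here is a placeholder):

10.XXX.XXX.104   node104.mydomain.edu   node104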

Now we go for the login node configuration. First I update all machines and disable SELinux. Our login node (beta) will also run the Slurm database daemon (slurmdbd), so we install MariaDB:

yum install mariadb-server mariadb-devel -y
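
The database itself gets configured later on, but I already make sure the service is enabled and running (plain systemd, nothing Slurm-specific yet):

systemctl enable mariadb
systemctl start mariadb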

For all the nodes, before installing Slurm or Munge, we need the same users with the same UIDs. Here we use UIDs different from the ones in the guide, because one of them is already allocated:

[root@node105 ~]# export MUNGEUSER=991
[root@node105 ~]# groupadd -g $MUNGEUSER munge
[root@node105 ~]# useradd -m -c "MUNGE Uid 'N' Gid Emporium" 
-d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
[root@node105 ~]# grep '992' /etc/passwd
chrony:x:994:992::/var/lib/chrony:/sbin/nologin
[root@node105 ~]# grep '990' /etc/passwd
[root@node105 ~]# export SLURMUSER=990
[root@node105 ~]# groupadd -g $SLURMUSER slurm
[root@node105 ~]# useradd -m -c "SLURM workload manager" 
-d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm
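
Since the whole point is that these UIDs and GIDs match everywhere, a quick check on every machine doesn’t hurt; a minimal sketch, using the node names from this post:

id munge
id slurm
ssh node105 'id munge; id slurm'
--> every machine should report uid 991 for munge and uid 990 for slurm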

On the login node and on the compute nodes we install munge. On the login node we create a munge key and copy it to the nodes:

yum install epel-release
yum install munge munge-libs munge-devel -y
yum install rng-tools -y
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
 * base: mirror.23media.de
 * epel: mirror.23media.de
 * extras: mirror.23media.de
 * updates: mirror.23media.de
Package rng-tools-5-7.el7.x86_64 already installed and latest version
Nothing to do
 rngd -r /dev/urandom
[root@beta ~]# /usr/sbin/create-munge-key -r
Please type on the keyboard, echo move your mouse,
utilize the disks. This gives the random number generator
a better chance to gain enough entropy.
Generating a pseudo-random key using /dev/random completed.
[root@beta ~]# dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
1024+0 records in
1024+0 records out
1024 bytes (1.0 kB) copied, 0.00668166 s, 153 kB/s
[root@beta ~]# chown munge: /etc/munge/munge.key
[root@beta ~]# chmod 400 /etc/munge/munge.key
--> now install munge on the nodes 
[root@beta ~]# scp /etc/munge/munge.key \
  root@node105:/etc/munge/munge.key
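
With several blades, a small loop saves some typing; just a sketch, assuming root ssh access and the node naming used in this post:

for n in node104 node105; do
  scp /etc/munge/munge.key root@$n:/etc/munge/munge.key
  ssh root@$n 'chown munge: /etc/munge/munge.key && chmod 400 /etc/munge/munge.key'
done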

Now we enable and start it (on the nodes and on the login node) and test it from the login node:

[root@beta ~]# chown -R munge: /etc/munge/ /var/log/munge/
[root@beta ~]# chmod 0700 /etc/munge/ /var/log/munge/
[root@beta ~]# systemctl enable munge
Created symlink from 
/etc/systemd/system/multi-user.target.wants/munge.service 
to /usr/lib/systemd/system/munge.service.
[root@beta ~]# systemctl start munge
[root@beta ~]# munge -n
MUNGE:here-some-random-chain-of-numbers-and-letters
[root@beta ~]# munge -n | unmunge
STATUS: Success (0)
ENCODE_HOST: beta (10.XXX.XX.XX)
ENCODE_TIME: 2016-08-22 14:46:42 +0200 (1471870002)
DECODE_TIME: 2016-08-22 14:46:42 +0200 (1471870002)
TTL: 300
CIPHER: aes128 (4)
MAC: sha1 (3)
ZIP: none (0)
UID: root (0)
GID: root (0)
LENGTH: 0
[root@beta ~]# munge -n | ssh node105 unmunge
root@node105's password: 
STATUS: Success (0)
ENCODE_HOST: beta.mydomain.edu (10.XXX.XXX.XXX)
ENCODE_TIME: 2016-08-22 14:47:00 +0200 (1471870020)
DECODE_TIME: 2016-08-22 14:47:09 +0200 (1471870029)
TTL: 300
CIPHER: aes128 (4)
MAC: sha1 (3)
ZIP: none (0)
UID: root (0)
GID: root (0)
LENGTH: 0
[root@beta ~]# remunge
2016-08-22 14:47:17 Spawning 1 thread for encoding
2016-08-22 14:47:17 Processing credentials for 1 second
2016-08-22 14:47:18 Processed 5193 credentials in 1.000s (5191 creds/sec)
[root@beta ~]#
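
The reverse direction can be tested as well, encoding on a node and decoding on the login node; for example:

[root@beta ~]# ssh node105 munge -n | unmunge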

So munge looks fine. We go for the installation of slurm itself on the login node:

yum install openssl openssl-devel pam-devel \
 numactl numactl-devel hwloc hwloc-devel \
 lua lua-devel readline-devel rrdtool-devel \
 ncurses-devel man2html libibmad libibumad -y
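
The source tarball comes from SchedMD; I keep it under /root/downloads. Roughly like this (the exact download URL may have changed since this release):

cd /root/downloads
wget https://download.schedmd.com/slurm/slurm-16.05.4.tar.bz2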

Now I build the rpms from the tarball:

[root@beta downloads]# rpmbuild -ta slurm-16.05.4.tar.bz2
error: Failed build dependencies:
 perl(ExtUtils::MakeMaker) is needed by slurm-16.05.4-1.el7.centos.x86_64

To correct that, I install cpanm:

[root@beta downloads]# yum install cpanm*
... some yum dump here
Transaction Summary
======================================================
Install 1 Package (+59 Dependent packages)
... and some more yum stuff here
[root@beta downloads]# rpmbuild -ta slurm-16.05.4.tar.bz2
Executing(%prep): /bin/sh -e /var/tmp/rpm-tmp.xJn4oe
+ umask 022
+ cd /root/rpmbuild/BUILD
+ cd /root/rpmbuild/BUILD
+ rm -rf slurm-16.05.4
+ /usr/bin/bzip2 -dc /root/downloads/slurm-16.05.4.tar.bz2

… it takes some time. Then I copy the resulting packages to a safe place and install the rpms. Note that dependency resolution fails if you use wildcards for the install, so list the packages explicitly:

[root@beta slurm-rpms]# yum --nogpgcheck localinstall \
 slurm-16.05.4-1.el7.centos.x86_64.rpm \
 slurm-devel-16.05.4-1.el7.centos.x86_64.rpm \
 slurm-munge-16.05.4-1.el7.centos.x86_64.rpm \
 slurm-openlava-16.05.4-1.el7.centos.x86_64.rpm \
 slurm-pam_slurm-16.05.4-1.el7.centos.x86_64.rpm \
 slurm-perlapi-16.05.4-1.el7.centos.x86_64.rpm \
 slurm-plugins-16.05.4-1.el7.centos.x86_64.rpm \
 slurm-seff-16.05.4-1.el7.centos.x86_64.rpm \
 slurm-sjobexit-16.05.4-1.el7.centos.x86_64.rpm \
 slurm-sjstat-16.05.4-1.el7.centos.x86_64.rpm \
 slurm-slurmdbd-16.05.4-1.el7.centos.x86_64.rpm \
 slurm-slurmdb-direct-16.05.4-1.el7.centos.x86_64.rpm \
 slurm-sql-16.05.4-1.el7.centos.x86_64.rpm \
 slurm-torque-16.05.4-1.el7.centos.x86_64.rpm
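
The compute nodes need the slurm packages too, so I copy the rpms over and repeat the same localinstall there (again listing the packages explicitly, for the reason above); a sketch:

ssh root@node105 mkdir -p /root/slurm-rpms
scp *.rpm root@node105:/root/slurm-rpms/
--> then run the same yum --nogpgcheck localinstall line on each node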

And we’re close to the end. Now we generate the configuration file using the online configurator tool (configurator.html on the Slurm documentation site). We need to store the resulting slurm.conf in /etc/slurm/, copy it around, and test the configuration.

[root@beta log]# touch slurm.log
[root@beta log]# touch SlurmctldLogFile.log
[root@beta log]# touch SlurmdLogFile.log
[root@beta slurm]# scp slurm.conf root@node105:/etc/slurm/
--> and so on for all the nodes
[root@beta ~]# mkdir /var/spool/slurmctld
[root@beta ~]# chown slurm: /var/spool/slurmctld
[root@beta ~]# chmod 755 /var/spool/slurmctld
[root@beta ~]# touch /var/log/slurmctld.log
[root@beta ~]# chown slurm: /var/log/slurmctld.log
--> do that also for the nodes
[root@beta ~]# touch /var/log/slurm_jobacct.log \
 /var/log/slurm_jobcomp.log
[root@beta ~]# chown slurm: /var/log/slurm_jobacct.log \
 /var/log/slurm_jobcomp.log
[root@beta ~]# slurmd -C
ClusterName=sbbeta NodeName=beta 
CPUs=8 Boards=1 SocketsPerBoard=2 
CoresPerSocket=4 ThreadsPerCore=1 
RealMemory=23934 TmpDisk=150101
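
By the way, the node and partition definitions in slurm.conf should match what slurmd -C reports on the compute nodes. Purely as an illustration (names and hardware values are placeholders, take the real ones from your own slurmd -C output):

ControlMachine=beta
NodeName=node[104-105] CPUs=8 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=23934 State=UNKNOWN
PartitionName=debug Nodes=node[104-105] Default=YES MaxTime=INFINITE State=UP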

So it looks fine. We stop the firewall everywhere

[root@beta ~]# systemctl stop firewalld.service
[root@beta ~]# systemctl disable firewalld.service

And start the slurm controller

root@beta /var/run ## > systemctl start slurmctld.service
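
If it comes up without complaints, it is worth enabling it at boot as well:

systemctl enable slurmctld.service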

Instead of a clean start, we can get one of these errors on any of the machines:

Job for slurmctld.service failed because a configured resource limit 
was exceeded. See "systemctl status slurmctld.service" 
and "journalctl -xe" for details.

Or it can start but fail immediately afterwards:

systemctl status slurmctld.service 
● slurmctld.service - Slurm controller daemon
 Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
 Active: failed (Result: exit-code) since Thu 2016-08-25 15:39:50 CEST; 2s ago
 Process: 3070 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 3072 (code=exited, status=1/FAILURE)
'date' 'machine' systemd[1]: Starting Slurm controller daemon...
'date' 'machine' systemd[1]: Failed to read PID from file /var/run/slurmctld.pid: Invalid argument
'date' 'machine' systemd[1]: Started Slurm controller daemon.
'date' 'machine' systemd[1]: slurmctld.service: main process exited, code=exited, status=1/FAILURE
'date' 'machine' systemd[1]: Unit slurmctld.service entered failed state.
'date' 'machine' systemd[1]: slurmctld.service failed.

The solution I found is to create the PID file manually. The permissions end up looking like this:

-rw-r--r-- 1 slurm root 5 Aug 25 15:41 slurmctld.pid
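
In practice that means something like this, assuming the default /var/run/slurmctld.pid location:

touch /var/run/slurmctld.pid
chown slurm /var/run/slurmctld.pid
systemctl restart slurmctld.service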

If it works, we enable and start slurmd everywhere:

systemctl enable slurmd.service
systemctl start slurmd.service
systemctl status slurmd.service
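
Once the controller and the slurmd daemons are up, a quick test from the login node shows whether they actually talk to each other:

sinfo
srun -N1 hostname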

There you have it! Well, almost: after some initial tuning we ran into a problem again. From the log:

root@beta ~ ## > tail /var/log/SlurmctldLogFile.log
[date.eventnumber] fatal: Invalid SelectTypeParameters: NONE (0), You need at least CR_(CPU|CORE|SOCKET)*

So we changed SelectTypeParameters in the /etc/slurm/slurm.conf file and restarted the daemon.
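
I won’t pretend this is letter for letter what ended up in our file, but a valid combination that makes that error go away looks like this:

SelectType=select/cons_res
SelectTypeParameters=CR_Core
--> then systemctl restart slurmctld.service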
