Slurm on CentOS 6

Published by Andrea on

Last week I installed a Slurm cluster to manage a large set of computing nodes. Slurm is one of the most widely used job schedulers for Linux and Unix.

It’s difficult to find an online installation tutorial for CentOS 6; starting from a CentOS 7 distribution would be simpler, but in my landscape the servers are already configured.

I hope you find this article helpful: it covers Master and Compute Node installation and configuration.


This is a schema of the commands, the daemons and the ports to open on your firewall.
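As a concrete starting point, here is a minimal iptables sketch for CentOS 6, assuming the default ports configured later in slurm.conf (6817 for slurmctld on the master, 6818 for slurmd on the compute nodes) and a placeholder cluster network of 10.0.0.0/24; adapt the source network to your topology:

iptables -I INPUT -p tcp --dport 6817 -s 10.0.0.0/24 -j ACCEPT  # on the master, traffic from slurmd/srun
iptables -I INPUT -p tcp --dport 6818 -s 10.0.0.0/24 -j ACCEPT  # on each compute node, traffic from slurmctld/srun
service iptables save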

Master & first computing node installation

Pay attention: the uid and gid must be consistent across all cluster nodes (a quick check follows the user creation below).

create the munge user

export MUNGEUSER=991
groupadd -g $MUNGEUSER munge
useradd -m -c "MUNGE Uid 'N'" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge

create slurm user

export SLURMUSER=990  # set the SLURMUSER variable to 990
groupadd -g $SLURMUSER slurm  # create the slurm group
useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm
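A quick way to verify the uid/gid consistency from the master, assuming root SSH access to the compute node vm-calc-01 used throughout this article:

id munge   # expect uid=991(munge) gid=991(munge)
id slurm   # expect uid=990(slurm) gid=990(slurm)
ssh root@vm-calc-01 'id munge; id slurm'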

install epel-release

yum install epel-release

install munge

yum install munge munge-libs munge-devel -y

install rng-tools

yum install rng-tools -y
rngd -r /dev/urandom

generate random key on master

/usr/sbin/create-munge-key -r
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key

copy the key to all computing nodes

scp /etc/munge/munge.key root@vm-calc-01:/etc/munge/munge.key
chown -R munge: /etc/munge/ /var/log/munge/
chmod 0700 /etc/munge/ /var/log/munge/
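The chown and chmod above must also be applied on each compute node, not only on the master; a minimal sketch, assuming root SSH access to vm-calc-01:

ssh root@vm-calc-01 'chown -R munge: /etc/munge/ /var/log/munge/ && chmod 0700 /etc/munge/ /var/log/munge/'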

start munge

service munge start
chkconfig munge on
munge -n
munge -n | unmunge
munge -n | ssh vm-calc-01 unmunge

benchmark munge on the host

remunge

install slurm package requirements

yum install openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel man2html libibmad libibumad -y

download latest slurm

wget https://download.schedmd.com/slurm/slurm-18.08.3.tar.bz2

install Slurm

tar --bzip2 -x -f slurm*tar.bz2
cd slurm-18.08.3
./configure
make
make install
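Note that a plain ./configure installs under /usr/local and looks for its configuration in /usr/local/etc, which is why symlinks are created later in this article; an alternative (not used here) is to point the build at /etc/slurm directly:

./configure --sysconfdir=/etc/slurm
make
make install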

create slurm directory structure

mkdir /etc/slurm
touch /var/log/slurm.log
touch /var/log/SlurmctldLogFile.log
touch /var/log/SlurmdLogFile.log
mkdir /var/spool/slurmctld
chown slurm: /var/spool/slurmctld
chmod 755 /var/spool/slurmctld
mkdir /var/spool/slurmd
chown slurm: /var/spool/slurmd
chmod 755 /var/spool/slurmd
touch /var/log/slurmctld.log
chown slurm: /var/log/slurmctld.log
touch /var/log/slurm_jobacct.log
chown slurm: /var/log/slurm_jobacct.log
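The /etc/slurm directory and the slurmd spool directory should exist on the compute nodes too, otherwise the scp of slurm.conf below has nowhere to land; a minimal sketch, assuming root SSH access and the same paths as above:

ssh root@vm-calc-01 'mkdir -p /etc/slurm /var/spool/slurmd && chown slurm: /var/spool/slurmd && chmod 755 /var/spool/slurmd'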

create slurm.conf on master

With the online configurator at https://slurm.schedmd.com/configurator.easy.html you can generate the slurm.conf file.

slurm.conf

# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
SlurmctldHost=slurm-master

MailProg=/bin/mail
MpiDefault=none
MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/cgroup


# TIMERS
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300


# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core


# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=
SlurmdDebug=3
SlurmdLogFile=


# COMPUTE NODES
NodeName=vm-calc-01 CPUs=36 State=UNKNOWN
PartitionName=test Nodes=vm-calc-01 Default=YES MaxTime=INFINITE State=UP

vi /etc/slurm/slurm.conf
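The configurator leaves SlurmctldLogFile and SlurmdLogFile empty, so logging falls back to syslog; if you prefer dedicated files, a possible mapping to the files created earlier (an assumption, not required) is:

SlurmctldLogFile=/var/log/slurmctld.log
SlurmdLogFile=/var/log/SlurmdLogFile.log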

copy configuration to the computing nodes

scp /etc/slurm/slurm.conf root@vm-calc-01:/etc/slurm/slurm.conf

configuration test on master

slurmd -C

NodeName=slurm-master CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=32109

configuration test on node-01

slurmd -C

NodeName=vm-calc-01 CPUs=36 Boards=1 SocketsPerBoard=18 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=129059

To test the landscape I stop the firewall and put SELinux (Security-Enhanced Linux) in permissive mode:

service iptables stop
setenforce 0

create link to file

ln -s /etc/slurm/slurm.conf /usr/local/etc/slurm.conf

From the Slurm FAQ at https://slurm.schedmd.com/faq.html, among the other exceptions:

On CentOS 6, also set "ProcessUnpackaged = yes" in the file /etc/abrt/abrt-action-save-package-data.conf.

vi /etc/abrt/abrt-action-save-package-data.conf

start control daemon on master

slurmctld

start slurm daemon on computing nodes

slurmd
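If either daemon does not come up cleanly, running it in the foreground with verbose logging helps troubleshooting, for example:

slurmctld -D -vvv   # on the master
slurmd -D -vvv      # on the compute node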

verify configuration

sinfo

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
test*        up   infinite      1   idle vm-calc-01

install cgroup

yum install libcgroup
service cgconfig start
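To start the cgroup configuration service at boot as well (mirroring what was done for munge), you can use chkconfig:

chkconfig cgconfig on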

vi /etc/slurm/cgroup.conf

cgroup.conf

CgroupMountpoint="/cgroup"
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
AllowedDevicesFile="/etc/slurm/cgroup_allowed_devices_file.conf"
ConstrainCores=yes
TaskAffinity=no
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainDevices=no
AllowedRamSpace=100
AllowedSwapSpace=0
MaxRAMPercent=100
MaxSwapPercent=100
MinRAMSpace=30

vi /etc/slurm/cgroup_allowed_devices_file.conf

cgroup_allowed_devices_file.conf

/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*

ln -s /etc/slurm/cgroup.conf /usr/local/etc/cgroup.conf
ln -s /etc/slurm/cgroup_allowed_devices_file.conf /usr/local/etc/cgroup_allowed_devices_file.conf
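A quick check that the cgroup subsystems are actually mounted (lssubsys ships with the libcgroup package installed above):

lssubsys -am
mount | grep cgroup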

create a batch test on master

vi submit.sh

submit.sh

#!/bin/bash
#
#SBATCH --job-name=test
#
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --output=slurm_%j.out
srun hostname
srun sleep 60

submit batch on master

sbatch submit.sh

verify queue on master

squeue

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    8      test submit.s     root  R       0:50      1 vm-calc-01
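Once the job completes, the file named by --output should contain the hostname of the compute node; with job id 8 from the example above, something like:

cat slurm_8.out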

the end is the new beginning

Remember to re-enable and configure the firewall according to your topology, set up the services to start automatically if needed (for example in rc.local), and configure accounting and a backup controller.
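A minimal rc.local auto-start sketch (an assumption: the paths come from the default /usr/local install prefix, and munge already starts via chkconfig):

# /etc/rc.d/rc.local on the master
/usr/local/sbin/slurmctld

# /etc/rc.d/rc.local on each compute node
/usr/local/sbin/slurmd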