Copy-paste-ready commands to set up SGE, PBS/TORQUE, or SLURM clusters


Introduction

Local HPC clusters continue to play a vital role in scientific research. Many universities, research institutions, and companies still maintain on-site clusters despite the efforts of the big cloud providers to expand their offerings into this area.

The software systems responsible for making these clusters of computers work together are called Distributed Resource Management Systems (DRMS). The most commonly used ones are SGE, PBS/TORQUE, and SLURM. Unfortunately, setting them up can be painful: much of the information on the internet is outdated, you have to fight with several different configuration files, and minor issues can mean that no command runs properly.

Figure: Operating principle of a traditional resource-manager-based computing cluster.

If your task is to set up a large system that will be used for many years and you have several days to do this perfectly, this might not be a problem. However, if you just want to play around with the different systems, spending several hours before you can submit your first job is a bleak outlook.

Therefore, I present copy-paste-ready instructions for setting up SGE, PBS/TORQUE, or SLURM on a single machine, which will act as the master and a compute node at the same time. All guides are tested on Ubuntu 18.04, but should work with little modification on most recent Linux installations. These commands can also serve to create container images or other automated setup workflows.

Naturally, caution is warranted when executing commands as a super user. If you apply the listed commands on a production system, please make sure you understand everything that is done.

Slurm

Trying to set up SLURM on my development machine in order to run some test suites that need to interface with SLURM cost me quite a bit of time. There are plenty of extensive tutorials available, and in contrast to SGE and PBS, there is even up-to-date official documentation. However, those resources cover much more than is needed for a single-node cluster, and it is easy to get lost in the many configuration options that you will not need at first.

Fortunately, I found this thread in the Ubuntu forums, which is the basis for my SLURM instructions.

Start by installing munge and slurm:

sudo apt install munge slurm-wlm
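
SLURM uses MUNGE for authentication between its daemons. On Ubuntu, installing the package normally generates a key and starts the service automatically, but it does not hurt to verify that credentials can be created and validated before continuing. A quick check could look like this:

# Confirm that the munge daemon is active
systemctl status munge

# Create a credential and decode it again; this should report STATUS: Success
munge -n | unmunge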

Then, create a new file at /etc/slurm-llnl/slurm.conf with the following content:

ControlMachine=<YOUR-HOST-NAME>

MpiDefault=none

ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/builtin
SelectType=select/linear
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=<YOUR-HOST-NAME>
JobAcctGatherType=jobacct_gather/none
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#
# COMPUTE NODES
NodeName=<YOUR-HOST-NAME> CPUs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
PartitionName=long Nodes=<YOUR-HOST-NAME> Default=YES MaxTime=INFINITE State=UP

You only need to replace <YOUR-HOST-NAME> with your actual hostname. Simply execute the hostname command if you are not sure what it is. Of course, you can also modify other settings, such as the number of CPUs, but the basic configuration from above should be enough for our 1-node cluster.
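
If you would rather not edit the placeholders by hand, you can also save the file with the literal <YOUR-HOST-NAME> placeholders and substitute them afterwards. A small sketch, assuming the config lives at the path used above:

# Replace every placeholder with this machine's hostname
sudo sed -i "s/<YOUR-HOST-NAME>/$(hostname)/g" /etc/slurm-llnl/slurm.conf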

We are almost finished; just enable and start the manager, slurmctld:

sudo systemctl enable slurmctld
sudo systemctl start slurmctld

Finally, enable and start the agent slurmd:

sudo systemctl enable slurmd
sudo systemctl start slurmd

Congratulations, your Slurm system should be up and running! Use sinfo to check the status of the manager and the agent. The command scontrol show node will give you information about your node setup.
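
To go one step further than sinfo, you can submit a small test job. srun runs a command through the scheduler interactively, while sbatch --wrap queues it in the background; both are standard Slurm commands, so this should work on any healthy single-node setup:

srun hostname                 # should print your hostname via the scheduler
sbatch --wrap="sleep 30"      # queue a small background job
squeue                        # the sleep job should show up here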

If Slurm did not start, fear not: there is likely only a small fix required. Start by looking for an error message with

systemctl status slurmd.service

Does it say that the slurmd.pid file could not be opened? If so, your slurm.conf file probably has different values for SlurmctldPidFile and SlurmdPidFile than your slurmctld.service file. Check by printing the latter:

less /usr/lib/systemd/system/slurmctld.service

It might show PIDFile=/run/slurmctld.pid, which means that we need to change SlurmctldPidFile and SlurmdPidFile in our /etc/slurm-llnl/slurm.conf to the following:

SlurmctldPidFile=/run/slurmctld.pid
SlurmdPidFile=/run/slurmd.pid

After that, try again to start slurmctld and slurmd with the above commands.
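
For instance, the whole retry could look like this; assuming the PID file paths now match, sinfo should report the node as idle:

sudo systemctl restart slurmctld slurmd
systemctl status slurmd
sinfo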

SGE

This one is not so simple.

Some parts of the following setup are taken from a public Dockerfile created by Robert Syme.

First, gain root permissions. On Ubuntu you can type sudo -i. All commands have to be executed with root permissions.

Then, create a new folder, let’s say via mkdir /opt/sge/installfolder. Set the new folder’s location as an environment variable in the current shell via

export INSTALLFOLDER=/opt/sge/installfolder

Then, execute the following commands:

cd $INSTALLFOLDER
wget https://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/sge-common_8.1.9_all.deb
wget https://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/sge-doc_8.1.9_all.deb
wget https://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/sge_8.1.9_amd64.deb
dpkg -i ./*.deb

Then, download the following four files and place them in the created folder as well, in our case /opt/sge/installfolder:

  1. sge_init.sh
  2. sge_auto_install.conf
  3. sge_hostgrp.conf
  4. sge_exec_host.conf

Most of the magic will happen through those scripts and configuration files. After the download, we need to set some environment variables in the current shell:

export SGE_ROOT=/opt/sge
export SGE_CELL=default

We also need to set a new profile.d config via

ln -s $SGE_ROOT/$SGE_CELL/common/settings.sh /etc/profile.d/sge_settings.sh

Then, execute the following to install SGE and perform setup operations:

cd $SGE_ROOT && ./inst_sge -m -x -s -auto $INSTALLFOLDER/sge_auto_install.conf \
&& sleep 10 \
&& /etc/init.d/sgemaster.sge-cluster restart \
&& /etc/init.d/sgeexecd.sge-cluster restart \
&& sed -i "s/HOSTNAME/`hostname`/" $INSTALLFOLDER/sge_exec_host.conf \
&& sed -i "s/HOSTNAME/`hostname`/" $INSTALLFOLDER/sge_hostgrp.conf \
&& /opt/sge/bin/lx-amd64/qconf -Me $INSTALLFOLDER/sge_exec_host.conf

Now, our new cluster is already up and running. However, we still need to add users to the sgeusers group, which was defined in the sge_hostgrp.conf file you just applied. Only users from this group are allowed to submit jobs. Therefore, we run the following:

/opt/sge/bin/lx-amd64/qconf -au <USER> sgeusers
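
If you want to double-check which users made it into the access list, qconf can print it. This assumes the same installation path as above:

# Show the members of the sgeusers access list
/opt/sge/bin/lx-amd64/qconf -su sgeusers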

Finally, run $INSTALLFOLDER/sge_init.sh.

You can now delete the complete $INSTALLFOLDER, or at least run

rm $INSTALLFOLDER/*.deb

After a restart to load the new profile.d settings, all users that were added to sgeusers should be able to submit jobs. Test this via:

echo "echo Running test from $HOSTNAME" | qsub

PBS / Torque

This is a little simpler again. Most of the setup is taken from another blog. Some inspiration is also from a public Dockerfile available at Docker Hub.

First, gain root permissions. On Ubuntu you can type sudo -i. All commands have to be executed with root permissions.

We start by installing the relevant packages:

apt-get install torque-server torque-client torque-mom torque-pam

Installing these packages creates a default setup. Unfortunately, it is convoluted and would require substantial changes to get to a working cluster. Instead, we stop all Torque services and create a clean setup:

/etc/init.d/torque-mom stop
/etc/init.d/torque-scheduler stop
/etc/init.d/torque-server stop
pbs_server -t create
killall pbs_server

Next, we set localhost as the server host and allow root to change the server database configuration:

echo localhost > /etc/torque/server_name
echo localhost > /var/spool/torque/server_priv/acl_svr/acl_hosts
echo root@localhost > /var/spool/torque/server_priv/acl_svr/operators
echo root@localhost > /var/spool/torque/server_priv/acl_svr/managers

With the following commands, we also register our machine as a compute node with 4 cores available; replace SERVER.DOMAIN with your actual hostname (the output of the hostname command):

echo "SERVER.DOMAIN np=4" > /var/spool/torque/server_priv/nodes
echo localhost > /var/spool/torque/mom_priv/config

Now, we can already start the daemon processes again:

/etc/init.d/torque-server start
/etc/init.d/torque-scheduler start
/etc/init.d/torque-mom start
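
After a few seconds, the compute node should have registered with the server. You can check its state with pbsnodes; it should eventually report the node as free rather than down:

pbsnodes -a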

After this, qmgr is ready and we can turn on scheduling:

qmgr -c 'set server scheduling = true'
qmgr -c 'set server keep_completed = 300'
qmgr -c 'set server mom_job_sync = true'

The following commands create a default queue and configure it:

qmgr -c 'create queue batch'
qmgr -c 'set queue batch queue_type = execution'
qmgr -c 'set queue batch started = true'
qmgr -c 'set queue batch enabled = true'
qmgr -c 'set queue batch resources_default.walltime = 3:00:00'
qmgr -c 'set queue batch resources_default.nodes = 1'
qmgr -c 'set server default_queue = batch'
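
You can review the resulting server and queue configuration at any time with qmgr's print command, which is handy for spotting typos in the settings above:

qmgr -c 'print server'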

Finally, we allow our machine to submit to the new cluster; replace <hostname> with your actual hostname:

qmgr -c 'set server submit_hosts = <hostname>'
qmgr -c 'set server allow_node_submit = true'

To test the setup, you can use the following command:

echo "echo Running test from $HOSTNAME" | qsub

Conclusion

The target audience for this post is probably very small. I decided to write it anyway, as there might be someone who can benefit immensely. Personally, I would have saved a lot of time if I had found such a blog post a year ago. Also, I might need it again myself at some point.

