Copy-paste ready commands to set up SGE, PBS/TORQUE, or SLURM clusters
Introduction
Local HPC clusters continue to play a vital role in scientific research. Many universities, research institutions, and companies continue to maintain on-site clusters despite the efforts of the big cloud providers to expand their offerings into this area.
The software systems responsible for making these clusters of computers work together are called Distributed Resource Management Systems (DRMS). The most commonly used ones are SGE, PBS/TORQUE, and SLURM. Unfortunately, setting them up can be very painful: there is a lot of outdated information on the internet, you have to fight with several different configuration files, and minor issues can mean that no command runs properly.
If your task is to set up a large system that will be used for many years and you have several days to do it properly, this might not be a problem. However, if you just want to play around with the different systems, spending hours before you can submit your first job is a bleak prospect.
Therefore, I present copy-paste-ready instructions for setting up SGE, PBS/TORQUE, or SLURM on a single machine, which will act as the master and as a compute node at the same time. All guides were tested on Ubuntu 18.04 but should work with little modification on most recent Linux distributions. The commands can also serve as the basis for container images or other automated setup workflows.
Naturally, caution is warranted when executing commands as a super user. If you apply the listed commands on a production system, please make sure you understand everything that is done.
Slurm
Trying to set up SLURM on my development machine in order to run some test suites that need to interface with SLURM cost me quite a bit of time. There are plenty of extensive tutorials available; in contrast to SGE and PBS, there is even up-to-date official documentation. However, those resources cover much more than is needed for a single-node cluster, and it is easy to get lost in the many configuration options that you will not need at first.
Fortunately, I found this thread in the Ubuntu forums, which is the basis for my SLURM instructions.
Start by installing munge and slurm:
sudo apt install munge slurm-wlm
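Munge provides the authentication layer between the Slurm daemons and should be started automatically by the package. If you want to make sure it works before continuing, a quick optional check with the standard munge tooling is:
systemctl status munge
munge -n | unmunge
The second command creates a throwaway credential and immediately decodes it; it should report a Success status.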
Then, create a new file at /etc/slurm-llnl/slurm.conf with the following content.
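The exact values are not critical for a 1-node setup; the sketch below assumes the default directories of the Ubuntu 18.04 slurm-wlm package, so adjust the paths and the CPU count to your machine:
ClusterName=localcluster
ControlMachine=<YOUR-HOST-NAME>
SlurmUser=slurm
AuthType=auth/munge
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
ProctrackType=proctrack/pgid
ReturnToService=2
NodeName=<YOUR-HOST-NAME> CPUs=1 State=UNKNOWN
PartitionName=debug Nodes=<YOUR-HOST-NAME> Default=YES MaxTime=INFINITE State=UP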
You only need to replace <YOUR-HOST-NAME> with your actual hostname. Simply execute the hostname command if you are not sure what it is. Of course, you can also modify other settings, such as the number of CPUs, but the basic configuration from above should be enough for our 1-node cluster.
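For reference, the two standard Linux commands below print the values you are most likely to adjust; neither is part of Slurm itself:
hostname
nproc
hostname gives the value for <YOUR-HOST-NAME>, and nproc shows how many CPUs the machine has.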
We are almost finished. Just enable and start the manager slurmctld:
sudo systemctl enable slurmctld
sudo systemctl start slurmctld
Finally, enable and start the agent slurmd:
sudo systemctl enable slurmd
sudo systemctl start slurmd
Congratulations, your Slurm system should be up and running! Use sinfo to check the status of the manager and the agent. The command scontrol show node will give you information about your node setup.
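To also verify that jobs actually execute, you can send a trivial command through the scheduler; srun is part of the slurm-wlm package and uses the default partition from the configuration above:
srun hostname
If this prints your hostname, scheduling and execution work end to end.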
If Slurm did not start, fear not, there is likely only a small fix required. Start by looking for an error message with
systemctl status slurmd.service
Does it say that the slurmd.pid file could not be opened? If so, your slurm.conf file probably has different values for SlurmctldPidFile and SlurmdPidFile than your slurmctld.service file. Check by printing the latter:
less /usr/lib/systemd/system/slurmctld.service
It might show PIDFile=/run/slurmctld.pid, which means that we need to change SlurmctldPidFile and SlurmdPidFile in our /etc/slurm-llnl/slurm.conf to the following:
SlurmctldPidFile=/run/slurmctld.pid
SlurmdPidFile=/run/slurmd.pid
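If you prefer not to edit the file by hand, the following sed commands apply the same change (assuming the paths shown above):
sudo sed -i 's|^SlurmctldPidFile=.*|SlurmctldPidFile=/run/slurmctld.pid|' /etc/slurm-llnl/slurm.conf
sudo sed -i 's|^SlurmdPidFile=.*|SlurmdPidFile=/run/slurmd.pid|' /etc/slurm-llnl/slurm.conf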
After that, try again to start slurmctld and slurmd with the above commands.
SGE
This one is not so simple.
Some parts of the following setup are taken from a public Dockerfile created by Robert Syme.
First, gain root permissions. On Ubuntu you can type sudo -i. All commands have to be executed with root permissions.
Then, create a new folder, let’s say via mkdir /opt/sge/installfolder. Set the new folder’s location as an environment variable in the current shell:
export INSTALLFOLDER=/opt/sge/installfolder
Then, execute the following commands:
cd $INSTALLFOLDER
wget https://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/sge-common_8.1.9_all.deb
wget https://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/sge-doc_8.1.9_all.deb
wget https://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/sge_8.1.9_amd64.deb
dpkg -i ./*.deb
Then, download the following 4 files (sge_auto_install.conf, sge_exec_host.conf, sge_hostgrp.conf, and sge_init.sh) and place them also into the created folder, in our case into /opt/sge/installfolder:
Most of the magic will happen through those scripts and configuration files. After the download, we need to set some environment variables in the current shell:
export SGE_ROOT=/opt/sge
export SGE_CELL=default
We also need to set a new profile.d config via
ln -s $SGE_ROOT/$SGE_CELL/common/settings.sh /etc/profile.d/sge_settings.sh
Then, execute the following to install SGE and perform setup operations:
cd $SGE_ROOT && ./inst_sge -m -x -s -auto $INSTALLFOLDER/sge_auto_install.conf \
&& sleep 10 \
&& /etc/init.d/sgemaster.sge-cluster restart \
&& /etc/init.d/sgeexecd.sge-cluster restart \
&& sed -i "s/HOSTNAME/`hostname`/" $INSTALLFOLDER/sge_exec_host.conf \
&& sed -i "s/HOSTNAME/`hostname`/" $INSTALLFOLDER/sge_hostgrp.conf \
&& /opt/sge/bin/lx-amd64/qconf -Me $INSTALLFOLDER/sge_exec_host.conf
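At this point you can check that the qmaster and the execution daemon see each other; qhost and qstat are standard SGE commands installed by the packages above:
/opt/sge/bin/lx-amd64/qhost
/opt/sge/bin/lx-amd64/qstat -f
qhost should list this machine as an execution host, and qstat -f should show the configured queue instances.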
Now, our new cluster is already up and running. However, we still need to add users to the sgeusers group, which was defined in the sge_hostgrp.conf file you just applied. Only users from this group are allowed to submit jobs. Therefore, we run the following for every user that should be able to submit, replacing <USER> with the actual username:
/opt/sge/bin/lx-amd64/qconf -au <USER> sgeusers
Finally, run $INSTALLFOLDER/sge_init.sh. You can now delete the complete $INSTALLFOLDER, or at least run
rm $INSTALLFOLDER/*.deb
After a restart to load the new profile.d settings, all users that were added to sgeusers should be able to submit jobs. Test this via:
echo "echo Running test from $HOSTNAME" | qsub
PBS / Torque
This is a little simpler again. Most of the setup is taken from another blog. Some inspiration is also from a public Dockerfile available at Docker Hub.
First, gain root permissions. On Ubuntu you can type sudo -i. All commands have to be executed with root permissions.
We start by installing the relevant packages:
apt-get install torque-server torque-client torque-mom torque-pam
Installing these packages will create a default setup. Unfortunately, it is complex and would require many changes to get to a working cluster. Instead, we stop all Torque services and create a clean setup:
/etc/init.d/torque-mom stop
/etc/init.d/torque-scheduler stop
/etc/init.d/torque-server stop
pbs_server -t create
killall pbs_server
We start by setting localhost as the server host and allowing root to change the database configuration:
echo localhost > /etc/torque/server_name
echo localhost > /var/spool/torque/server_priv/acl_svr/acl_hosts
echo root@localhost > /var/spool/torque/server_priv/acl_svr/operators
echo root@localhost > /var/spool/torque/server_priv/acl_svr/managers
With the following commands, we also set up the machine as a compute node with 4 cores available; replace SERVER.DOMAIN with your machine's hostname:
echo "SERVER.DOMAIN np=4" > /var/spool/torque/server_priv/nodes
echo localhost > /var/spool/torque/mom_priv/config
Now, we can already start the daemon processes again:
/etc/init.d/torque-server start
/etc/init.d/torque-scheduler start
/etc/init.d/torque-mom start
After this, qmgr is ready to start the scheduler:
qmgr -c 'set server scheduling = true'
qmgr -c 'set server keep_completed = 300'
qmgr -c 'set server mom_job_sync = true'
The following commands create a default queue and configure it:
qmgr -c 'create queue batch'
qmgr -c 'set queue batch queue_type = execution'
qmgr -c 'set queue batch started = true'
qmgr -c 'set queue batch enabled = true'
qmgr -c 'set queue batch resources_default.walltime = 3:00:00'
qmgr -c 'set queue batch resources_default.nodes = 1'
qmgr -c 'set server default_queue = batch'
Finally, we allow our machine to submit to the new cluster; replace <hostname> with your machine's hostname:
qmgr -c 'set server submit_hosts = <hostname>'
qmgr -c 'set server allow_node_submit = true'
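Before submitting anything, it is worth checking that the node registered correctly; pbsnodes and qmgr are part of the Torque packages installed above:
pbsnodes -a
qmgr -c 'print server'
The node should eventually report state = free, and print server dumps the complete server configuration for review.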
To test the setup, you can use the following command:
echo "echo Running test from $HOSTNAME" | qsub
Conclusion
The target audience for this post is probably very small. I decided to write it anyway, as there might be someone who benefits immensely. Personally, I would have saved a lot of time if I had found such a blog post a year ago. Also, I might need it again myself at some point.