High-Performance Computing with AWS ParallelCluster

May 7, 2019 5 minutes read

Last modified on December 2, 2020

cloud • hpc • devops

Overview

AWS ParallelCluster is a toolkit for automating the process of building, configuring, and managing clusters of virtual machines in the Amazon Elastic Compute Cloud (EC2). These clusters can be used much like traditional HPC clusters, as illustrated below.

Operating principle of a traditional resource-manager based computing cluster.

The software is essentially a command-line tool written in Python that provides simple commands for creating, updating, stopping, starting, and deleting HPC clusters in the AWS EC2 cloud. To provide this functionality, the locally installed CLI calls the web APIs of existing AWS services. The most important of these is AWS CloudFormation, which facilitates the creation and management of a collection of related AWS resources; the set of nodes belonging to a specific HPC cluster can be such a collection. Another essential service is Auto Scaling, which dynamically scales the number of compute nodes with the current workload. As a consequence, ParallelCluster can be configured to automatically add nodes when jobs are pending and cannot run because the cluster is fully utilized.

Cloud elasticity allows automated addition and deletion of nodes depending on the current job queue size.
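To make the idea of elastic scaling concrete, here is a minimal sketch of a scaling decision. It is a toy model, not the algorithm ParallelCluster or Auto Scaling actually uses; the function name, its parameters, and the one-job-per-node default are all assumptions made for illustration.

```python
import math


def desired_node_count(pending_jobs, busy_nodes, min_nodes, max_nodes,
                       jobs_per_node=1):
    """Return the node count a toy elastic scaler would request.

    Hypothetical model: every pending job needs a free slot, idle nodes
    may be released, and the result is clamped to [min_nodes, max_nodes].
    """
    if min_nodes > max_nodes:
        raise ValueError("min_nodes must not exceed max_nodes")
    # Extra nodes needed so that every pending job can start.
    extra_nodes = math.ceil(pending_jobs / jobs_per_node)
    wanted = busy_nodes + extra_nodes
    # Never drop below the minimum or grow past the maximum.
    return max(min_nodes, min(max_nodes, wanted))


# An empty queue shrinks the cluster to its minimum size.
print(desired_node_count(pending_jobs=0, busy_nodes=0, min_nodes=1, max_nodes=5))   # 1
# A long queue grows the cluster, but only up to the configured maximum.
print(desired_node_count(pending_jobs=10, busy_nodes=2, min_nodes=1, max_nodes=5))  # 5
```

The clamping step is the important part: the queue size drives the request, while the configured minimum and maximum bound the cost.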

ParallelCluster is maintained directly by Amazon Web Services (AWS) and developed as an open source project accessible on GitHub. While the restriction to AWS is certainly a disadvantage, official support from and for a specific cloud provider results in a stable and regularly updated product. The latter is especially important for any framework facilitating cluster computing in the cloud, because the tool has to keep up with new operating system versions, updates to cluster computing software, and changes in the product range of the supported cloud providers.

Performance Compared to On-Site Clusters

HPC in the cloud faces complex challenges that have to be overcome in order to compete with on-site clusters. Two stand out: first, the interconnection networks between machines in cloud data centers are most often not built to support HPC applications; second, virtualization reduces performance.

As early as 2009, Napper et al. used the LINPACK benchmark to assess the performance of HPC clusters allocated in the Amazon EC2 cloud. Their sobering conclusion was that, at least for communication-intensive problems, HPC in the cloud could not yet make use of its potential strengths regarding scalability and cost savings. On the contrary, they showed that the achieved floating point operations per second (FLOPS) per dollar spent decreased exponentially as the allocated cluster grew. This was, as expected, due to the slow interconnection network and the limited memory available on single nodes. One year later, in 2010, Jackson et al. performed similar experiments and also encountered the familiar problem of slow network connections. They observed a clear correlation between the runtime of different distributed applications and the amount of time spent communicating. However, Jackson et al. also showed that HPC cloud clusters can scale well for applications requiring little communication between computing nodes.

The situation has since progressed further. Real-world problems are often embarrassingly parallel or come close to it, which makes the benefits of cloud computing available even with sub-par interconnection networks. This development is illustrated by case studies from different scientific fields, especially the computational life sciences. Novartis used the Amazon EC2 cloud with 87 000 on-demand CPU cores to vastly decrease computation time for a virtual screening (VS) experiment, a technique explained in detail in Section 2.5 of the thesis this post draws on. HGST built a similarly powerful cloud cluster of 70 000 cores to run a simulation for finding an optimal hard drive head design.

Getting Started

Perhaps the most important benefit of using a cloud cluster is that it can be set up in a couple of minutes. Provisioning a physical HPC cluster requires competent system administrators, air-conditioned space for the machines, a large up-front hardware investment, and more.

Getting started with AWS ParallelCluster only requires an AWS account and the ParallelCluster software. Detailed instructions for creating an account can be found on the official AWS website. The installation procedure for ParallelCluster is also well described in the official docs.

Before you can configure ParallelCluster, you need to create an AWS access key. To do this, click on your account name in the top right of the AWS Management Console and choose the option My Security Credentials. Additionally, you should create an AWS EC2 key pair via Services -> EC2 -> Key Pairs -> Create Key Pair.

Finally, you can set up your local ParallelCluster installation. Simply type

pcluster configure

in a terminal window. ParallelCluster will ask for the access key you just created. The automated configuration process can also retrieve default values for master_subnet_id and vpc_id, which are otherwise a little tedious to find. In general, you can just press enter on all questions except aws_access_key_id and aws_secret_access_key. More sophisticated configuration can easily be applied later by directly modifying the file ~/.parallelcluster/config.

The following is a basic configuration for an elastic cluster with one to five compute nodes:

[aws]
aws_region_name = eu-central-1
aws_access_key_id = <AWS_ACCESS_KEY_ID>
aws_secret_access_key = <AWS_SECRET_ACCESS_KEY>

[cluster default]
vpc_settings = public
key_name = <AWS_KEY_NAME>
ebs_settings = custom
post_install = <POST_INSTALL_SCRIPT_DOWNLOAD_PATH>
base_os = ubuntu1604
maintain_initial_size = true
compute_instance_type = c4.2xlarge
initial_queue_size = 1
max_queue_size = 5

[vpc public]
master_subnet_id = <AWS_MASTER_SUBNET_ID>
vpc_id = <AWS_VPC_ID>

[global]
update_check = true
sanity_check = true
cluster_template = default

[ebs custom]
ebs_snapshot_id = <SNAPSHOT_ID>
volume_type = gp2
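The configuration file is INI-style, and named sections refer to each other: cluster_template selects a [cluster ...] section, which in turn points to a [vpc ...] and an [ebs ...] section. The sketch below follows these references with Python's standard configparser module. It assumes the file can be read by configparser and that maintain_initial_size makes initial_queue_size the lower bound of the node range; summarize_cluster and the trimmed EXAMPLE_CONFIG are hypothetical helpers, not part of ParallelCluster.

```python
import configparser

# Trimmed copy of the configuration above, placeholders left untouched.
EXAMPLE_CONFIG = """
[global]
cluster_template = default

[cluster default]
vpc_settings = public
ebs_settings = custom
maintain_initial_size = true
initial_queue_size = 1
max_queue_size = 5

[vpc public]
master_subnet_id = <AWS_MASTER_SUBNET_ID>
vpc_id = <AWS_VPC_ID>

[ebs custom]
volume_type = gp2
"""


def summarize_cluster(config_text):
    """Resolve the section references of a ParallelCluster-style config."""
    # interpolation=None keeps a literal '%' in a value from being
    # interpreted as configparser interpolation syntax.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)

    template = parser["global"]["cluster_template"]
    cluster = parser[f"cluster {template}"]

    initial = cluster.getint("initial_queue_size")
    maximum = cluster.getint("max_queue_size")
    if initial > maximum:
        raise ValueError("initial_queue_size exceeds max_queue_size")
    # Assumption: without maintain_initial_size the cluster may shrink to 0.
    keep_initial = cluster.getboolean("maintain_initial_size", fallback=False)
    minimum = initial if keep_initial else 0

    return {
        "template": template,
        "vpc": dict(parser[f"vpc {cluster['vpc_settings']}"]),
        "ebs": dict(parser[f"ebs {cluster['ebs_settings']}"]),
        "node_range": (minimum, maximum),
    }


summary = summarize_cluster(EXAMPLE_CONFIG)
print(summary["node_range"])  # (1, 5)
```

A check like this catches a misspelled section reference or an inverted queue size locally, before any AWS resources are requested.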

Once your configuration is finished, you can create your first cloud cluster with a simple:

pcluster create <cluster-name>
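If you want to drive the cluster lifecycle from a script, a thin wrapper around the CLI may be enough. The following sketch assumes the sub-commands follow the same positional form as pcluster create <cluster-name> shown above and are named after the operations listed in the overview (create, update, stop, start, delete); build_pcluster_command and run_pcluster are hypothetical helpers. By default nothing is executed and the command line is only returned.

```python
import shutil
import subprocess

# Lifecycle operations mentioned in this post; the sub-command names are
# an assumption based on the "pcluster create <cluster-name>" form.
LIFECYCLE_COMMANDS = ("create", "update", "stop", "start", "delete")


def build_pcluster_command(action, cluster_name):
    """Build the argument list for one pcluster lifecycle command."""
    if action not in LIFECYCLE_COMMANDS:
        raise ValueError(f"unsupported action: {action!r}")
    if not cluster_name or cluster_name.startswith("-"):
        raise ValueError("cluster_name must be a non-empty name, not an option")
    return ["pcluster", action, cluster_name]


def run_pcluster(action, cluster_name, dry_run=True):
    """Return the command line; execute it only when dry_run is False."""
    command = build_pcluster_command(action, cluster_name)
    if not dry_run:
        if shutil.which("pcluster") is None:
            raise FileNotFoundError("pcluster executable not found on PATH")
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, check=True)
    return " ".join(command)


print(run_pcluster("create", "my-first-cluster"))  # pcluster create my-first-cluster
```

Passing the arguments as a list rather than a shell string avoids quoting problems with unusual cluster names.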

Naturally, for working with AWS and ParallelCluster, you will also want to consult the official documentation.

Parts of this blog post have been taken from my thesis on Providing Transparent Remote Access to HPC Resources for Graphical Desktop Applications. If you want to read more about working with AWS ParallelCluster, you can freely download it here. For a brief description of an in-production AWS ParallelCluster use case, check out the journal article that has emerged from the thesis.
