
Why a Cluster?

The latest generation of compute-optimized (C4) instances provides the highest-performing processors and the best price/performance ratio of the EC2 instance types. The largest C4 instance, c4.8xlarge, includes 18 physical cores (36 vCPUs). A CFD simulation requiring a larger number of cores can be performed using a networked cluster of instances. The following instructions demonstrate creating a cluster of 4 instances: 1 master instance, attached to the storage for the entire CFD simulation, and 3 slave instances with minimal attached storage. Note that C4 instances are not eligible for the free tier, so the user is charged while the instances are running.

Cluster Security and Placement Groups (one time only)

A security group is required that allows network access between the instances. To set this up, first create a new security group named cluster using the following steps; an equivalent AWS CLI sketch is given after the steps.

  • From the left hand menu of the AWS console, select Security Groups.
  • Click Create Security Group.
  • In the Create Security Group panel, enter the Group Name cluster and a Description to remind you what this group does (e.g. “Cluster Group”).
  • Click Add Rule, and for Inbound, select SSH from the Type drop-down menu and My IP (or Anywhere) as the Source.
  • Click Create.
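
The same group can also be created with the AWS CLI. The following is only a sketch, assuming the CLI is installed and configured with suitable credentials; the CIDR range is a placeholder for the user's own IP range.

# create the security group (sketch, assuming a configured AWS CLI)
aws ec2 create-security-group --group-name cluster --description "Cluster Group"
# allow inbound SSH; replace 203.0.113.0/24 with your own IP range, or 0.0.0.0/0 for Anywhere
aws ec2 authorize-security-group-ingress --group-name cluster --protocol tcp --port 22 --cidr 203.0.113.0/24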

To allow instances within this security group to access each other, complete the following steps.

  • In Security Groups, select the cluster security group, open the Inbound rules tab and click Edit.
  • Click Add Rule and select All TCP from the Type drop-down menu.
  • Under Source, select Custom and begin typing sg- in the text box until a panel pops up listing the security groups; then select the cluster group itself.
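
The self-referencing rule can similarly be added with the AWS CLI, again as a sketch assuming a configured CLI and a default VPC in which group names can be used directly.

# allow all TCP traffic between instances that are members of the cluster group
aws ec2 authorize-security-group-ingress --group-name cluster --protocol tcp --port 0-65535 --source-group cluster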

Instances need to be connected through a low latency, high speed network. To ensure this, a placement group is needed to group instances within a single Availability Zone. The user should therefore create a placement group, also named cluster.

  • From the left hand menu of the AWS console, select Placement Groups.
  • Click Create Placement Group.
  • In the panel, enter the name cluster and click Create.
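
The equivalent AWS CLI command is shown below as a sketch, assuming a configured CLI; the cluster strategy packs instances together for low-latency, high-bandwidth networking.

# create a placement group using the cluster strategy
aws ec2 create-placement-group --group-name cluster --strategy cluster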

Launching Instances

First launch a single master instance by following the same cloud launch instructions as before but observing the following details.

  • Select an instance that supports placement groups, i.e. any one of the C4 instances.
  • Under Instance Details, select cluster from the Placement Group menu.
  • Under Security Groups, check Select an existing security group and select the cluster group.
  • Under Storage, set a volume size that can accommodate the data for the intended simulations.
  • Click Launch.
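
For reference, a master instance can also be launched from the AWS CLI along the following lines. This is only a sketch: the AMI ID, key pair name and volume size are placeholders to be replaced with the user's own values.

# launch one master instance into the cluster security and placement groups (sketch)
aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type c4.8xlarge \
    --key-name ec2 --security-groups cluster --placement GroupName=cluster \
    --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=100}'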

Now launch the slave instances, selecting minimal storage but using the same security group and placement group. All 3 slave instances can be created at once by specifying 3 instances under Instance Details. There, Auto-assign Public IP can also be set to Disable, because the slave nodes only need to be accessed from the master instance, not from the outside world.
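
A corresponding CLI sketch for the slave instances adds a count of 3 and omits the larger volume, leaving the AMI's default root volume; the AMI ID and key pair name are again placeholders.

# launch 3 slave instances into the same security and placement groups (sketch)
aws ec2 run-instances --image-id ami-xxxxxxxx --count 3 --instance-type c4.8xlarge \
    --key-name ec2 --security-groups cluster --placement GroupName=cluster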

SSH Access to Slave Instances (once per login session)

To run CFD in parallel using domain decomposition, the master instance needs to execute processes on the slave instances.  This is achieved by enabling SSH access from the master instance into the slave instances, which requires a private key that the master instance does not have.  However, SSH agent forwarding allows the private key on the local machine to be used instead, rather than storing it on the master instance (which is not advisable). To use agent forwarding, the private key must first be added to the authentication agent on the local machine with the following command.

ssh-add ${HOME}/.ssh/ec2.pem
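
To confirm that the key has been loaded, list the identities held by the agent; the ec2 key should appear in the output.

ssh-add -l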

Following that, the user will find that SSH login to an instance no longer requires the key to be specified with the -i option. Agent forwarding is applied by logging into the master instance using the -A option, i.e.

ssh -A ubuntu@M.M.M.M
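
Once logged in with -A, agent forwarding can be verified on the master instance by listing the identities again; the key added on the local machine should appear, even though the key file itself is not stored on the master.

ssh-add -l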

Sharing the Master Instance Volume (one time only)

This cluster is set up so that all data is stored on a volume attached to the master instance.  We therefore share the OpenFOAM directory on the master instance across the slave instances using the network file system (NFS) protocol.  This involves exporting the OpenFOAM directory from the master instance with exportfs (one time only); the exported directory is then mounted by all the slave instances using mount.

To export the OpenFOAM directory: from the master instance, add the OpenFOAM directory to the /etc/exports file as superuser (sudo), export the directories listed in that file with exportfs, then start the NFS server with the following terminal commands.

sudo sh -c "echo '/home/ubuntu/OpenFOAM  *(rw,sync,no_subtree_check)' >> /etc/exports"
sudo exportfs -ra
sudo service nfs-kernel-server start
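
The export can be checked on the master instance by listing the currently exported directories, which should include /home/ubuntu/OpenFOAM with the options given in /etc/exports.

sudo exportfs -v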

Mounting the Master Volume from Slaves

For each slave instance, we need to delete the empty directories inside the OpenFOAM directory and use that directory as a mount point for the OpenFOAM directory on the master instance. That requires the private IP address of the master instance, which can be obtained from the AWS console and is denoted here as L.L.L.L. We also need the private IP addresses of the slave instances from the AWS console, denoted here as X.X.X.X, Y.Y.Y.Y and Z.Z.Z.Z. To simplify the process, define a SPIPS environment variable containing the slave private IPs, then delete the directories and mount the master directory on all slave instances using non-interactive SSH with the following commands.

SPIPS="X.X.X.X Y.Y.Y.Y Z.Z.Z.Z"
for IP in $SPIPS ; do ssh $IP 'rm -rf ${HOME}/OpenFOAM/*' ; done
for IP in $SPIPS ; do ssh $IP 'sudo mount L.L.L.L:${HOME}/OpenFOAM ${HOME}/OpenFOAM' ; done

The mounting of the OpenFOAM directory on the master instance can be tested for each slave instance by the following command.

for IP in $SPIPS ; do ssh $IP 'ls ${HOME}/OpenFOAM' ; done

For each slave instance, it should return the ubuntu-〈version〉 directory, e.g. ubuntu-3.0.1.
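
As a further check, the mounted filesystem can be inspected on each slave instance; the source of the directory should be reported as L.L.L.L:/home/ubuntu/OpenFOAM.

for IP in $SPIPS ; do ssh $IP 'df -h ${HOME}/OpenFOAM' ; done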

Running in Parallel on a Cluster

Users can test the cluster on the damBreak tutorial case from the User Guide, which simulates the collapse of a column of water under its own weight. The test involves:

  1. changing to the $FOAM_RUN directory with the run alias;
  2. copying the damBreak case files from the tutorials directory to the current directory;
  3. changing to the damBreak case directory;
  4. generating a mesh for the geometry with the blockMesh utility;
  5. refining the mesh by splitting each (2D) cell 2×2 using the refineMesh utility;
  6. creating the alpha.water field file from backup and initialising with the setFields utility;
  7. decomposing the mesh and fields into 4 subdomains using the decomposePar utility;
  8. running the interFoam solver in parallel with 4 processes.

First log onto the master node using SSH with agent forwarding (-A).

ssh -A ubuntu@M.M.M.M

Then go into the run directory and execute all stages up to running the simulation in parallel with interFoam.

run
cp -r $FOAM_TUTORIALS/multiphase/interFoam/laminar/damBreak .
cd damBreak
blockMesh
refineMesh -overwrite
cp -r 0/alpha.water.org 0/alpha.water
setFields
decomposePar
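
The decomposition into 4 parts is controlled by the system/decomposeParDict file in the case directory; the copy distributed with the damBreak tutorial typically specifies 4 subdomains, matching the 4 instances in this cluster, which can be confirmed as follows.

grep numberOfSubdomains system/decomposeParDict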

Parallel running across a cluster needs a list of the host machines that the user wishes to use. In this example, the host machine names are the private IP addresses of the master and slave instances. Open an editor and create such a file in the damBreak case directory, naming it machines and containing one IP address per line, e.g. as shown below (changing the addresses accordingly).

172.12.12.12
172.12.34.34
172.12.56.56
172.12.78.78

Finally, the user can execute interFoam using the foamJob script with the -p option for parallel running. foamJob -p automatically runs on the number of cores indicated by the processor directories, using the host names in the machines file.

foamJob -p interFoam
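
foamJob redirects the solver output to a file named log in the case directory, which can be monitored with tail -f log. Alternatively, the solver can be launched directly with mpirun; the following is only a sketch, assuming an OpenMPI-based installation, with the hostfile and number of processes given explicitly.

# direct alternative to foamJob, with explicit hostfile and process count (sketch)
mpirun --hostfile machines -np 4 foamExec interFoam -parallel > log 2>&1 &
# monitor the solver output
tail -f log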
