Auto Scale a Cluster

Create a cluster. In this example we will use the following configuration:

  1. Ubuntu 18.04
  2. Slurm scheduler
  3. c5.large instance types
  4. A minimum of 1 compute node and a maximum of 100

While you wait for your cluster to be built (~5 to 10 minutes), go ahead and install our desktop application RONIN LINK, which you will use to connect to the cluster and its monitoring tools.

Once your machine is up and running, open it with RONIN LINK and launch a terminal, or ssh into the machine.

Now that we are on our cluster, take a breath: you just launched an auto-scaling cluster in about 8 minutes. It's all yours, and you will be the only one in the queue...ah, nice. Surely that's worth a tweet :)

Ok, let's jump into our shared apps directory. This directory will be shared across all the compute nodes when they are launched by the auto scaling.

cd /apps/

Spack is installed on all our clusters and provides easy access to loads of applications. For more info, see the Spack website.

Now install the stress application via Spack. Spack will install the application and all of its dependencies in the shared "/apps" folder...how awesome is that?

spack install stress

Next, create our "stress" script. This will be the "job" we run on the cluster.

vim stress.sh

Paste the following into the file and save it:

#!/bin/bash
#SBATCH --nodes=1            # each job needs a single node
#SBATCH --ntasks-per-node=1  # and runs one task on that node
spack load stress            # put the stress binary on the PATH
stress --cpu 2 --timeout 300s --verbose  # spin up 2 CPU workers for 300 seconds

This script will stress 2 CPUs for 5 minutes (a c5.large instance has 2 vCPUs).
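If you want to match the --cpu count to whatever instance type you picked, a quick sanity check is to ask the node how many CPUs it actually has:

```shell
# Count the CPUs visible on the current node (a c5.large provides 2 vCPUs),
# so you can pick a sensible value for "stress --cpu N".
ncpu=$(nproc)
echo "This node has ${ncpu} CPU(s) available"
```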

Make the stress script executable.

chmod +x stress.sh

Run our script on Slurm:

sbatch stress.sh
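On success, sbatch prints a line of the form "Submitted batch job <id>". If you later want to script against that job, a common trick is to capture the ID from that line. Here is a minimal sketch (the job ID 42 is a made-up stand-in, since we use an echoed string in place of a real sbatch call):

```shell
# Stand-in for: submit_output=$(sbatch stress.sh)
submit_output="Submitted batch job 42"

# The job ID is the last word of sbatch's output line.
job_id=${submit_output##* }
echo "Submitted job id: ${job_id}"
```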

Now give it a second or so and check the queue:

squeue

You should see your job listed in the queue, in the running (R) or pending (PD) state.

Now submit a couple of job arrays to load up the scheduler:

sbatch -a 1-25 stress.sh
sbatch -a 1-20 stress.sh
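Inside each task of an array job, Slurm sets the SLURM_ARRAY_TASK_ID environment variable, which is how the same script can do different work per task. The loop below simulates three task IDs locally (no Slurm needed) to sketch the idea of each task picking its own input file; the input_N.txt names are just illustrative:

```shell
# Simulate what three tasks of "sbatch -a 1-3 myjob.sh" would each see:
# Slurm exports a different SLURM_ARRAY_TASK_ID to every task.
for SLURM_ARRAY_TASK_ID in 1 2 3; do
  echo "task ${SLURM_ARRAY_TASK_ID} would process input_${SLURM_ARRAY_TASK_ID}.txt"
done
```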

Sit back and don't stress.

You can check the Slurm scheduler by running this command:

squeue

We can now see that there are more jobs than the current compute can run; the extras sit in the queue in the pending (PD) state.

Go back to RONIN and refresh the machine card. The number of compute nodes should be growing. (It may take a minute or two.)

If we look in our terminal and run:

squeue

We should see some jobs being allocated.

🕙
Have patience: even though RONIN may have launched the extra compute nodes, they still need some time to get set up and become available to the scheduler.

But wait, there is more!!

Open the machine in our desktop application RONIN LINK and click the "Connect to Machine" button. If you don't see this button in your RONIN LINK, you really should update it here.

Scroll down and click the "Link" button on the Ganglia card.

This will ask you for your local computer's password, because we are trying to map a port below 1024. Enter your laptop or desktop password.

A tab will now open in your browser with some really "cool" 90's graphs.

These are the monitoring tools for your cluster. You can learn more about Ganglia here or ask the RONIN Community.

Your cluster will also scale back down when it has run all the jobs.

💡
The compute nodes will idle for around 10 minutes before they are terminated, which should give everything enough time to finish. The other benefit is that if more jobs come into the queue, the idle nodes can pick them up straight away.

Welcome to the club, you are officially a nerd!