Using Spot Instances to Reduce Cluster Costs

What is a Spot Instance?

A Spot Instance is a standard AWS EC2 instance (machine) that is available for less than the On-Demand price because it runs on spare EC2 capacity. By requesting this unused capacity at a discounted price (typically 70-90% less than On-Demand), you can significantly reduce your cluster costs.

How Do Spot Instances Work?

The hourly price for a Spot Instance is called the Spot price. Spot prices are set by AWS and adjust gradually based on longer-term supply and demand for spare EC2 capacity, so they are quite predictable. You can see the current Spot price for instances within your region here.

In RONIN, requesting Spot Instances for your auto scale cluster nodes is as simple as selecting the ENABLE SPOT button during step 10 of cluster creation and setting a maximum Spot price.

The maximum Spot price is the most you are willing to pay per Spot Instance hour. If Spot capacity is available for your selected instance type, and the current Spot price is at or below your specified maximum, you will pay the current Spot market price (not your maximum) for your compute nodes.
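
The pricing rule above can be sketched in a few lines of Python. This is a simplified model for illustration, not RONIN or AWS code: the function name spot_hourly_charge, its parameters, and the prices are all hypothetical.

```python
from typing import Optional


def spot_hourly_charge(current_spot_price: float,
                       max_spot_price: float,
                       capacity_available: bool = True) -> Optional[float]:
    """Return the hourly price paid per node, or None if the request is not fulfilled.

    Illustrative model only; names and prices are hypothetical.
    """
    if not capacity_available:
        return None  # no spare capacity for this instance type
    if current_spot_price > max_spot_price:
        return None  # market price is above the maximum you set
    # You pay the current Spot market price, not your maximum.
    return current_spot_price


if __name__ == "__main__":
    # Hypothetical prices in USD per instance hour.
    print(spot_hourly_charge(0.12, 0.40))  # fulfilled: charged the market price
    print(spot_hourly_charge(0.45, 0.40))  # not fulfilled: market price above maximum
```

Note that raising the maximum in this model never raises the charge; it only widens the range of market prices at which the request can be fulfilled.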

So What's the Catch?

Well, because the AWS Spot market is based on supply and demand, your preferred instance type may not be available when you need it. Some instance types (such as older generation instances) are usually highly available, while other instance types that are in high demand (such as GPU instance types) may rarely have unused capacity available for the Spot market. There may also be certain times of day, days of the week, or days of the year when Spot Instance availability is reduced. For example, during peak online shopping periods such as Black Friday, businesses scale up their EC2 instance numbers to meet customer demand, which leaves fewer unused instances available for the Spot market.

If there are no Spot Instances available when you create your cluster (or if the Spot price rises above your specified maximum price), your cluster will usually fail to launch. Alternatively, there might be enough Spot Instances available for you to launch your cluster, but not enough capacity for it to scale to your maximum number of compute nodes. Finally, your running jobs may be interrupted if AWS needs to reclaim some EC2 instances, no matter how high you have set your maximum Spot price. You can compare the chance of interruption for instance types within your region here.

Spot Instances receive a two-minute interruption notice before they are reclaimed by EC2. This can happen because the current Spot price has risen above your nominated maximum Spot price, or because AWS needs the capacity back for an On-Demand customer. This means that Spot Instances are best suited to fault-tolerant, flexible workloads. For example, applications that use checkpointing to periodically save their state can be restarted where they left off. Similarly, workloads run using a workflow manager such as Nextflow can resume from where they were last interrupted (see here for more information).
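
As a rough illustration of checkpointing, the following Python sketch saves its progress to a file at regular intervals and resumes from that file when restarted. The JSON file layout, the function names, and the stop_after parameter (which simulates an interruption) are assumptions made for this example; real applications and workflow managers such as Nextflow have their own resume mechanisms.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_item": 0, "total": 0}


def save_checkpoint(path, state):
    """Write to a temporary file, then rename, so a checkpoint is never half-written."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic when both paths are on the same filesystem


def run(items, path, checkpoint_every=10, stop_after=None):
    """Sum `items`, resuming from the last checkpoint stored at `path`.

    `stop_after` is a hypothetical knob that simulates a Spot interruption
    after that many items have been processed in this run.
    """
    state = load_checkpoint(path)
    processed = 0
    while state["next_item"] < len(items):
        if stop_after is not None and processed >= stop_after:
            return None  # interrupted: work since the last checkpoint is lost
        state["total"] += items[state["next_item"]]
        state["next_item"] += 1
        processed += 1
        if state["next_item"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

Calling run a second time with the same path picks up from the last saved position, so only the items processed since the most recent checkpoint are repeated. A shorter checkpoint interval loses less work per interruption at the cost of more frequent writes.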

Tips for Using Spot Instances

  1. Ensure your workflow can handle interruptions, and/or choose an instance type with a low chance of interruption, so that you do not pay for work that is then lost.
  2. Check and compare the Spot prices for your desired EC2 instance type and set a reasonable maximum Spot price when you create your cluster. If you don't have a strict limit in mind but would like to take advantage of any cost savings over the standard On-Demand price, we recommend setting the maximum Spot price equal to the On-Demand price. Because you only ever pay the current Spot price, a higher maximum does not increase what you pay, but it significantly reduces the likelihood of your instances being reclaimed because of a price rise, while still allowing you to obtain the full savings.
  3. Be aware of any vCPU-based instance limits (see below) and other EC2 instance quotas that may affect how many instances you can run. Remember that users in other RONIN projects may also be launching clusters and using Spot Instances. If you believe there should be capacity available for your chosen Spot Instance type but your cluster repeatedly fails to launch, you may need to ask your RONIN administrators to request an increase to your account's EC2 quotas.
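
To see how a vCPU-based limit translates into a number of compute nodes, here is a small Python sketch. The function name and all figures are hypothetical; your actual quota and current usage should come from your RONIN administrators or your AWS account.

```python
def max_nodes_within_quota(vcpu_quota: int,
                           vcpus_per_node: int,
                           vcpus_in_use: int = 0) -> int:
    """Return how many more nodes of one instance type fit under a vCPU quota."""
    if vcpus_per_node <= 0:
        raise ValueError("vcpus_per_node must be positive")
    # vCPUs already used by other clusters or projects count against the quota.
    remaining = max(vcpu_quota - vcpus_in_use, 0)
    return remaining // vcpus_per_node


if __name__ == "__main__":
    # Hypothetical account: 256 vCPU quota, 64 vCPUs already in use elsewhere,
    # compute nodes with 16 vCPUs each.
    print(max_nodes_within_quota(256, 16, vcpus_in_use=64))  # 12
```

If the result is smaller than the maximum node count you set for your cluster, the cluster may stop scaling before it reaches that maximum even when Spot capacity is available.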

Overall, Spot Instances can help you achieve significant cost savings when running an auto scale cluster. Please refer to these FAQs for more information, or contact your local AWS representative if you would like to learn more about how you might use the Spot market for your workloads.