At my old neuroimaging lab, we selected machines by asking, "How much money do we have?" and figuring out the most compute/memory/storage we could get for that sum to run the most difficult applications we had. Then we kept the computer for years, until people laughed at it because it looked like something out of an '80s revival TV series, and finally everyone forgot how to log in to it.
RONIN and cloud computing represent a different way of thinking. On RONIN, you ask "What's the cheapest way to get this result? And when do I need it?" Then you figure out the right machine to get the job done at a cost and timeframe that fits within your compute budget (Figure 1). This is great because you will never fall behind on account of outdated hardware and software. But having a choice means understanding how different selections for CPU, memory, storage, and other characteristics will affect the time it takes to do your work. This blog post provides some guidelines for choosing wisely.
When you choose a machine type, you are typically choosing a virtual machine. This means that when you stop and restart it (a reboot is not enough; we're talking about the stop button), it will appear on different physical hardware. This is a great trick if your machine acts weird; sometimes a stop/restart will correct things. But it also means that at any time you can stop a machine, change the machine type to one with more memory or a different CPU, and start it again. There are some limits to this: the software will need to be compatible with the new hardware. However, if you realize after you begin that you need something different, it is easy to change.
The first thing to consider is how you will use the machine. Will you be editing some code? Installing software by downloading precompiled packages? Installing software by compiling for hours? Running an application for hours or days? Does it require a lot of memory? A GPU? Is your speed limited by reading/writing a lot of data? Is your code able to take advantage of multiple cores? If you are running an application and are not sure about the resources that your code requires, you can run it on a fairly capable machine at the same time that you run a CPU monitor such as htop or a program such as Netdata. You will want to keep an eye on how many cores it uses, how efficiently it uses them, and how much memory you use. In some cases, disk speed will be the limiting factor in your performance.
Let us call the thing (cores, GPUs, memory, disk space, disk speed) that you tend to run out of first and that prevents you from doing more work faster your "critical resource".
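If your workload is a Python function, a rough first look at its critical resource can come from the standard library alone. The sketch below is only an illustration under some assumptions: `profile` and `example_workload` are hypothetical names, `tracemalloc` counts only memory allocated by Python itself (not by native libraries), and `time.process_time` covers only the current process (not child processes). A monitor such as htop remains the better tool for anything more complicated.

```python
import os
import time
import tracemalloc


def profile(func, *args, **kwargs):
    """Run func once and report wall time, CPU time, and peak Python memory."""
    tracemalloc.start()
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args, **kwargs)
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "result": result,
        "wall_seconds": wall_used,
        "cpu_seconds": cpu_used,
        # Roughly how many cores were kept busy on average
        # (close to 1.0 for single-threaded, CPU-bound code).
        "avg_cores_busy": cpu_used / wall_used if wall_used > 0 else 0.0,
        "peak_python_mib": peak_bytes / 2**20,
        "cores_available": os.cpu_count(),
    }


def example_workload(n=200_000):
    # Hypothetical stand-in for a real analysis step.
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    report = profile(example_workload)
    for key, value in report.items():
        print(f"{key}: {value}")
```

If `avg_cores_busy` stays near 1 on a machine with many cores, extra cores will probably not help; if `peak_python_mib` approaches the machine's memory, memory is a likely candidate for your critical resource.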
Great, how do I choose a machine type?
To pick a machine type (in AWS, an "instance" type) it helps to understand how the different types are named and grouped. Disclaimer: this is how I understand it and not how AWS explains it. Each machine type has a name formed as follows:
First letter: Instance family
Number: Instance Generation
Subsequent letters: Attribute
- a - AMD processors
- g - AWS Graviton processors
- i - Intel processors
- d - Instance store volumes
- n - Network optimization
- b - Block storage optimization
- e - Extra storage or memory
- z - High frequency
After the period: Instance Size
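To make the naming scheme concrete, here is a small sketch that splits a name into the four parts described above. It follows my simplified reading of the names, not an official AWS grammar: `parse_instance_type` is a hypothetical helper, the attribute table is just the list from this post, and some real names (for example families with unusual prefixes) will not fit this pattern neatly.

```python
import re

# Attribute letters as listed in this post; anything else is reported as "unknown".
ATTRIBUTES = {
    "a": "AMD processors",
    "g": "AWS Graviton processors",
    "i": "Intel processors",
    "d": "Instance store volumes",
    "n": "Network optimization",
    "b": "Block storage optimization",
    "e": "Extra storage or memory",
    "z": "High frequency",
}

# family letters, generation digits, attribute letters, then ".size"
_PATTERN = re.compile(r"^([a-z]+)(\d+)([a-z]*)\.([a-z0-9]+)$")


def parse_instance_type(name):
    """Split a name such as 'm5d.large' into family, generation, attributes, and size."""
    match = _PATTERN.match(name.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized instance type name: {name!r}")
    family, generation, letters, size = match.groups()
    return {
        "family": family,
        "generation": int(generation),
        "attributes": [ATTRIBUTES.get(ch, "unknown") for ch in letters],
        "size": size,
    }


if __name__ == "__main__":
    print(parse_instance_type("m5d.large"))
    print(parse_instance_type("c6gn.xlarge"))
```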
Imagine there is a perfect ratio of CPU to memory to disk speed — some computer equivalent of the golden ratio. This would be a General Purpose Machine. Every type of "optimized" machine (e.g. memory-optimized, or compute-optimized) has more of the thing it is optimized for (memory, CPU) relative to the other things. If we keep all things except the instance family the same, compare a General Purpose m5.large to a Memory Optimized r5.large to a Compute Optimized c5.large (prices as of this writing in us-east-2):
We can see that the ratio of memory to vCPUs for the general purpose machine is 8 GiB to 2 vCPUs, or 4 GiB per vCPU. This is doubled for the memory optimized machine (16:2, or 8 GiB per vCPU), and halved for the compute optimized machine (4:2, or 2 GiB per vCPU).
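The same arithmetic, written out as a short sketch. The vCPU and memory figures are the ones quoted in the ratios above; prices are left out on purpose because they change over time and by region.

```python
# vCPU and memory figures for the three instance types compared in this post.
INSTANCES = {
    "m5.large": {"vcpus": 2, "memory_gib": 8},   # general purpose
    "r5.large": {"vcpus": 2, "memory_gib": 16},  # memory optimized
    "c5.large": {"vcpus": 2, "memory_gib": 4},   # compute optimized
}


def memory_per_vcpu(spec):
    """Return GiB of memory per vCPU for one instance spec."""
    return spec["memory_gib"] / spec["vcpus"]


ratios = {name: memory_per_vcpu(spec) for name, spec in INSTANCES.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:g} GiB per vCPU")
```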
TLDR: Basic algorithm for selecting a machine type:
- Profile your application to determine how much memory and how many processors you need, and if you need a lot of fast attached storage (this is relatively unusual). What is your critical resource? If you don't feel like doing this, start with general purpose.
- If YOU are the critical resource, because you are typing, coding, doing light testing and installation of precompiled packages, choose the "t" family.
- Otherwise, look within the class of machines optimized for your critical resource. Avoid instances that begin with "a" unless your RONIN administrator has made available a Graviton-compatible operating system image and your code works on it.
- Make sure the generation is the highest number in the series. The later generations are based on newer processors which are often more efficient and cost-effective than older ones.
- Look at the letters after the number. An "a" after the number means that it is an AMD processor and this is fine, but generally the modifiers mean the instances have extra things that affect the price. Avoid "n" (networking) unless you are building a very large cluster for tightly coupled workloads. Avoid "d" (disk) unless you need a lot of fast attached storage and understand how to format, mount and use it. Avoid "g" (Graviton) unless your RONIN administrator has made available a Graviton-compatible operating system image and your code works on it.
- If you need a GPU or any specialized hardware, just choose what you need and ignore the modifiers. Specialized instances are sometimes in short supply and available only in limited configurations.
- Avoid machines with the modifier "metal" unless you know you need to run directly on the hardware. They sound like something out of a superhero movie, but these machines are not virtualized and are much slower to launch.
- Now select the size (based on the number of cores that you can use effectively) and launch.
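The checklist above can be sketched as a couple of small functions. This is only one way to encode these rules of thumb: `suggest_family` and `cautions_for` are hypothetical names, the family letters come from the examples in this post ("t" burstable, "m" general purpose, "r" memory optimized, "c" compute optimized), and GPU or other specialized families are deliberately left out because the advice there is to pick what you need and ignore the modifiers.

```python
import re

# Critical resource -> family letter, following the examples in this post.
FAMILY_BY_RESOURCE = {
    "you": "t",     # typing, coding, light testing: burstable
    "memory": "r",  # memory optimized
    "cpu": "c",     # compute optimized
}
DEFAULT_FAMILY = "m"  # general purpose, if you have not profiled yet

MODIFIER_CAUTIONS = {
    "n": "network optimized: only worth it for large, tightly coupled clusters",
    "d": "instance store: you must format and mount it, and data is lost on stop",
    "g": "Graviton: needs a compatible OS image and software that runs on ARM",
}


def suggest_family(critical_resource=None):
    """Return a starting family letter for the given critical resource."""
    return FAMILY_BY_RESOURCE.get(critical_resource, DEFAULT_FAMILY)


def cautions_for(instance_type):
    """List the rule-of-thumb warnings that apply to an instance type name."""
    match = re.match(r"^([a-z]+)(\d+)([a-z]*)\.([a-z0-9]+)$", instance_type.lower())
    if match is None:
        raise ValueError(f"unrecognized instance type name: {instance_type!r}")
    family, _generation, letters, size = match.groups()
    notes = [MODIFIER_CAUTIONS[ch] for ch in letters if ch in MODIFIER_CAUTIONS]
    if family == "a":
        notes.append(MODIFIER_CAUTIONS["g"])  # the "a" family is Graviton-based
    if size == "metal":
        notes.append("metal: not virtualized and much slower to launch")
    return notes


if __name__ == "__main__":
    print(suggest_family("memory"))
    for note in cautions_for("c5n.metal"):
        print("-", note)
```

Checking that the generation is the newest available is left to you, since the sketch has no list of current generations to compare against.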
All in the Family
You start by choosing a machine type family. Some rules of thumb here: Intel and AMD processors are quite different from Graviton/ARM processors and to use the latter you will need to make sure your RONIN admin has created an operating system image for your use, and you will also need to make sure all your software runs on this platform.
Instances that begin with the letter "t" are special. They are called "burstable" instances because they are designed to provide cost-efficient performance for workloads that do not use the CPU consistently (such as web sites). You can think of these instances as being "shared" among multiple users, but in a very specific way. Each machine has a baseline utilization per vCPU, listed on this page. When the machine utilization falls below this level, you accrue CPU credits, and when it goes above this level, you spend them. It's a great way to save money if you are the critical resource: if you are writing code, doing very light installations, or testing code that is not very compute-intensive. However, it means that if you consistently go above the baseline utilization (for example, by running several hours of computationally intense workflows, or compiling an enormous package with spack), eventually you will run out of credits and the machine will be throttled back to its baseline and feel super slow.
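To get a feel for how credits drain, here is a toy hourly simulation. It is a simplification, not AWS's exact accounting: one credit is taken to be one vCPU at 100% for one minute, the balance is updated once per hour, and the default numbers (2 vCPUs, a 20% baseline per vCPU, a cap of 576 credits) are illustrative assumptions that you should check against the baseline table for your own instance type. `simulate_credits` is a hypothetical name.

```python
def simulate_credits(utilization_pct_by_hour, vcpus=2, baseline_pct=20,
                     start_credits=0.0, max_credits=576.0):
    """Track an hourly CPU credit balance for a burstable instance (no Unlimited mode).

    Returns a list of (balance_after_hour, throttled) tuples, one per hour.
    """
    # Credits earned per hour at the baseline (assumed: 1 credit = 1 vCPU-minute at 100%).
    earned_per_hour = baseline_pct * vcpus * 60 / 100
    balance = start_credits
    history = []
    for pct in utilization_pct_by_hour:
        spent = pct * vcpus * 60 / 100
        balance += earned_per_hour - spent
        # A negative balance means demand exceeded the available credits,
        # so the machine would have been held back during this hour.
        throttled = balance < 0
        balance = min(max(balance, 0.0), max_credits)
        history.append((balance, throttled))
    return history


if __name__ == "__main__":
    # Four idle hours, then two hours of a fully loaded compile.
    hours = [0, 0, 0, 0, 100, 100]
    for hour, (balance, throttled) in enumerate(simulate_credits(hours), start=1):
        status = "throttled" if throttled else "ok"
        print(f"hour {hour}: {balance:.0f} credits, {status}")
```

In this toy run, four idle hours bank enough credits for exactly one hour at full load, and the second busy hour is throttled.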
Sounds bad, right? But it's worse. By default, AWS enables "Unlimited mode" on t3 and higher instance types within an account. This means that when you exceed your credits, you automatically start spending to prevent slowing down — at a flat cost of up to 5 times the instance price in addition to the instance price. We disable Unlimited mode for you in RONIN because we care.
Within a family, a number indicates the processor generation. As long as the processor will work with your operating system and software, select the highest number. Newer processors are more cost-effective than older ones and will normally save money.
The default storage attached to RONIN machines is the "Drive Storage", which in AWS terminology is Elastic Block Store (EBS). These storage volumes can be attached to and detached from machines as needed. Instance types with a "d" modifier also come with local attached storage of various types (look at the Storage Type column). The attached storage can be very fast, and/or it may be very large. A big drawback of these attached drives is that when your machine stops, is terminated, or if the disk just fails, you will lose the data on the drives. For this reason, they are (rather romantically) called ephemeral drives. You will need to format and mount these drives to use them.
GPUs and Other Special Stuff
Specialized processor types (such as GPUs, FPGAs) often come with specific modifiers for expected workloads. In this case, don't worry too much about avoiding extra drive storage or networking capabilities, because the processor type you need may not be available without these features.
Now for Something Completely Different
Once you have gotten your workflow running on the perfect instance, what happens when you are finished and want to write some code to analyze your results? No worries — you just need to stop your machine and click on the little pencil next to the instance type (see below), and switch from a massive compute optimized instance to, perhaps, a general purpose "t" instance.
The perfect machine is one that is flexible enough for you to use to do your research within budget, on time, without worries. There is a perfect machine in RONIN for you.
Note: Feature image is inspired by the Scout Association board game, "The right tool for the job".