Troubleshooting Tips for Slurm and AWS ParallelCluster

In this blog post, we share some tips to help you troubleshoot the most common issues researchers come across when using Slurm and AWS ParallelCluster.

We at RONIN still think it’s magic every time you push a few buttons and get an auto-scaling cloud supercomputer, preconfigured with the spack package manager, in about 8 minutes. But sometimes things don’t work quite as expected. Here we describe the most common issues that users encounter and how to fix them.

Setting Slurm memory directives

Are you trying to set the memory of your job in your Slurm script and encountering the following error?

sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available

By default, ParallelCluster does not support Slurm memory directives (e.g. --mem), so if you try to set the required memory per node, the affected nodes will automatically go into a 'drained' state. Because you have the whole cluster to yourself, you can usually just omit memory directives from your script altogether, provided you intend to run a single job per node, or you know that the combined memory requirements of multiple jobs won't exceed the available RAM on a single node. If you do wish to specify the required memory for a job, follow the steps below.
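
By "memory directives" we mean options like the following in a batch script (the values here are just examples, and the same applies if you pass them on the sbatch command line):

#SBATCH --mem=32G            # memory required per node
#SBATCH --mem-per-cpu=4G     # or, memory required per allocated CPU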

Step 1

Determine the RealMemory available on your compute instances by running the following command on a compute node:

/opt/slurm/sbin/slurmd -C

You should see something like this:

RealMemory=491805

Note: You will notice that the RealMemory available on the compute node is a little less than the memory you will see advertised when selecting your compute instance type. This is because some memory needs to be reserved for the operating system and the scheduler to work correctly.
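
If you would rather not SSH into a compute node, one convenient alternative is to launch the same command from the head node via srun, which runs it on a compute node for you (this assumes your queue can start at least one compute node, and it may wait while a node spins up):

srun -N 1 /opt/slurm/sbin/slurmd -C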

Step 2

On the head node, edit /opt/slurm/etc/slurm.conf and add NodeName=DEFAULT RealMemory=[RealMemory for compute nodes] BEFORE the include slurm_parallelcluster_nodes.conf line.

Note: Ideally we would just use the RealMemory value reported by /opt/slurm/sbin/slurmd -C, but RealMemory can vary slightly between instances of the same instance type. We recommend rounding RealMemory down a little (e.g. to the nearest GB) to avoid any potential errors when the cluster scales out and launches new instances. If you specify a value that is higher than the actual RealMemory of a node, that node will automatically go into a 'drained' state.
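
For the example above (RealMemory=491805), you might round down to 491520 (480 x 1024 MB), so the relevant part of /opt/slurm/etc/slurm.conf would end up looking something like this (the exact rounded value is up to you):

NodeName=DEFAULT RealMemory=491520
include slurm_parallelcluster_nodes.conf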

Step 3

Implement the changes by restarting slurmctld:

sudo systemctl restart slurmctld
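
If you want to confirm that the service came back up cleanly before moving on, you can check its status with:

sudo systemctl status slurmctld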

You should see that the memory is now configured when you run:

scontrol show nodes
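
If you have a lot of nodes, scontrol show nodes prints a full record for each one, so you may find it easier to filter the output down to just the memory values, for example:

scontrol show nodes | grep RealMemory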

You can now successfully specify Slurm memory directives in your scripts; just ensure that you don't request more memory than the value you added to the configuration file in Step 2.
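
For example, a minimal job script that requests memory might look something like the following (the job name, resource values and command are placeholders for your own):

#!/bin/bash
#SBATCH --job-name=memtest        # example job name
#SBATCH --ntasks=1
#SBATCH --mem=32G                 # must not exceed the RealMemory set in Step 2
srun ./my_program                 # replace with your actual command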

Getting nodes out of a 'drained' state

If you accidentally used a Slurm memory directive before doing the steps above, or specified RealMemory incorrectly, and now have nodes in a 'drained' state, follow the steps below to return them to the 'idle' state:

Step 1

Get the name of your 'drained' node (listed under the NODELIST column; node names are based on the node's private IP address, e.g. ip-10-255-6-163) by running the sinfo command.
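
If your cluster has many nodes, you can also ask sinfo to list only the drained ones, or to show the reason each node was drained:

sinfo -t drain        # list only nodes in a drained/draining state
sinfo -R              # list the reason each node was set down or drained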

Step 2

Change the node back to an 'idle' state by running:

sudo /opt/slurm/bin/scontrol update nodename=ip-10-255-6-163 state=idle

Remember to replace the example node name (ip-10-255-6-163) with the one you found under NODELIST in Step 1.

Step 3

Ensure that you have set RealMemory correctly (or remove any Slurm memory directives from your script) and submit your job again. It should be able to run successfully.

Note: There may be rare cases where a different issue with Slurm or ParallelCluster can't easily be resolved. In these situations, it is often best to package your cluster and create a new cluster from the package.