Troubleshooting Tips for Slurm and AWS ParallelCluster
In this blog post we share some tips to help you troubleshoot the most common issues researchers come across when using Slurm and ParallelCluster.
We at RONIN still think it’s magic every time you push a few buttons and get an auto-scaling cloud supercomputer, preconfigured with the spack package manager, in about 8 minutes. But sometimes things don’t work quite as expected. Here we describe the most common issues that users encounter and how to fix them.
Setting Slurm memory directives
Are you trying to set the memory of your job in your Slurm script and encountering the following error?
sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
By default, ParallelCluster does not support Slurm memory directives (e.g. --mem), so if you try to set the required memory per node, the affected nodes will automatically go into a 'drained' state. Because you have the whole cluster to yourself, you can usually just omit any memory directives from your script: this works if you intend to run a single job per node, or if you know that the maximum memory requirements of multiple jobs total less than the available RAM on a single node. Otherwise, if you do wish to specify the required memory for a job, follow the steps below.
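For illustration, here is a minimal batch script that would trigger the error above on a default ParallelCluster; the job name and memory value are hypothetical:

```shell
#!/bin/bash
# Hypothetical job script: on a default ParallelCluster, the --mem
# directive below causes the submission to fail, because memory is not
# configured as a schedulable resource out of the box.
#SBATCH --job-name=mem-test
#SBATCH --nodes=1
#SBATCH --mem=8G

srun hostname
```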
Step 1: Find the amount of RealMemory available on the compute instance. We can get this by running the following command on a compute node:

/opt/slurm/sbin/slurmd -C

The output should include a RealMemory value (in MB).
Note: You will notice that the RealMemory available on the compute node is a little less than the memory advertised for your compute instance type. This is because some memory needs to be reserved for the system and scheduler to work correctly.
Step 2: Edit /opt/slurm/etc/slurm.conf and add the following line BEFORE the cluster-specific include line:

NodeName=DEFAULT RealMemory=[RealMemory for compute nodes]
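As an example, if slurmd -C reported a RealMemory of 15757 MB on your compute nodes, the added fragment might look like the following; the value shown is hypothetical and rounded down to the nearest GB:

```shell
# Fragment of /opt/slurm/etc/slurm.conf -- hypothetical RealMemory value.
# This line must appear BEFORE the cluster-specific include line.
NodeName=DEFAULT RealMemory=15360
```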
Note: Ideally we would just use the RealMemory value we got from /opt/slurm/sbin/slurmd -C, but RealMemory can differ even between instances of the same instance type. We recommend rounding RealMemory down slightly (e.g. to the nearest GB) to avoid any potential errors when the cluster scales and launches new instances. If you specify a value that is higher than the actual RealMemory, the node will automatically go into a 'drained' state.
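To illustrate the rounding, here is a minimal sketch using shell integer arithmetic; the RealMemory value of 15757 MB is hypothetical:

```shell
# Hypothetical RealMemory (in MB) as reported by /opt/slurm/sbin/slurmd -C
REAL_MEM=15757

# Integer division rounds down to the nearest GB (1024 MB),
# leaving a small safety margin below the actual value
ROUNDED=$(( REAL_MEM / 1024 * 1024 ))

echo "$ROUNDED"   # prints 15360
```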
Step 3: Apply the changes by restarting the Slurm controller:
sudo systemctl restart slurmctld
You should see that the memory is now configured when you run:
scontrol show nodes
You can now successfully specify Slurm memory directives in your scripts; just ensure that you don't request more memory than the value you added to the configuration file in Step 2.
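Once the configuration is in place, a memory directive like the one in this sketch will be honoured, as long as it stays at or below the RealMemory you configured; the job name, memory value, and program are hypothetical:

```shell
#!/bin/bash
# Hypothetical job script: --mem now works because RealMemory is
# configured in slurm.conf. Keep the request at or below that value.
#SBATCH --job-name=mem-test
#SBATCH --nodes=1
#SBATCH --mem=12G

srun ./my_program   # hypothetical executable
```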
Getting nodes out of a 'drained' state
If you accidentally used a Slurm memory directive before completing the steps above, or specified RealMemory incorrectly, and now have nodes in a 'drained' state, follow the steps below to return your nodes to the 'idle' state:
Step 1: Get the IP address of your 'drained' node (under the NODELIST column) by running the command:

sinfo
Step 2: Change the node back to an 'idle' state by running:
sudo /opt/slurm/bin/scontrol update nodename=ip-10-255-6-163 state=idle
Remember to replace the example IP with the IP address you got in Step 1.
Step 3: Ensure that you have set RealMemory correctly (or remove any Slurm memory directives from your script) and submit your job again. It should now run successfully.
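If several nodes end up drained at once, you can reset them all in one pass rather than one at a time. This sketch assumes the Slurm paths used above and uses sinfo's format options to list only drained node names; it requires a live cluster to run:

```shell
# sinfo -h suppresses the header, -t drain filters drained nodes,
# and -o '%n' prints one node hostname per line.
for node in $(sinfo -h -t drain -o '%n'); do
    # Return each drained node to service
    sudo /opt/slurm/bin/scontrol update nodename="$node" state=idle
done
```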
Note: There may be rare instances where there is a different issue with Slurm and ParallelCluster that can't easily be resolved. In these situations it is often best to package your cluster and create a new cluster from the package.