Dear all,

I'm just starting to get used to Slurm and play around with it in a small test environment within our old cluster.
For our next system we will probably have to abandon our current exclusive node access policy in favor of a shared policy, i.e. jobs from different users will then run side by side on the same node at the same time. To prevent the jobs from interfering with each other, I have set both ConstrainCores=yes and ConstrainRAMSpace=yes in cgroup.conf. This works as expected for limiting the memory of the processes to the value requested at job submission (e.g. via the --mem=... option).

However, I've noticed that ConstrainRAMSpace=yes also caps the available page cache, for which the Linux kernel normally exploits any unused memory in a flexible way. This may have a significant performance impact, as we have quite a number of I/O-demanding applications (predominantly read operations) that are known to benefit a lot from page caching.

Here is a small example to illustrate the issue. The job writes a 16 GB file to a local scratch file system, measures the amount of data cached in memory and then reads back the file previously written.

$ cat job.slurm
#!/bin/bash
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:10:00

# Get amount of data cached in memory before writing the file
cached1=`awk '$1=="Cached:" {print $2}' /proc/meminfo`

# Write 16 GB file to local scratch SSD
dd if=/dev/zero of=$SCRATCH/testfile count=16 bs=1024M

# Get amount of data cached in memory after writing the file
cached2=`awk '$1=="Cached:" {print $2}' /proc/meminfo`

# Print difference of data cached in memory
echo -e "\nIncreased cached data by $(((cached2-cached1)/1000000)) GB\n"

# Read the file previously written
dd if=$SCRATCH/testfile of=/dev/null count=16 bs=1024M
$

For reference, this is the result *without* ConstrainRAMSpace=yes set in cgroup.conf, submitted with `sbatch --mem=2G --gres=scratch:16 job.slurm`:

--- snip ---
16+0 records in
16+0 records out
17179869184 bytes (17 GB) copied, 10.9839 s, 1.6 GB/s

Increased cached data by 16 GB

16+0 records in
16+0 records out
17179869184 bytes (17 GB) copied, 5.03225 s, 3.4 GB/s
--- snip ---

Note that 16 GB of data is cached and the read performance is 3.4 GB/s, as the data is actually read from the page cache.

And this is the result *with* ConstrainRAMSpace=yes set in cgroup.conf, submitted with the very same command:

--- snip ---
16+0 records in
16+0 records out
17179869184 bytes (17 GB) copied, 13.3163 s, 1.3 GB/s

Increased cached data by 1 GB

16+0 records in
16+0 records out
17179869184 bytes (17 GB) copied, 11.1098 s, 1.5 GB/s
--- snip ---

Now only 1 GB of data has been cached (roughly the 2 GB requested for the job minus the ~1 GB buffer allocated by dd), resulting in a read performance degradation to 1.5 GB/s (compared to 3.4 GB/s above).

Finally, this is the result *with* ConstrainRAMSpace=yes set in cgroup.conf and the job submitted with `sbatch --mem=18G --gres=scratch:16 job.slurm`:

--- snip ---
16+0 records in
16+0 records out
17179869184 bytes (17 GB) copied, 11.0601 s, 1.6 GB/s

Increased cached data by 16 GB

16+0 records in
16+0 records out
17179869184 bytes (17 GB) copied, 5.01643 s, 3.4 GB/s
--- snip ---

This is almost the same result as in the unconstrained case (i.e. without ConstrainRAMSpace=yes set in cgroup.conf), as the amount of memory requested for the job (18 GB) is large enough to allow the file to be fully cached in memory.

I do not think this is an issue with Slurm itself but rather how cgroups are supposed to work: the memory controller charges page cache pages against the same limit as the processes' own memory.
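For completeness, here is a minimal sketch of how one could verify this from inside a running job. It assumes a cgroup v1 memory hierarchy and the usual Slurm cgroup layout; the exact paths and file names will differ under cgroup v2 or a different hierarchy, and depending on the setup the effective limit may be set one or two levels further up (job or step level) rather than at the leaf cgroup:

# Memory cgroup of the current process, e.g. /slurm/uid_<uid>/job_<jobid>/...
cgpath=`awk -F: '$2 == "memory" {print $3}' /proc/self/cgroup`

# Hard limit enforced via ConstrainRAMSpace (in bytes)
cat /sys/fs/cgroup/memory${cgpath}/memory.limit_in_bytes

# Page cache currently charged to this job's cgroup (in bytes)
awk '$1 == "cache" {print $2}' /sys/fs/cgroup/memory${cgpath}/memory.stat

If the "cache" value grows only up to roughly the requested --mem value while the file is written, that matches the behaviour shown in the second run above.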
However, I wonder how others cope with this. Maybe we have to teach our users to also consider page cache when requesting a certain amount of memory for their jobs?

Any comment or idea would be highly appreciated. Thank you in advance.

Best regards
Jürgen

--
Jürgen Salk
Scientific Software & Compute Services (SSCS)
Kommunikations- und Informationszentrum (kiz)
Universität Ulm
Telefon: +49 (0)731 50-22478
Telefax: +49 (0)731 50-22471