Hi John, Mark,

We use a SPANK plugin,
https://gitlab.com/greg.wickham/slurm-spank-private-tmpdir (derived from
other authors' work, but modified for functionality required on site).

It can bind tmpfs mount points to the user's cgroup allocation; additionally,
mount options can be provided (e.g. limit the memory by size or by percentage,
as supported by tmpfs(5)).

More information is in the README.md
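
For illustration only, here is a stripped-down sketch of the general SPANK
approach (this is not the plugin above; see its README.md for the real
behaviour and options). The /job_tmp path and the hard-coded 4g cap are
placeholders, and cleanup (unmounting in an epilog or exit hook) is omitted.
The point is that the job's own tasks fault the tmpfs pages in, so the memory
is charged to the job's cgroup:

/* Sketch only: placeholder mount point and size; see the plugin's README.md
 * for its real options.                                                     */
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <slurm/spank.h>

SPANK_PLUGIN(tmpfs_sketch, 1);

/* Called per task in the remote (slurmstepd) context, before privileges drop. */
int slurm_spank_task_init_privileged(spank_t sp, int ac, char **av)
{
    uint32_t jobid;
    uid_t uid;
    gid_t gid;
    char path[64], opts[96];
    struct stat st, st_parent;

    if (spank_remote(sp) != 1)
        return ESPANK_SUCCESS;
    if (spank_get_item(sp, S_JOB_ID, &jobid) != ESPANK_SUCCESS ||
        spank_get_item(sp, S_JOB_UID, &uid)  != ESPANK_SUCCESS ||
        spank_get_item(sp, S_JOB_GID, &gid)  != ESPANK_SUCCESS)
        return ESPANK_ERROR;

    snprintf(path, sizeof(path), "/job_tmp/%u", jobid);
    if (mkdir("/job_tmp", 0755) && errno != EEXIST)
        return ESPANK_ERROR;
    if (mkdir(path, 0700) && errno != EEXIST)
        return ESPANK_ERROR;

    /* Already mounted by an earlier task on this node? Then nothing to do. */
    if (!stat(path, &st) && !stat("/job_tmp", &st_parent) &&
        st.st_dev != st_parent.st_dev)
        return ESPANK_SUCCESS;

    /* size=, uid= and gid= are tmpfs(5) options; a percentage such as
     * size=25% also works. */
    snprintf(opts, sizeof(opts), "size=4g,mode=700,uid=%u,gid=%u",
             (unsigned) uid, (unsigned) gid);
    if (mount("tmpfs", path, "tmpfs", MS_NOSUID | MS_NODEV, opts)) {
        slurm_error("tmpfs_sketch: mount of %s failed", path);
        return ESPANK_ERROR;
    }
    return ESPANK_SUCCESS;
}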

  -Greg

On 05/04/2022, 23:17, "slurm-users" <slurm-users-boun...@lists.schedmd.com> 
wrote:

I've thought-experimented this in the past, wanting to do the same thing, but
haven't found any way to get a /dev/shm or a tmpfs into a job's cgroup so that
it is accounted against the job's allocation. The best I have come up with is
creating a per-job tmpfs from a prolog, removing it in an epilog, and setting
its size to some amount of memory that at least puts some restriction on how
much damage the job could do. Another alternative is to only allow access to a
memory filesystem if the job request is exclusive and takes the whole node.
Crude, but effective, at least to the point of preventing one job from killing
others. If you happen to find a real solution, please post it :)
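
Roughly, all the prolog/epilog pair does is a tmpfs mount and umount. As an
untested sketch (the /job_shm path and the 25% cap are arbitrary, and in
practice the prolog and epilog are shell scripts that call mount(8) and
umount(8) directly, using SLURM_JOB_ID from their environment):

/* Untested sketch of the per-job tmpfs mount/unmount a prolog/epilog would do. */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *jobid = getenv("SLURM_JOB_ID");  /* set in prolog/epilog env */
    char path[128];

    if (argc != 2 || !jobid) {
        fprintf(stderr, "usage: %s mount|umount (with SLURM_JOB_ID set)\n",
                argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/job_shm/%s", jobid);

    if (!strcmp(argv[1], "mount")) {            /* prolog side */
        if (mkdir("/job_shm", 0755) && errno != EEXIST) { perror("mkdir"); return 1; }
        if (mkdir(path, 01777) && errno != EEXIST)      { perror("mkdir"); return 1; }
        /* size=25% caps the filesystem at a quarter of RAM, per tmpfs(5). */
        if (mount("tmpfs", path, "tmpfs", MS_NOSUID | MS_NODEV,
                  "size=25%,mode=1777")) {
            perror("mount");
            return 1;
        }
    } else {                                    /* epilog side */
        if (umount(path) && errno != EINVAL) { perror("umount"); return 1; }
        rmdir(path);                            /* best-effort cleanup */
    }
    return 0;
}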

griznog

On Mon, Apr 4, 2022 at 10:19 AM Mark Coatsworth
<mark.coatswo...@vectorinstitute.ai> wrote:
Hi all,

We have a GPU cluster (Slurm 19.05.3) that typically runs large PyTorch jobs 
dependent on shared memory (/dev/shm). When our machines get busy, we often run 
into a problem where one job exhausts all the shared memory on a system, 
causing any other jobs landing there to fail immediately.

We're trying to figure out a good way to manage this resource. I know that 
Slurm counts shared memory as part of a job's total memory allocation, so we 
could use cgroups to OOM kill jobs that exceed this. But that doesn't prevent a 
user from just making a large request and exhausting it all anyway.

Does anybody have any thoughts or experience with setting real limits on shared
memory, and either swapping it out or killing the job when the limit is
exceeded?
One thought we had was to use a new generic resource (GRES). This is pretty 
easy to add in the configuration, but seems like it would be a huge task to 
write a plugin that actually enforces it.

Is this something where the Job Container plugin might be useful?

Any thoughts or suggestions would be appreciated,

Mark
