Since they took the patch, it's not needed if you're using the version they
fixed. However, it looks like they haven't released that version yet. The patch
is to slurmd; you don't need it on the controller. If you're only having
problems with some systems, you can put it just on those systems.
To see the specific GPU allocated, I think this will do it:
scontrol show job -d | grep -E "JobId=| GRES"
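On the Slurm versions I have looked at, the detailed (-d) output includes a per-node
line with the GRES index; it looks something like this (job name, node, and numbers
are just an example):

JobId=12345 JobName=bash
   Nodes=gpu01 CPU_IDs=0-7 Mem=16384 GRES=gpu:1(IDX:0)

The IDX value is the index of the GPU device assigned on that node; the leading space
in the grep pattern is what picks up that indented line.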
From: slurm-users on behalf of Mahendra Paipuri
Sent: Sunday, January 7, 2024 3:33 PM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] GPU devices
See my comments on https://bugs.launchpad.net/bugs/2050098. There's a pretty
simple fix in Slurm. As far as I can tell, there's nothing actually wrong with the
Slurm code, but it's using an option that it doesn't need, and that option seems to
be what causes the trouble in the kernel.
__
With cgroup v2, AllowedSwapSpace is implemented using memory.swap.max.
In v1 it is the memsw limit (memory.memsw.limit_in_bytes), which applies to the
total of memory plus swap. In v2, memory.swap.max limits swap only.
Slurm adds the job memory size to the AllowedSwapSpace allowance. That is appropriate
for v1, since the limit there is on the sum, but it is not appropriate for v2.
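To make the difference concrete, here is the arithmetic with made-up numbers
(an 8 GB job, AllowedSwapSpace=10):

cgroup v1:  memory.limit_in_bytes       = 8 GB
            memory.memsw.limit_in_bytes = 8 GB + 10% of 8 GB = 8.8 GB   (memory + swap combined)
cgroup v2:  memory.max      = 8 GB
            memory.swap.max = 8.8 GB if the same value is written, which allows 8.8 GB
                              of swap by itself instead of the 0.8 GB the setting was
                              meant to permit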
It appears that when you set AllowedSwapSpace in cgroup.conf, that percentage is
added to the requested memory amount. So to set memory.swap.max to 0, you have to
set AllowedSwapSpace to -100. The arithmetic is imprecise and parts of it seem to
use signed floats, so you can get unexpected results.
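As a concrete example with made-up numbers: for a job that requests 4 GB,
AllowedSwapSpace=0 still gives memory.swap.max = 4 GB + 0% of 4 GB = 4 GB, so the
job can use up to 4 GB of swap. Only AllowedSwapSpace=-100 makes the calculation
come out to 4 GB - 100% of 4 GB = 0.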