It looks like you are missing some of the slurm.conf entries related to enforcing the cgroup restrictions. I would go through the list here and verify/adjust your configuration:
https://slurm.schedmd.com/cgroup.conf.html#OPT_/etc/slurm/slurm.conf

Best,

-Sean

On Thu, Oct 27, 2022 at 1:04 PM Dominik Baack <dominik.ba...@cs.uni-dortmund.de> wrote:

> Hi,
>
> yes, ConstrainDevices is set:
>
> ###
> # Slurm cgroup support configuration file
> ###
> CgroupAutomount=yes
> #
> #CgroupMountpoint="/sys/fs/cgroup"
> ConstrainCores=yes
> ConstrainDevices=yes
> ConstrainRAMSpace=yes
> #
> #
>
> I attached the slurm configuration file as well.
>
> Cheers
> Dominik
>
> On 27.10.2022 at 17:57, Sean Maxwell wrote:
>
> Hi Dominik,
>
> Do you have ConstrainDevices=yes set in your cgroup.conf?
>
> Best,
>
> -Sean
>
> On Thu, Oct 27, 2022 at 11:49 AM Dominik Baack <dominik.ba...@cs.uni-dortmund.de> wrote:
>
>> Hi,
>>
>> We are in the process of setting up SLURM on some DGX A100 nodes. We
>> are experiencing the problem that all GPUs are available to users, even
>> for jobs where only one should be assigned.
>>
>> The request seems to be forwarded correctly to the node: at least
>> CUDA_VISIBLE_DEVICES is set to the correct ID, but it is ignored by the
>> rest of the system.
>>
>> Cheers
>> Dominik Baack
>>
>> Example:
>>
>> baack@gwkilab:~$ srun --gpus=1 nvidia-smi
>> Thu Oct 27 17:39:04 2022
>> +-----------------------------------------------------------------------------+
>> | NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
>> |-------------------------------+----------------------+----------------------+
>> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>> |                               |                      |               MIG M. |
>> |===============================+======================+======================|
>> |   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
>> | N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
>> |                               |                      |             Disabled |
>> +-------------------------------+----------------------+----------------------+
>> |   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
>> | N/A   28C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
>> |                               |                      |             Disabled |
>> +-------------------------------+----------------------+----------------------+
>> |   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
>> | N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
>> |                               |                      |             Disabled |
>> +-------------------------------+----------------------+----------------------+
>> |   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
>> | N/A   29C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
>> |                               |                      |             Disabled |
>> +-------------------------------+----------------------+----------------------+
>> |   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
>> | N/A   34C    P0    57W / 400W |      0MiB / 40536MiB |      0%      Default |
>> |                               |                      |             Disabled |
>> +-------------------------------+----------------------+----------------------+
>> |   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
>> | N/A   31C    P0    55W / 400W |      0MiB / 40536MiB |      0%      Default |
>> |                               |                      |             Disabled |
>> +-------------------------------+----------------------+----------------------+
>> |   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
>> | N/A   31C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
>> |                               |                      |             Disabled |
>> +-------------------------------+----------------------+----------------------+
>> |   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
>> | N/A   32C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
>> |                               |                      |             Disabled |
>> +-------------------------------+----------------------+----------------------+
>>
>> +-----------------------------------------------------------------------------+
>> | Processes:                                                                  |
>> |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
>> |        ID   ID                                                   Usage      |
>> |=============================================================================|
>> |  No running processes found                                                 |
>> +-----------------------------------------------------------------------------+
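
For anyone following the pointer at the top of this thread, the slurm.conf section of the cgroup.conf documentation lists entries along the following lines. This is only a sketch based on that documentation, not a copy of the attached slurm.conf (which is not reproduced in the thread). The GresTypes line is an extra assumption for a GPU node and is not part of that list. Plugin names and options should be checked against the documentation for the installed Slurm version.

```
# slurm.conf (sketch): entries that activate the cgroup enforcement plugins
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
# Optional, only needed for gathering accounting metrics through cgroups
JobAcctGatherType=jobacct_gather/cgroup
# Assumption: the GPUs on these nodes are scheduled as generic resources
GresTypes=gpu
```

ConstrainDevices=yes in cgroup.conf restricts a job to the GRES devices allocated to it, so it probably only takes effect once task/cgroup is active and the GPUs are defined as GRES on the node. Without that, CUDA_VISIBLE_DEVICES may still be set while every device stays visible, which would match the nvidia-smi output above.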