Hi Dominik, Do you have ConstrainDevices=yes set in your cgroup.conf?
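
For context, CUDA_VISIBLE_DEVICES is only an environment variable, and nvidia-smi does not honor it, so it will keep listing every GPU unless access to the device files themselves is restricted through the devices cgroup. A minimal sketch of the settings that usually have to line up for that is below. The gres.conf line in particular is an assumption about a typical eight-GPU node, not your actual configuration, so adapt it to your site:

```
# slurm.conf (excerpt; assumed values, adapt to your cluster)
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
GresTypes=gpu

# cgroup.conf
ConstrainDevices=yes

# gres.conf on the GPU node (assumption: GPUs exposed as /dev/nvidia0 to /dev/nvidia7)
Name=gpu File=/dev/nvidia[0-7]
```

After changing these files, slurmd on the node typically needs a restart. Rerunning `srun --gpus=1 nvidia-smi` should then list only the allocated GPU.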
Best,

-Sean

On Thu, Oct 27, 2022 at 11:49 AM Dominik Baack <dominik.ba...@cs.uni-dortmund.de> wrote:

> Hi,
>
> We are in the process of setting up SLURM on some DGX A100 nodes. We
> are experiencing the problem that all GPUs are available for users, even
> for jobs where only one should be assigned.
>
> It seems the requirement is forwarded correctly to the node, at least
> CUDA_VISIBLE_DEVICES is set to the correct id only discarded by the rest
> of the system.
>
> Cheers
> Dominik Baack
>
> Example:
>
> baack@gwkilab:~$ srun --gpus=1 nvidia-smi
> Thu Oct 27 17:39:04 2022
> +-----------------------------------------------------------------------------+
> | NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
> |-------------------------------+----------------------+----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> |                               |                      |               MIG M. |
> |===============================+======================+======================|
> |   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
> | N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
> |                               |                      |             Disabled |
> +-------------------------------+----------------------+----------------------+
> |   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
> | N/A   28C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
> |                               |                      |             Disabled |
> +-------------------------------+----------------------+----------------------+
> |   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
> | N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
> |                               |                      |             Disabled |
> +-------------------------------+----------------------+----------------------+
> |   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
> | N/A   29C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
> |                               |                      |             Disabled |
> +-------------------------------+----------------------+----------------------+
> |   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
> | N/A   34C    P0    57W / 400W |      0MiB / 40536MiB |      0%      Default |
> |                               |                      |             Disabled |
> +-------------------------------+----------------------+----------------------+
> |   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
> | N/A   31C    P0    55W / 400W |      0MiB / 40536MiB |      0%      Default |
> |                               |                      |             Disabled |
> +-------------------------------+----------------------+----------------------+
> |   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
> | N/A   31C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
> |                               |                      |             Disabled |
> +-------------------------------+----------------------+----------------------+
> |   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
> | N/A   32C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
> |                               |                      |             Disabled |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                                  |
> |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
> |        ID   ID                                                   Usage      |
> |=============================================================================|
> |  No running processes found                                                 |
> +-----------------------------------------------------------------------------+