We have a Slurm cluster with a number of nodes, some of which have more than one GPU. Users select how many or which GPUs they want with srun's "--gres" option. Nothing fancy here, and in general this works as expected. But starting a few days ago we've had a problem on one machine. A specific user started a single-GPU session with srun, and nvidia-smi reported one GPU, as expected. But about two hours later, he could suddenly see all of the node's GPUs with nvidia-smi. To be clear, this is all from the interactive session provided by Slurm. He did not ssh to the machine. He's not running Docker. Nothing odd as far as we can tell.
A big problem is that I've been unable to reproduce this. I'm confident that what this user is telling me is correct, but I can't do much until/unless I can reproduce it. My general question for this group is: how should I debug this if/when I am able to reproduce it? My specific question is: how does Slurm limit user access to GPUs? I understand it's done with cgroups, and I think I can see how it works for CPUs (i.e. /sys/fs/cgroup/cpuset/slurm/uid_*/job_*/cpuset.cpus), but I can't figure out where or how GPUs are assigned to cgroups (if that's even the correct phrase). I did not find anything that looked interesting under /sys/fs/cgroup/devices/slurm/uid_*/job_*. For example:

$ find /sys/fs/cgroup/devices/slurm/uid_*/job_* -name devices.list -exec cat {} \;
a *:* rwm
a *:* rwm
a *:* rwm
a *:* rwm
a *:* rwm
a *:* rwm

Thanks much,
Randy
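
P.S. For what it's worth, here is roughly what I plan to run from inside the user's next single-GPU srun session if this happens again. This is only a sketch based on my guess that the cgroup v1 devices controller and the task/cgroup plugin's ConstrainDevices setting are what's involved, so please correct me if the paths or checks are off:

# run from within the srun-provided shell on the affected node
$ echo "job=$SLURM_JOB_ID visible=$CUDA_VISIBLE_DEVICES"
$ grep devices /proc/self/cgroup      # which devices cgroup this shell landed in
$ ls -l /dev/nvidia[0-9]*             # GPU device nodes and their major/minor numbers
$ find /sys/fs/cgroup/devices/slurm/uid_$(id -u)/job_$SLURM_JOB_ID -name devices.list -exec cat {} \;

My (possibly wrong) understanding is that if device constraints were being applied I'd see specific device entries in devices.list rather than "a *:* rwm", so I'd like to capture that output both before and after the extra GPUs "reappear".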