Hi Randy!

> We have a Slurm cluster with a number of nodes, some of which have more
> than one GPU. Users select how many or which GPUs they want with srun's
> "--gres" option. Nothing fancy here, and in general this works as
> expected. But starting a few days ago we've had problems on one machine.
> A specific user started a single-GPU session with srun, and nvidia-smi
> reported one GPU, as expected. But about two hours later, he could
> suddenly see all GPUs with nvidia-smi. To be clear, this is all from the
> interactive session provided by Slurm. He did not ssh to the machine.
> He's not running Docker. Nothing odd as far as we can tell.
>
> A big problem is that I've been unable to reproduce the issue. I'm
> confident that what this user is telling me is correct, but I can't do
> much until/unless I can reproduce it.
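One thing that might help narrow this down even before you can reproduce
it: the next time it happens, check whether the job step's device cgroup
still restricts access to the other GPUs. Assuming task/cgroup with
ConstrainDevices=yes on cgroup v1, the device whitelist typically lives
under a path like the one below (the uid/job/step numbers are
placeholders, and the exact layout can differ with your cgroup mounts):

  # on the compute node, while the job is running (placeholder IDs)
  $ cat /sys/fs/cgroup/devices/slurm/uid_1000/job_12345/step_0/devices.list

If the GPU restriction is there right after the job starts but gone a
couple of hours later, that points at something on the host rewriting the
cgroup hierarchy rather than at anything the user is doing.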
I think this kind of behavior has already been reported a few times:
https://lists.schedmd.com/pipermail/slurm-users/2018-April/000885.html
https://bugs.schedmd.com/show_bug.cgi?id=5300

As far as I can tell, this is probably systemd messing with cgroups and
deciding it's the king of cgroups on the host. You'll find more context
and details in https://bugs.schedmd.com/show_bug.cgi?id=5292

Cheers,
--
Kilian
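P.S. If it does turn out to be the systemd/cgroup conflict described in
bug 5292, one mitigation that has come up for this class of problem
(please verify it against the bug and your Slurm version; this is a
sketch, not the official fix) is to make systemd delegate slurmd's cgroup
subtree instead of managing it itself, e.g. with a drop-in override:

  # example drop-in: /etc/systemd/system/slurmd.service.d/delegate.conf
  [Service]
  Delegate=yes

  # then reload systemd and restart slurmd on the affected node
  $ systemctl daemon-reload
  $ systemctl restart slurmd

With Delegate=yes, systemd is supposed to leave the cgroups below
slurmd's unit alone, so a later daemon-reload or unit start shouldn't
wipe out the device restrictions Slurm set up for the job.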