It's now distressingly simple to reproduce this, based on Kilian's clue (off topic, "Kilian's Clue" sounds like a good title for a Hardy Boys Mystery Story).
After limited testing, seems to me that running "systemctl daemon-reload" followed by "systemctl restart slurmd" breaks it. See below:

[computelab-305:~]$ sudo systemctl restart slurmd
[computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
index, name
0, Tesla T4
[computelab-305:~]$ sudo systemctl daemon-reload
[computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
index, name
0, Tesla T4
[computelab-305:~]$ sudo systemctl restart slurmd
[computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
index, name
0, Tesla T4
1, Tesla T4
2, Tesla T4
3, Tesla T4
4, Tesla T4
5, Tesla T4
6, Tesla T4
7, Tesla T4
[computelab-305:~]$ slurmd -V
slurm 17.11.9-2

On Wed, Apr 10, 2019 at 3:59 PM Kilian Cavalotti <kilian.cavalotti.w...@gmail.com> wrote:

> Hi Randy!
>
> > We have a slurm cluster with a number of nodes, some of which have more
> > than one GPU. Users select how many or which GPUs they want with srun's
> > "--gres" option. Nothing fancy here, and in general this works as
> > expected. But starting a few days ago we've had problems on one machine.
> > A specific user started a single-gpu session with srun, and nvidia-smi
> > reported one GPU, as expected. But about two hours later, he suddenly
> > could see all GPUs with nvidia-smi. To be clear, this is all from the
> > iterative session provided by Slurm. He did not ssh to the machine. He's
> > not running Docker. Nothing odd as far as we can tell.
> >
> > A big problem is I've been unable to reproduce the problem. I have
> > confidence that what this user is telling me is correct, but I can't do
> > much until/unless I can reproduce it.
>
> I think this kind of behavior has already been reported a few times:
> https://lists.schedmd.com/pipermail/slurm-users/2018-April/000885.html
> https://bugs.schedmd.com/show_bug.cgi?id=5300
>
> As far as I can tell, it looks like this is probably systemd messing
> up with cgroups and deciding it's the king of cgroups on the host.
>
> You'll find more context and details in
> https://bugs.schedmd.com/show_bug.cgi?id=5292
>
> Cheers,
> --
> Kilian
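
In case anyone wants to script the check rather than eyeball it, here is a rough sketch of the same sequence I ran by hand. The function names (count_gpus, reproduce) are made up for illustration, and it assumes you run it from inside the srun session on the affected node with sudo rights. It only compares how many GPUs nvidia-smi lists before and after the reload/restart; it does not explain why the count changes.

```shell
#!/bin/sh
# Count the GPUs listed in the CSV output of
#   nvidia-smi --query-gpu=index,name --format=csv
# Reads the CSV on stdin, skips the header line and any blank lines.
count_gpus() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }'
}

# Hypothetical helper: repeat the manual sequence from the transcript and
# report whether the number of visible GPUs changed.
# Assumption: called from inside the Slurm-provided session.
reproduce() {
    before=$(nvidia-smi --query-gpu=index,name --format=csv | count_gpus)
    sudo systemctl daemon-reload
    sudo systemctl restart slurmd
    after=$(nvidia-smi --query-gpu=index,name --format=csv | count_gpus)
    echo "GPUs visible before: $before, after: $after"
    # Non-zero exit status means the visible GPU count changed.
    [ "$before" -eq "$after" ]
}
```

Source the file and call reproduce; a non-zero exit status means the session saw a different number of GPUs afterwards, which is the symptom described above.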