It's now distressingly simple to reproduce this, based on Kilian's clue
(off topic, "Kilian's Clue" sounds like a good title for a Hardy Boys
Mystery Story).

After limited testing, it seems to me that running "systemctl
daemon-reload" followed by "systemctl restart slurmd" breaks it, after
which nvidia-smi lists all eight GPUs instead of one.  See:

[computelab-305:~]$ sudo systemctl restart slurmd
[computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
index, name
0, Tesla T4
[computelab-305:~]$ sudo systemctl daemon-reload
[computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
index, name
0, Tesla T4
[computelab-305:~]$ sudo systemctl restart slurmd
[computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
index, name
0, Tesla T4
1, Tesla T4
2, Tesla T4
3, Tesla T4
4, Tesla T4
5, Tesla T4
6, Tesla T4
7, Tesla T4
[computelab-305:~]$ slurmd -V
slurm 17.11.9-2
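
In case it helps anyone check their own nodes, here is a minimal sketch
of a check that could be run inside an srun session.  The function names
count_gpus and check_gpus are made up for illustration, and the expected
count is an assumption supplied by the caller (whatever was requested
with "--gres"); none of this is part of Slurm or nvidia-smi.

```shell
#!/bin/sh
# Count the GPUs in the CSV output of
#   nvidia-smi --query-gpu=index,name --format=csv
# read from stdin. The first line is the CSV header, so it is skipped.
count_gpus() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }'
}

# Hypothetical helper: compare the number of visible GPUs with the number
# the job was expected to get (e.g. the N in --gres=gpu:N).
# Returns 0 when they match, 1 otherwise.
check_gpus() {
    expected=$1
    visible=$(count_gpus)
    if [ "$visible" -eq "$expected" ]; then
        echo "ok: $visible GPU(s) visible"
    else
        echo "LEAK: $visible GPU(s) visible, expected $expected"
        return 1
    fi
}
```

Usage would be something like
"nvidia-smi --query-gpu=index,name --format=csv | check_gpus 1",
run once before and once after the daemon-reload and slurmd restart.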

On Wed, Apr 10, 2019 at 3:59 PM Kilian Cavalotti wrote:

> Hi Randy!
>
> > We have a slurm cluster with a number of nodes, some of which have more
> > than one GPU.  Users select how many or which GPUs they want with srun's
> > "--gres" option.  Nothing fancy here, and in general this works as
> > expected.  But starting a few days ago we've had problems on one machine.
> > A specific user started a single-gpu session with srun, and nvidia-smi
> > reported one GPU, as expected.  But about two hours later, he suddenly
> > could see all GPUs with nvidia-smi.  To be clear, this is all from the
> > interactive session provided by Slurm.  He did not ssh to the machine.
> > He's not running Docker.  Nothing odd as far as we can tell.
> >
> > A big problem is I've been unable to reproduce the problem.  I have
> > confidence that what this user is telling me is correct, but I can't do
> > much until/unless I can reproduce it.
>
> I think this kind of behavior has already been reported a few times:
> As far as I can tell, it looks like this is probably systemd messing
> up with cgroups and deciding it's the king of cgroups on the host.
> You'll find more context and details in
> Cheers,
> --
> Kilian
