Hi Randall,
could you please, as a test, add the following line to the [Service]
section of the slurmd.service file (or add an override file):
Delegate=yes
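For example, rather than editing the packaged unit file, the setting
can go in a drop-in (a sketch; "systemctl edit slurmd" creates the
same file for you, and the path follows the standard systemd drop-in
convention):
$ sudo mkdir -p /etc/systemd/system/slurmd.service.d
$ sudo tee /etc/systemd/system/slurmd.service.d/override.conf <<'EOF'
[Service]
# Ask systemd to leave this unit's cgroup subtree alone, so a
# daemon-reload no longer rewrites the device cgroups Slurm sets up.
Delegate=yes
EOF
$ sudo systemctl daemon-reload
$ sudo systemctl restart slurmd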
Best
Marcus
On 4/11/19 3:11 PM, Randall Radmer wrote:
It's now distressingly simple to reproduce this, based on Kilian's
clue (off topic, "Kilian's Clue" sounds like a good title for a Hardy
Boys mystery story).
After limited testing, it seems to me that running "systemctl
daemon-reload" followed by "systemctl restart slurmd" breaks it. See
below:
[computelab-305:~]$ sudo systemctl restart slurmd
[computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
index, name
0, Tesla T4
[computelab-305:~]$ sudo systemctl daemon-reload
[computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
index, name
0, Tesla T4
[computelab-305:~]$ sudo systemctl restart slurmd
[computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
index, name
0, Tesla T4
1, Tesla T4
2, Tesla T4
3, Tesla T4
4, Tesla T4
5, Tesla T4
6, Tesla T4
7, Tesla T4
[computelab-305:~]$ slurmd -V
slurm 17.11.9-2
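(Side note for anyone else chasing this: one way to see whether the
device cgroup is still constraining a step is to read its devices.list
file. The path below assumes cgroup v1 and Slurm's usual
/slurm/uid_<uid>/job_<jobid>/step_<stepid> hierarchy; the IDs are
placeholders. On a healthy node it should list only the granted GPU
device, not all of them:)
[computelab-305:~]$ cat /sys/fs/cgroup/devices/slurm/uid_1000/job_12345/step_0/devices.list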
On Wed, Apr 10, 2019 at 3:59 PM Kilian Cavalotti
<kilian.cavalotti.w...@gmail.com> wrote:
Hi Randy!
> We have a slurm cluster with a number of nodes, some of which
> have more than one GPU. Users select how many or which GPUs they
> want with srun's "--gres" option. Nothing fancy here, and in
> general this works as expected. But starting a few days ago we've
> had problems on one machine. A specific user started a single-GPU
> session with srun, and nvidia-smi reported one GPU, as expected.
> But about two hours later, he suddenly could see all GPUs with
> nvidia-smi. To be clear, this is all from the interactive session
> provided by Slurm. He did not ssh to the machine. He's not
> running Docker. Nothing odd as far as we can tell.
>
> A big problem is I've been unable to reproduce the problem. I
> have confidence that what this user is telling me is correct, but
> I can't do much until/unless I can reproduce it.
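(For reference, the kind of single-GPU interactive request described
above would look roughly like this; the exact command line is
illustrative, not the user's actual invocation:)
$ srun --gres=gpu:1 --pty bash
$ nvidia-smi --query-gpu=index,name --format=csv
index, name
0, Tesla T4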
I think this kind of behavior has already been reported a few times:
https://lists.schedmd.com/pipermail/slurm-users/2018-April/000885.html
https://bugs.schedmd.com/show_bug.cgi?id=5300
As far as I can tell, this is probably systemd messing with cgroups
and deciding it's the king of cgroups on the host.
You'll find more context and details in
https://bugs.schedmd.com/show_bug.cgi?id=5292
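(A quick check for the delegation angle: systemd exposes a unit's
Delegate setting as a property. Without an explicit Delegate=yes you
would expect to see "no" here, meaning systemd considers itself free
to rewrite the unit's cgroup attributes on daemon-reload:)
$ systemctl show slurmd --property=Delegate
Delegate=no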
Cheers,
--
Kilian
--
Marcus Wagner, Dipl.-Inf.
IT Center
Department: Systems and Operations
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de