Yes, I was just testing that. Adding "Delegate=yes" seems to fix the problem (see below), but I wanted to try a few more things before saying anything.
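For anyone who wants to try the same thing, the directive Marcus suggests can also go into a drop-in override rather than the unit file itself. A minimal sketch (assuming a stock systemd layout, so the override path may differ on your distro):

# sudo systemctl edit slurmd
# writes the following to /etc/systemd/system/slurmd.service.d/override.conf:
[Service]
Delegate=yes

followed by the usual "systemctl daemon-reload" and "systemctl restart slurmd". On this node the directive sits directly in the unit file, as the grep below shows: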
[computelab-136:~]$ grep ^Delegate /etc/systemd/system/slurmd.service
Delegate=yes
[computelab-136:~]$ nvidia-smi --query-gpu=index,name --format=csv
index, name
0, Tesla T4
[computelab-136:~]$ sudo systemctl daemon-reload; sudo systemctl restart slurmd
[computelab-136:~]$ nvidia-smi --query-gpu=index,name --format=csv
index, name
0, Tesla T4

On Thu, Apr 11, 2019 at 7:53 AM Marcus Wagner <wag...@itc.rwth-aachen.de> wrote:

> Hi Randall,
>
> could you please, as a test, add the following line to the [Service] section
> of the slurmd.service file (or add an override file)?
>
> Delegate=yes
>
> Best
> Marcus
>
>
> On 4/11/19 3:11 PM, Randall Radmer wrote:
>
> It's now distressingly simple to reproduce this, based on Kilian's clue
> (off topic, "Kilian's Clue" sounds like a good title for a Hardy Boys
> Mystery Story).
>
> After limited testing, it seems to me that running "systemctl
> daemon-reload" followed by "systemctl restart slurmd" breaks it. See
> below:
>
> [computelab-305:~]$ sudo systemctl restart slurmd
> [computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
> index, name
> 0, Tesla T4
> [computelab-305:~]$ sudo systemctl daemon-reload
> [computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
> index, name
> 0, Tesla T4
> [computelab-305:~]$ sudo systemctl restart slurmd
> [computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
> index, name
> 0, Tesla T4
> 1, Tesla T4
> 2, Tesla T4
> 3, Tesla T4
> 4, Tesla T4
> 5, Tesla T4
> 6, Tesla T4
> 7, Tesla T4
> [computelab-305:~]$ slurmd -V
> slurm 17.11.9-2
>
>
> On Wed, Apr 10, 2019 at 3:59 PM Kilian Cavalotti <
> kilian.cavalotti.w...@gmail.com> wrote:
>
>> Hi Randy!
>>
>> > We have a Slurm cluster with a number of nodes, some of which have more
>> > than one GPU. Users select how many or which GPUs they want with srun's
>> > "--gres" option. Nothing fancy here, and in general this works as
>> > expected. But starting a few days ago we've had problems on one machine.
>> > A specific user started a single-GPU session with srun, and nvidia-smi
>> > reported one GPU, as expected. But about two hours later, he could
>> > suddenly see all GPUs with nvidia-smi. To be clear, this is all from the
>> > interactive session provided by Slurm. He did not ssh to the machine.
>> > He's not running Docker. Nothing odd as far as we can tell.
>> >
>> > A big problem is that I've been unable to reproduce the problem. I'm
>> > confident that what this user is telling me is correct, but I can't do
>> > much until/unless I can reproduce it.
>>
>> I think this kind of behavior has already been reported a few times:
>> https://lists.schedmd.com/pipermail/slurm-users/2018-April/000885.html
>> https://bugs.schedmd.com/show_bug.cgi?id=5300
>>
>> As far as I can tell, it looks like this is probably systemd messing
>> with cgroups and deciding it's the king of cgroups on the host.
>>
>> You'll find more context and details in
>> https://bugs.schedmd.com/show_bug.cgi?id=5292
>>
>> Cheers,
>> --
>> Kilian
>>
>
> --
> Marcus Wagner, Dipl.-Inf.
>
> IT Center
> Abteilung: Systeme und Betrieb
> RWTH Aachen University
> Seffenter Weg 23
> 52074 Aachen
> Tel: +49 241 80-24383
> Fax: +49 241 80-624383
> wag...@itc.rwth-aachen.de
> www.itc.rwth-aachen.de
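For completeness: one way to check whether a daemon-reload has clobbered the device restrictions is to look at the devices cgroup of the running job step. The commands below are only a sketch; they assume cgroup v1 with ConstrainDevices enabled and Slurm's default hierarchy (slurm/uid_<uid>/job_<jobid>/step_<stepid>), and the uid/job/step numbers are placeholders.

$ grep devices /proc/self/cgroup
# e.g. 5:devices:/slurm/uid_1000/job_1234/step_0  (placeholder values)
$ cat /sys/fs/cgroup/devices/slurm/uid_1000/job_1234/step_0/devices.list
# With the constraint intact you would expect a short whitelist of character
# devices (NVIDIA GPUs are typically char major 195); if the restriction has
# been lost, you would expect something far more permissive, e.g. "a *:* rwm".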