I guess my next question is: are there any negative repercussions to setting "Delegate=yes" in slurmd.service?
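
In case it's useful to anyone following along, the override-file variant Marcus suggested would look something like this. This is only a sketch of the standard systemd drop-in mechanism, not what I actually ran on our nodes (I put Delegate=yes straight into the unit file, as shown below); the path is the usual drop-in location, adjust for your distro:

# /etc/systemd/system/slurmd.service.d/override.conf
# Tell systemd that slurmd manages its own cgroup subtree, so daemon-reload
# should leave the device cgroups created for jobs alone.
[Service]
Delegate=yes

$ sudo systemctl daemon-reload
$ sudo systemctl restart slurmd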
On Thu, Apr 11, 2019 at 8:21 AM Marcus Wagner <wag...@itc.rwth-aachen.de> wrote:

> I assume that without Delegate=yes this would also happen to regular jobs,
> which means nightly updates could "destroy" the cgroups created by Slurm
> and therefore let the jobs out "into the wild".
>
> Best
> Marcus
>
> P.S.:
> We had a similar problem with LSF.
>
> On 4/11/19 3:58 PM, Randall Radmer wrote:
>
> Yes, I was just testing that. Adding "Delegate=yes" seems to fix the
> problem (see below), but I wanted to try a few more things before saying
> anything.
>
> [computelab-136:~]$ grep ^Delegate /etc/systemd/system/slurmd.service
> Delegate=yes
> [computelab-136:~]$ nvidia-smi --query-gpu=index,name --format=csv
> index, name
> 0, Tesla T4
> [computelab-136:~]$ sudo systemctl daemon-reload; sudo systemctl restart slurmd
> [computelab-136:~]$ nvidia-smi --query-gpu=index,name --format=csv
> index, name
> 0, Tesla T4
>
> On Thu, Apr 11, 2019 at 7:53 AM Marcus Wagner <wag...@itc.rwth-aachen.de> wrote:
>
>> Hi Randall,
>>
>> could you please, for a test, add the following line to the [Service] part
>> of the slurmd.service file (or add an override file):
>>
>> Delegate=yes
>>
>> Best
>> Marcus
>>
>> On 4/11/19 3:11 PM, Randall Radmer wrote:
>>
>> It's now distressingly simple to reproduce this, based on Kilian's clue
>> (off topic, "Kilian's Clue" sounds like a good title for a Hardy Boys
>> mystery story).
>>
>> After limited testing, it seems to me that running "systemctl daemon-reload"
>> followed by "systemctl restart slurmd" breaks it. See below:
>>
>> [computelab-305:~]$ sudo systemctl restart slurmd
>> [computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
>> index, name
>> 0, Tesla T4
>> [computelab-305:~]$ sudo systemctl daemon-reload
>> [computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
>> index, name
>> 0, Tesla T4
>> [computelab-305:~]$ sudo systemctl restart slurmd
>> [computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
>> index, name
>> 0, Tesla T4
>> 1, Tesla T4
>> 2, Tesla T4
>> 3, Tesla T4
>> 4, Tesla T4
>> 5, Tesla T4
>> 6, Tesla T4
>> 7, Tesla T4
>> [computelab-305:~]$ slurmd -V
>> slurm 17.11.9-2
>>
>> On Wed, Apr 10, 2019 at 3:59 PM Kilian Cavalotti <kilian.cavalotti.w...@gmail.com> wrote:
>>
>>> Hi Randy!
>>>
>>> > We have a Slurm cluster with a number of nodes, some of which have
>>> > more than one GPU. Users select how many or which GPUs they want with
>>> > srun's "--gres" option. Nothing fancy here, and in general this works as
>>> > expected. But starting a few days ago we've had problems on one machine.
>>> > A specific user started a single-GPU session with srun, and nvidia-smi
>>> > reported one GPU, as expected. But about two hours later, he suddenly
>>> > could see all GPUs with nvidia-smi. To be clear, this is all from the
>>> > interactive session provided by Slurm. He did not ssh to the machine.
>>> > He's not running Docker. Nothing odd as far as we can tell.
>>> >
>>> > A big problem is that I've been unable to reproduce the problem. I have
>>> > confidence that what this user is telling me is correct, but I can't do
>>> > much until/unless I can reproduce it.
>>>
>>> I think this kind of behavior has already been reported a few times:
>>> https://lists.schedmd.com/pipermail/slurm-users/2018-April/000885.html
>>> https://bugs.schedmd.com/show_bug.cgi?id=5300
>>>
>>> As far as I can tell, it looks like this is probably systemd messing
>>> with cgroups and deciding it's the king of cgroups on the host.
>>>
>>> You'll find more context and details in
>>> https://bugs.schedmd.com/show_bug.cgi?id=5292
>>>
>>> Cheers,
>>> --
>>> Kilian
>>>
>> --
>> Marcus Wagner, Dipl.-Inf.
>>
>> IT Center
>> Abteilung: Systeme und Betrieb
>> RWTH Aachen University
>> Seffenter Weg 23
>> 52074 Aachen
>> Tel: +49 241 80-24383
>> Fax: +49 241 80-624383
>> wag...@itc.rwth-aachen.de
>> www.itc.rwth-aachen.de
>>
> --
> Marcus Wagner, Dipl.-Inf.
>
> IT Center
> Abteilung: Systeme und Betrieb
> RWTH Aachen University
> Seffenter Weg 23
> 52074 Aachen
> Tel: +49 241 80-24383
> Fax: +49 241 80-624383
> wag...@itc.rwth-aachen.de
> www.itc.rwth-aachen.de
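
P.S. For anyone else chasing this: besides watching nvidia-smi, the device cgroup Slurm builds for a job step can be inspected directly from inside the srun session. The following is only a sketch based on my understanding of the layout; it assumes cgroup v1 with ConstrainDevices=yes in cgroup.conf, the uid/job/step components of the path are placeholders, and reading devices.list may require root depending on permissions:

$ grep devices: /proc/self/cgroup
$ cat /sys/fs/cgroup/devices/slurm/uid_<uid>/job_<jobid>/step_<step>/devices.list

If the job is still confined, devices.list should show only the GPUs granted via --gres (NVIDIA GPUs are character devices with major number 195, e.g. "c 195:0 rw") plus a handful of standard devices; a blanket "a *:* rwm" entry would mean the cgroup restriction is gone.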