I guess my next question is: are there any negative repercussions to setting "Delegate=yes" in slurmd.service?
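
In case it's useful to anyone following along, the override-file variant Marcus suggested would look something like this. This is only a sketch of the standard systemd drop-in mechanism, not what I actually ran on our nodes (I put Delegate=yes straight into the unit file, as shown below); the path is the usual drop-in location, adjust for your distro:

# /etc/systemd/system/slurmd.service.d/override.conf
# Tell systemd that slurmd manages its own cgroup subtree, so daemon-reload
# should leave the device cgroups created for jobs alone.
[Service]
Delegate=yes

$ sudo systemctl daemon-reload
$ sudo systemctl restart slurmd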
On Thu, Apr 11, 2019 at 8:21 AM Marcus Wagner <wag...@itc.rwth-aachen.de> wrote:

> I assume that without Delegate=yes this would also happen to regular jobs,
> which means nightly updates could "destroy" the cgroups created by Slurm
> and therefore let the jobs out "into the wild".
>
> Best
> Marcus
>
> P.S.:
> We had a similar problem with LSF.
>
> On 4/11/19 3:58 PM, Randall Radmer wrote:
>
> Yes, I was just testing that. Adding "Delegate=yes" seems to fix the
> problem (see below), but I wanted to try a few more things before saying
> anything.
>
> [computelab-136:~]$ grep ^Delegate /etc/systemd/system/slurmd.service
> Delegate=yes
> [computelab-136:~]$ nvidia-smi --query-gpu=index,name --format=csv
> index, name
> 0, Tesla T4
> [computelab-136:~]$ sudo systemctl daemon-reload; sudo systemctl restart slurmd
> [computelab-136:~]$ nvidia-smi --query-gpu=index,name --format=csv
> index, name
> 0, Tesla T4
>
> On Thu, Apr 11, 2019 at 7:53 AM Marcus Wagner <wag...@itc.rwth-aachen.de> wrote:
>
>> Hi Randall,
>>
>> could you please, for a test, add the following line to the [Service] part
>> of the slurmd.service file (or add an override file):
>>
>> Delegate=yes
>>
>> Best
>> Marcus
>>
>> On 4/11/19 3:11 PM, Randall Radmer wrote:
>>
>> It's now distressingly simple to reproduce this, based on Kilian's clue
>> (off topic, "Kilian's Clue" sounds like a good title for a Hardy Boys
>> mystery story).
>>
>> After limited testing, it seems to me that running "systemctl daemon-reload"
>> followed by "systemctl restart slurmd" breaks it. See below:
>>
>> [computelab-305:~]$ sudo systemctl restart slurmd
>> [computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
>> index, name
>> 0, Tesla T4
>> [computelab-305:~]$ sudo systemctl daemon-reload
>> [computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
>> index, name
>> 0, Tesla T4
>> [computelab-305:~]$ sudo systemctl restart slurmd
>> [computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
>> index, name
>> 0, Tesla T4
>> 1, Tesla T4
>> 2, Tesla T4
>> 3, Tesla T4
>> 4, Tesla T4
>> 5, Tesla T4
>> 6, Tesla T4
>> 7, Tesla T4
>> [computelab-305:~]$ slurmd -V
>> slurm 17.11.9-2
>>
>> On Wed, Apr 10, 2019 at 3:59 PM Kilian Cavalotti <kilian.cavalotti.w...@gmail.com> wrote:
>>
>>> Hi Randy!
>>>
>>> > We have a Slurm cluster with a number of nodes, some of which have
>>> > more than one GPU. Users select how many or which GPUs they want with
>>> > srun's "--gres" option. Nothing fancy here, and in general this works as
>>> > expected. But starting a few days ago we've had problems on one machine.
>>> > A specific user started a single-GPU session with srun, and nvidia-smi
>>> > reported one GPU, as expected. But about two hours later, he suddenly
>>> > could see all GPUs with nvidia-smi. To be clear, this is all from the
>>> > interactive session provided by Slurm. He did not ssh to the machine.
>>> > He's not running Docker. Nothing odd as far as we can tell.
>>> >
>>> > A big problem is that I've been unable to reproduce the problem. I have
>>> > confidence that what this user is telling me is correct, but I can't do
>>> > much until/unless I can reproduce it.
>>>
>>> I think this kind of behavior has already been reported a few times:
>>> https://lists.schedmd.com/pipermail/slurm-users/2018-April/000885.html
>>> https://bugs.schedmd.com/show_bug.cgi?id=5300
>>>
>>> As far as I can tell, it looks like this is probably systemd messing
>>> with cgroups and deciding it's the king of cgroups on the host.
>>>
>>> You'll find more context and details in
>>> https://bugs.schedmd.com/show_bug.cgi?id=5292
>>>
>>> Cheers,
>>> --
>>> Kilian
>>>
>> --
>> Marcus Wagner, Dipl.-Inf.
>>
>> IT Center
>> Abteilung: Systeme und Betrieb
>> RWTH Aachen University
>> Seffenter Weg 23
>> 52074 Aachen
>> Tel: +49 241 80-24383
>> Fax: +49 241 80-624383
>> wag...@itc.rwth-aachen.de
>> www.itc.rwth-aachen.de
>>
> --
> Marcus Wagner, Dipl.-Inf.
>
> IT Center
> Abteilung: Systeme und Betrieb
> RWTH Aachen University
> Seffenter Weg 23
> 52074 Aachen
> Tel: +49 241 80-24383
> Fax: +49 241 80-624383
> wag...@itc.rwth-aachen.de
> www.itc.rwth-aachen.de
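
P.S. For anyone else chasing this: besides watching nvidia-smi, the device cgroup Slurm builds for a job step can be inspected directly from inside the srun session. The following is only a sketch based on my understanding of the layout; it assumes cgroup v1 with ConstrainDevices=yes in cgroup.conf, the uid/job/step components of the path are placeholders, and reading devices.list may require root depending on permissions:

$ grep devices: /proc/self/cgroup
$ cat /sys/fs/cgroup/devices/slurm/uid_<uid>/job_<jobid>/step_<step>/devices.list

If the job is still confined, devices.list should show only the GPUs granted via --gres (NVIDIA GPUs are character devices with major number 195, e.g. "c 195:0 rw") plus a handful of standard devices; a blanket "a *:* rwm" entry would mean the cgroup restriction is gone.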