I assume that without Delegate=yes this would also happen to regular jobs, which means nightly updates could "destroy" the cgroups created by Slurm and therefore let the jobs out "into the wild".
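
A quick way to check whether a node's slurmd is already running with
delegation (assuming systemctl show exposes the Delegate property,
which the systemd versions I have seen do) would be something like:

# does systemd delegate the cgroup subtree to slurmd?
systemctl show slurmd --property=Delegate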

Best
Marcus

P.S.:
We had a similar problem with LSF

On 4/11/19 3:58 PM, Randall Radmer wrote:
Yes, I was just testing that.  Adding "Delegate=yes" seems to fix the problem (see below), but I wanted to try a few more things before saying anything.

[computelab-136:~]$ grep ^Delegate /etc/systemd/system/slurmd.service
Delegate=yes
[computelab-136:~]$ nvidia-smi --query-gpu=index,name --format=csv
index, name
0, Tesla T4
[computelab-136:~]$ sudo systemctl daemon-reload; sudo systemctl restart slurmd
[computelab-136:~]$ nvidia-smi --query-gpu=index,name --format=csv
index, name
0, Tesla T4



On Thu, Apr 11, 2019 at 7:53 AM Marcus Wagner <wag...@itc.rwth-aachen.de> wrote:

    Hi Randall,

    could you please, as a test, add the following line to the [Service]
    section of the slurmd.service file (or add an override file):

    Delegate=yes
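
    If you prefer not to touch the packaged unit file, a drop-in
    override should work as well; a rough sketch (the file name
    delegate.conf is just an example) would be:

    # /etc/systemd/system/slurmd.service.d/delegate.conf
    [Service]
    Delegate=yes

    sudo systemctl daemon-reload
    sudo systemctl restart slurmd

    ("systemctl edit slurmd" does roughly the same thing interactively.)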


    Best
    Marcus



    On 4/11/19 3:11 PM, Randall Radmer wrote:
    It's now distressingly simple to reproduce this, based on
    Kilian's clue (off topic, "Kilian's Clue" sounds like a good
    title for a Hardy Boys Mystery Story).

    After limited testing, it seems to me that running "systemctl
    daemon-reload" followed by "systemctl restart slurmd" breaks
    it.  See below:

    [computelab-305:~]$ sudo systemctl restart slurmd
    [computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
    index, name
    0, Tesla T4
    [computelab-305:~]$ sudo systemctl daemon-reload
    [computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
    index, name
    0, Tesla T4
    [computelab-305:~]$ sudo systemctl restart slurmd
    [computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
    index, name
    0, Tesla T4
    1, Tesla T4
    2, Tesla T4
    3, Tesla T4
    4, Tesla T4
    5, Tesla T4
    6, Tesla T4
    7, Tesla T4
    [computelab-305:~]$ slurmd -V
    slurm 17.11.9-2
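
    In case it helps with reproducing this: one way to see what the
    devices cgroup actually allows from inside the job shell (assuming
    cgroup v1 with ConstrainDevices=yes in cgroup.conf) is something
    like:

    # print the allowed-device rules for this shell's devices cgroup
    cat /sys/fs/cgroup/devices$(awk -F: '$2 == "devices" {print $3}' /proc/self/cgroup)/devices.list

    Running that before and after the daemon-reload/restart sequence
    should show whether the GPU restrictions survived.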


    On Wed, Apr 10, 2019 at 3:59 PM Kilian Cavalotti
    <kilian.cavalotti.w...@gmail.com> wrote:

        Hi Randy!

        > We have a slurm cluster with a number of nodes, some of
        which have more than one GPU.  Users select how many or which
        GPUs they want with srun's "--gres" option.  Nothing fancy
        here, and in general this works as expected.  But starting a
        few days ago we've had problems on one machine.  A specific
        user started a single-gpu session with srun, and nvidia-smi
        reported one GPU, as expected.  But about two hours later, he
        suddenly could see all GPUs with nvidia-smi.  To be clear,
        this is all from the interactive session provided by Slurm.  He
        did not ssh to the machine.  He's not running Docker. 
        Nothing odd as far as we can tell.
        >
        > A big problem is I've been unable to reproduce the
        problem.  I have confidence that what this user is telling me
        is correct, but I can't do much until/unless I can reproduce it.

        I think this kind of behavior has already been reported a few
        times:
        https://lists.schedmd.com/pipermail/slurm-users/2018-April/000885.html
        https://bugs.schedmd.com/show_bug.cgi?id=5300

        As far as I can tell, it looks like this is probably systemd
        messing with cgroups and deciding it's the king of cgroups on
        the host.

        You'll find more context and details in
        https://bugs.schedmd.com/show_bug.cgi?id=5292

        Cheers,
        --
        Kilian



--
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de
