Hi Randall,
could you please, as a test, add the following line to the [Service]
section of the slurmd.service file (or add an override file):
Delegate=yes
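For example, rather than editing the packaged unit file, the setting
can go in a drop-in (a sketch; "systemctl edit slurmd" creates the
same file for you, and the path follows the standard systemd drop-in
convention):
$ sudo mkdir -p /etc/systemd/system/slurmd.service.d
$ sudo tee /etc/systemd/system/slurmd.service.d/override.conf <<'EOF'
[Service]
# Ask systemd to leave this unit's cgroup subtree alone, so a
# daemon-reload no longer rewrites the device cgroups Slurm sets up.
Delegate=yes
EOF
$ sudo systemctl daemon-reload
$ sudo systemctl restart slurmd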
Best
Marcus
On 4/11/19 3:11 PM, Randall Radmer wrote:
It's now distressingly simple to reproduce this, based on Kilian's
clue (off topic, "Kilian's Clue" sounds like a good title for a Hardy
Boys mystery story).
After limited testing, it seems to me that running "systemctl
daemon-reload" followed by "systemctl restart slurmd" breaks it. See
below:
[computelab-305:~]$ sudo systemctl restart slurmd
[computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
index, name
0, Tesla T4
[computelab-305:~]$ sudo systemctl daemon-reload
[computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
index, name
0, Tesla T4
[computelab-305:~]$ sudo systemctl restart slurmd
[computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
index, name
0, Tesla T4
1, Tesla T4
2, Tesla T4
3, Tesla T4
4, Tesla T4
5, Tesla T4
6, Tesla T4
7, Tesla T4
[computelab-305:~]$ slurmd -V
slurm 17.11.9-2
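(Side note for anyone else chasing this: one way to see whether the
device cgroup is still constraining a step is to read its devices.list
file. The path below assumes cgroup v1 and Slurm's usual
/slurm/uid_<uid>/job_<jobid>/step_<stepid> hierarchy; the IDs are
placeholders. On a healthy node it should list only the granted GPU
device, not all of them:)
[computelab-305:~]$ cat /sys/fs/cgroup/devices/slurm/uid_1000/job_12345/step_0/devices.list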
On Wed, Apr 10, 2019 at 3:59 PM Kilian Cavalotti
<kilian.cavalotti.w...@gmail.com> wrote:
Hi Randy!
> We have a slurm cluster with a number of nodes, some of which
> have more than one GPU. Users select how many or which GPUs they
> want with srun's "--gres" option. Nothing fancy here, and in
> general this works as expected. But starting a few days ago we've
> had problems on one machine. A specific user started a single-GPU
> session with srun, and nvidia-smi reported one GPU, as expected.
> But about two hours later, he suddenly could see all GPUs with
> nvidia-smi. To be clear, this is all from the interactive session
> provided by Slurm. He did not ssh to the machine. He's not
> running Docker. Nothing odd as far as we can tell.
>
> A big problem is I've been unable to reproduce the problem. I
> have confidence that what this user is telling me is correct, but
> I can't do much until/unless I can reproduce it.
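(For reference, the kind of single-GPU interactive request described
above would look roughly like this; the exact command line is
illustrative, not the user's actual invocation:)
$ srun --gres=gpu:1 --pty bash
$ nvidia-smi --query-gpu=index,name --format=csv
index, name
0, Tesla T4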
I think this kind of behavior has already been reported a few times:
https://lists.schedmd.com/pipermail/slurm-users/2018-April/000885.html
https://bugs.schedmd.com/show_bug.cgi?id=5300
As far as I can tell, this is probably systemd messing with cgroups
and deciding it's the king of cgroups on the host.
You'll find more context and details in
https://bugs.schedmd.com/show_bug.cgi?id=5292
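(A quick check for the delegation angle: systemd exposes a unit's
Delegate setting as a property. Without an explicit Delegate=yes you
would expect to see "no" here, meaning systemd considers itself free
to rewrite the unit's cgroup attributes on daemon-reload:)
$ systemctl show slurmd --property=Delegate
Delegate=no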
Cheers,
--
Kilian
--
Marcus Wagner, Dipl.-Inf.
IT Center
Department: Systems and Operations
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de