Hi Randy!

> We have a Slurm cluster with a number of nodes, some of which have more
> than one GPU. Users select how many or which GPUs they want with srun's
> "--gres" option. Nothing fancy here, and in general this works as
> expected. But starting a few days ago we've had problems on one machine.
> A specific user started a single-GPU session with srun, and nvidia-smi
> reported one GPU, as expected. But about two hours later, he could
> suddenly see all GPUs with nvidia-smi. To be clear, this is all from the
> interactive session provided by Slurm. He did not ssh to the machine.
> He's not running Docker. Nothing odd as far as we can tell.
>
> A big problem is that I've been unable to reproduce the issue. I'm
> confident that what this user is telling me is correct, but I can't do
> much until/unless I can reproduce it.
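One thing that might help narrow this down even before you can reproduce
it: the next time it happens, check whether the job step's device cgroup
still restricts access to the other GPUs. Assuming task/cgroup with
ConstrainDevices=yes on cgroup v1, the device whitelist typically lives
under a path like the one below (the uid/job/step numbers are
placeholders, and the exact layout can differ with your cgroup mounts):

  # on the compute node, while the job is running (placeholder IDs)
  $ cat /sys/fs/cgroup/devices/slurm/uid_1000/job_12345/step_0/devices.list

If the GPU restriction is there right after the job starts but gone a
couple of hours later, that points at something on the host rewriting the
cgroup hierarchy rather than at anything the user is doing.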
I think this kind of behavior has already been reported a few times:
https://lists.schedmd.com/pipermail/slurm-users/2018-April/000885.html
https://bugs.schedmd.com/show_bug.cgi?id=5300

As far as I can tell, this is probably systemd messing with cgroups and
deciding it's the king of cgroups on the host. You'll find more context
and details in https://bugs.schedmd.com/show_bug.cgi?id=5292

Cheers,
--
Kilian
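P.S. If it does turn out to be the systemd/cgroup conflict described in
bug 5292, one mitigation that has come up for this class of problem
(please verify it against the bug and your Slurm version; this is a
sketch, not the official fix) is to make systemd delegate slurmd's cgroup
subtree instead of managing it itself, e.g. with a drop-in override:

  # example drop-in: /etc/systemd/system/slurmd.service.d/delegate.conf
  [Service]
  Delegate=yes

  # then reload systemd and restart slurmd on the affected node
  $ systemctl daemon-reload
  $ systemctl restart slurmd

With Delegate=yes, systemd is supposed to leave the cgroups below
slurmd's unit alone, so a later daemon-reload or unit start shouldn't
wipe out the device restrictions Slurm set up for the job.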