Thanks Kilian! I'll look at this today. -Randy
On Wed, Apr 10, 2019 at 3:59 PM Kilian Cavalotti <kilian.cavalotti.w...@gmail.com> wrote:
> Hi Randy!
>
> > We have a slurm cluster with a number of nodes, some of which have more
> > than one GPU. Users select how many or which GPUs they want with srun's
> > "--gres" option. Nothing fancy here, and in general this works as
> > expected. But starting a few days ago we've had problems on one machine.
> > A specific user started a single-GPU session with srun, and nvidia-smi
> > reported one GPU, as expected. But about two hours later, he suddenly
> > could see all GPUs with nvidia-smi. To be clear, this is all from the
> > interactive session provided by Slurm. He did not ssh to the machine.
> > He's not running Docker. Nothing odd as far as we can tell.
> >
> > A big problem is that I've been unable to reproduce the problem. I have
> > confidence that what this user is telling me is correct, but I can't do
> > much until/unless I can reproduce it.
>
> I think this kind of behavior has already been reported a few times:
> https://lists.schedmd.com/pipermail/slurm-users/2018-April/000885.html
> https://bugs.schedmd.com/show_bug.cgi?id=5300
>
> As far as I can tell, it looks like this is probably systemd messing
> up with cgroups and deciding it's the king of cgroups on the host.
>
> You'll find more context and details in
> https://bugs.schedmd.com/show_bug.cgi?id=5292
>
> Cheers,
> --
> Kilian
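
P.S. In case it helps while trying to reproduce this: below is a rough
watcher sketch (Python, untested) that polls a running job's devices
cgroup and logs whenever the allowed-device list changes, which should
give a timestamp to correlate against systemd/slurmd logs. It assumes
cgroup v1 with ConstrainDevices=yes and Slurm's usual hierarchy under
/sys/fs/cgroup/devices/slurm/uid_<uid>/job_<jobid>/; the exact path may
differ depending on your cgroup.conf setup.

    # watch_devices.py -- poll a Slurm job's devices cgroup (cgroup v1)
    # and report when its allowed-device whitelist changes.
    import sys
    import time

    def devices_list(uid, jobid):
        """Read the job-level devices.list (assumed path, cgroup v1)."""
        path = f"/sys/fs/cgroup/devices/slurm/uid_{uid}/job_{jobid}/devices.list"
        with open(path) as f:
            return f.read()

    def watch(uid, jobid, interval=60):
        """Print a timestamped diff marker whenever the list changes."""
        previous = None
        while True:
            try:
                current = devices_list(uid, jobid)
            except FileNotFoundError:
                print("cgroup path gone -- job probably ended")
                return
            if current != previous:
                print(time.strftime("%F %T"), "devices.list is now:")
                print(current)
                previous = current
            time.sleep(interval)

    if __name__ == "__main__":
        watch(int(sys.argv[1]), int(sys.argv[2]))

Run it on the compute node as e.g. "python3 watch_devices.py <uid> <jobid>";
if devices.list suddenly grows to include all the /dev/nvidia* entries,
the timestamp should narrow down what touched the cgroup.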