Yes, UGE supports this out of the box. Depending on whether the job is a 
regular job or a Docker container, the method used to restrict access to 
only the assigned GPU differs slightly. UGE will also only schedule jobs 
onto nodes where it is guaranteed to be able to do this.

The interface for configuring this is a set of fairly versatile extensions to 
RSMAPs, as pointed out by Skylar.
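
Roughly, and from memory (so treat this as a sketch and check the RSMAP and 
cgroups sections of the UGE documentation for the exact syntax), the GPU 
complex becomes an RSMAP instead of a plain INT, the individual device IDs 
are listed per host, and the cgroup integration is enabled in the host 
configuration:

  # qconf -mc
  #name   shortcut   type    relop   requestable   consumable   default   urgency
  gpu     gpu        RSMAP   <=      YES           HOST         0         0

  # qconf -me <hostname>
  complex_values   gpu=4(0 1 2 3)

  # qconf -mconf <hostname>
  cgroups_params   cgroup_path=/sys/fs/cgroup cpuset=true devices=/dev/nvidia*

A job requesting gpu=1 then gets exactly one of the listed device IDs 
assigned, and only the matching /dev/nvidia* device ends up visible inside 
the job's cgroup.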

Cheers,

Fritz

> Am 14.08.2019 um 17:16 schrieb Tina Friedrich <tina.friedr...@it.ox.ac.uk>:
> 
> Hello,
> 
> from a kernel/mechanism point of view, it is perfectly possible to 
> restrict device access using cgroups. I use that on my current cluster and 
> it works really well (both for things like CPU cores and GPUs - you only 
> see what you request, even using something like 'nvidia-smi').
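
For reference, the kernel mechanism behind this is the cgroup v1 "devices" 
controller. A purely illustrative sketch of what a per-job setup looks like 
at that level (paths, the job ID and the device numbers are examples only, 
and $JOB_PID stands for the job's process ID):

  mkdir -p /sys/fs/cgroup/devices/sge/job_12345
  echo 'a'            > /sys/fs/cgroup/devices/sge/job_12345/devices.deny   # deny all devices first
  echo 'c 195:0 rw'   > /sys/fs/cgroup/devices/sge/job_12345/devices.allow  # /dev/nvidia0 only
  echo 'c 195:255 rw' > /sys/fs/cgroup/devices/sge/job_12345/devices.allow  # /dev/nvidiactl
  echo "$JOB_PID"     > /sys/fs/cgroup/devices/sge/job_12345/tasks          # put the job in the cgroup

A real setup would also re-allow basics like /dev/null, /dev/zero and the 
ptys, but the effect is the same: processes in that cgroup simply cannot 
open the other /dev/nvidia* nodes, which is why tools like nvidia-smi only 
show the granted card.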
> 
> Sadly, my current cluster isn't Grid Engine based :( and I have no idea 
> if SoGE or UGE support doing so out of the box - I've never had to do 
> that whilst still working with Grid Engine. Wouldn't be surprised if UGE 
> can do it.
> 
> You could probably script something yourself - I know I made a custom 
> suspend method once that used cgroups for non-MPI jobs.
> 
> Tina
> 
> On 14/08/2019 15:35, Andreas Haupt wrote:
>> Hi Dj,
>> 
>> we do this by setting $CUDA_VISIBLE_DEVICES in a prolog script (set
>> according to what the job has requested).
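
A minimal sketch of that kind of prolog (the per-job bookkeeping file used 
here is hypothetical - every site tracks the granted IDs differently - and 
appending to the job's environment file is one commonly used way to get a 
variable from the prolog into the job):

  #!/bin/sh
  # prolog sketch: expose only the granted GPU(s) via CUDA_VISIBLE_DEVICES
  GRANTED=$(cat /var/lib/gpu-alloc/"$JOB_ID" 2>/dev/null)   # e.g. "0" or "1,3" (hypothetical file)
  echo "CUDA_VISIBLE_DEVICES=$GRANTED" >> "$SGE_JOB_SPOOL_DIR/environment"

Note this only steers well-behaved CUDA applications; nothing stops a job 
from resetting the variable itself, which is exactly the problem described 
below.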
>> 
>> Preventing access to the 'wrong' GPU devices by "malicious jobs" is not
>> that easy. One idea could be, e.g., to play with device permissions.
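
Purely as an illustration of that idea (device names are examples, and it 
only works cleanly when a job has the node, or at least its GPUs, to itself; 
with several GPU jobs per node it quickly gets racy):

  # prolog sketch: lock all cards, then re-open only the granted one
  chmod 000 /dev/nvidia[0-9]*   # revoke access for everyone
  chmod 666 /dev/nvidia0        # re-open the card granted to this job
  # (an epilog would have to restore the original permissions afterwards)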
>> 
>> Cheers,
>> Andreas
>> 
>> On Wed, 2019-08-14 at 10:21 -0400, Dj Merrill wrote:
>>> To date in our HPC Grid running Son of Grid Engine 8.1.9, we've had
>>> single Nvidia GPU cards per compute node.  We are contemplating the
>>> purchase of a single compute node that has multiple GPU cards in it, and
>>> want to ensure that running jobs only have access to the GPU resources
>>> they ask for, and don't take over all of the GPU cards in the system.
>>> 
>>> We define gpu as a resource:
>>> qconf -sc:
>>> #name   shortcut   type   relop   requestable   consumable   default   urgency
>>> gpu     gpu        INT    <=      YES           YES          0         0
>>> 
>>> We define GPU persistence mode and exclusive process on each node:
>>> nvidia-smi -pm 1
>>> nvidia-smi -c 3
>>> 
>>> We set the number of GPUs in the host definition:
>>> qconf -me (hostname)
>>> 
>>> complex_values   gpu=1   for our existing nodes, and this setup has been
>>> working fine for us.
>>> 
>>> With the new system, we would set:
>>> complex_values   gpu=4
>>> 
>>> 
>>> If a job is submitted asking for one GPU, will it be limited to only
>>> having access to a single GPU card on the system, or can it detect the
>>> other cards and take up all four (and if so how do we prevent that)?
>>> 
>>> Is there something like "cgroups" for gpus?
>>> 
>>> Thanks,
>>> 
>>> -Dj
>>> 
>>> 
> 


_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
