Hi DJ,

I'm not sure if SoGE supports it, but UGE has the concept of "resource
maps" (aka RSMAP) complexes which we use to assign specific hardware
resources to specific jobs. It functions sort of as a hybrid array/scalar
consumable.

It looks like this in the host complex_values configuration:

cuda=4(0-3)

This defines four CUDA-capable devices, with IDs 0-3. UGE sets SGE_HGR_cuda in
the job environment to the ID of the device assigned to the job:

$ echo "${SGE_HGR_cuda}"
0
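
A minimal sketch of how a job script might use that variable to restrict
CUDA to the assigned device(s). The helper name hgr_to_cuda_visible is
hypothetical, and the sketch assumes that when more than one device is
granted, SGE_HGR_cuda holds the IDs separated by whitespace (e.g. "0 2");
check that against your UGE version. CUDA_VISIBLE_DEVICES is the standard
CUDA runtime variable and expects a comma-separated list.

```shell
#!/bin/sh
# Hypothetical helper: convert a whitespace-separated ID list ("0 2")
# into the comma-separated form CUDA_VISIBLE_DEVICES expects ("0,2").
# The body runs in a subshell so the IFS and set -f changes do not leak.
hgr_to_cuda_visible() (
    set -f          # disable globbing before word-splitting $1
    set -- $1       # split the ID list on whitespace
    IFS=,
    printf '%s\n' "$*"   # join the IDs with commas
)

# Only touch CUDA_VISIBLE_DEVICES when the scheduler granted devices.
if [ -n "${SGE_HGR_cuda:-}" ]; then
    CUDA_VISIBLE_DEVICES=$(hgr_to_cuda_visible "$SGE_HGR_cuda")
    export CUDA_VISIBLE_DEVICES
fi
```

This only hides the other devices from well-behaved CUDA programs; a job
can still override the variable, so it is not an enforcement mechanism.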

When you look at it as a consumable, it is just an integer value, though:

n030                    lx-amd64       24    2   12   24  0.01  757.0G   11.4G    8.0G   48.0K
    Host Resource(s):      hc:cuda=3.000000

Which shows three of the four GPU devices are available for use.
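
If you want that count in a script, here is a rough sketch that pulls the
integer out of output shaped like the above. The function name free_cuda is
hypothetical, and it assumes the "hc:cuda=" line format shown here.

```shell
#!/bin/sh
# Hypothetical helper: read qhost -F style output on stdin and print the
# free cuda count as an integer (e.g. "hc:cuda=3.000000" prints 3).
free_cuda() {
    awk -F'hc:cuda=' 'NF > 1 { printf "%d\n", $2 }'
}

# Example against the sample line from this message:
printf '    Host Resource(s):      hc:cuda=3.000000\n' | free_cuda
```

On a live system you would feed it something like
"qhost -F cuda -h n030 | free_cuda" instead of the printf.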

On Wed, Aug 14, 2019 at 10:21:12AM -0400, Dj Merrill wrote:
> To date in our HPC Grid running Son of Grid Engine 8.1.9, we've had
> single Nvidia GPU cards per compute node.  We are contemplating the
> purchase of a single compute node that has multiple GPU cards in it, and
> want to ensure that running jobs only have access to the GPU resources
> they ask for, and don't take over all of the GPU cards in the system.
> 
> We define gpu as a resource:
> qconf -sc:
> #name               shortcut   type      relop   requestable consumable default  urgency
> gpu                 gpu        INT       <=      YES         YES        0        0
> 
> We define GPU persistence mode and exclusive process on each node:
> nvidia-smi -pm 1
> nvidia-smi -c 3
> 
> We set the number of GPUs in the host definition:
> qconf -me (hostname)
> 
> complex_values   gpu=1   for our existing nodes, and this setup has been
> working fine for us.
> 
> With the new system, we would set:
> complex_values   gpu=4
> 
> 
> If a job is submitted asking for one GPU, will it be limited to only
> having access to a single GPU card on the system, or can it detect the
> other cards and take up all four (and if so how do we prevent that)?
> 
> Is there something like "cgroups" for gpus?
> 
> Thanks,
> 
> -Dj
> 
> 
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users

-- 
-- Skylar Thompson (skyl...@u.washington.edu)
-- Genome Sciences Department, System Administrator
-- Foege Building S046, (206)-685-7354
-- University of Washington School of Medicine