If everything is configured correctly, GridEngine will be aware that
the GPU in node1 is in use and schedule around it, ensuring that the
8-GPU job gets only unused GPUs.
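
For reference, here is a minimal sketch of the consumable setup (values
are illustrative, adjust to your cluster). First define a "gpu" complex
via qconf -mc:

    #name  shortcut  type  relop  requestable  consumable  default  urgency
    gpu    gpu       INT   <=     YES          YES         0        0

then tell each exec host how many devices it owns and have jobs request
them:

    qconf -aattr exechost complex_values gpu=4 node1
    qsub -l gpu=1 myjob.sh

With that in place, the scheduler decrements the per-host gpu count for
every running job and only dispatches to hosts that still have a free
device.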

Ian

On Mon, Apr 14, 2014 at 1:38 PM, Ian Kaufman <ikauf...@eng.ucsd.edu> wrote:
> Look at the info presented here:
>
> http://stackoverflow.com/questions/10557816/scheduling-gpu-resources-using-the-sun-grid-engine-sge
>
> Ian
>
> On Mon, Apr 14, 2014 at 1:29 PM, Feng Zhang <prod.f...@gmail.com> wrote:
>> Thanks, Ian and Gowtham!
>>
>>
>> Those are very nice instructions. One problem I have is, for example:
>>
>> node1: number of GPUs = 4
>> node2: number of GPUs = 4
>> node3: number of GPUs = 2
>>
>> So in total I have 10 GPUs.
>>
>> Right now, user A has a serial GPU job, which takes one GPU on node1
>> (I don't know which GPU, though). So 3 GPUs on node1, 4 on node2, and
>> 2 on node3 are still free for jobs.
>>
>> I submit one job with PE=8. SGE allocates all 3 nodes to me, with 8
>> GPU slots in total. The problem now is: how does my job know which
>> GPUs it can use on node1?
>>
>> Best
>>
>>
>> On Mon, Apr 14, 2014 at 4:13 PM, Ian Kaufman <ikauf...@eng.ucsd.edu> wrote:
>>> Again, look into using it as a consumable resource as Gowtham posted above.
>>>
>>> Ian
>>>
>>> On Mon, Apr 14, 2014 at 11:57 AM, Feng Zhang <prod.f...@gmail.com> wrote:
>>>> Thanks, Reuti,
>>>>
>>>> The socket solution looks like it only works for serial jobs, not
>>>> PE jobs, right?
>>>>
>>>> Our cluster has different kinds of nodes: some have 2 GPUs each,
>>>> others have 4. Most of the user jobs are PE jobs; some are serial.
>>>>
>>>> The socket solution could even work for PE jobs, but as I
>>>> understand it, it is not efficient: since each node has, for
>>>> example, 4 queues, a user who submits a PE job to one queue cannot
>>>> use the GPUs in the other queues, right?
>>>>
>>>> On Mon, Apr 14, 2014 at 2:16 PM, Reuti <re...@staff.uni-marburg.de> wrote:
>>>>> On 14.04.2014 at 20:06, Feng Zhang wrote:
>>>>>
>>>>>> Thanks, Ian!
>>>>>>
>>>>>> I haven't checked the GPU load sensor in detail, either. It sounds
>>>>>> to me like it only tracks the number of GPUs allocated to a job;
>>>>>> the job doesn't know which GPUs it actually got, so it cannot set
>>>>>> CUDA_VISIBLE_DEVICES (some programs need this environment variable
>>>>>> to be set). This can be done by writing some scripts/programs (a
>>>>>> rough sketch below), but to me it is not a watertight solution,
>>>>>> since jobs may still happen to collide with each other on the same
>>>>>> GPU of a multi-GPU node. If GE kept a record of which GPUs are
>>>>>> allocated to each job, that would be perfect.
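>>>>>> Something along these lines is what I mean (just a sketch; the
>>>>>> lock directory name is made up):
>>>>>>
>>>>>>     # claim the first free device via an atomic mkdir lock
>>>>>>     for d in 0 1 2 3; do
>>>>>>         if mkdir "/tmp/gpu_lock_$d" 2>/dev/null; then
>>>>>>             export CUDA_VISIBLE_DEVICES=$d
>>>>>>             break
>>>>>>         fi
>>>>>>     done
>>>>>>
>>>>>> but a job that bypasses such a wrapper can still grab the same
>>>>>> device behind the scheduler's back.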
>>>>>
>>>>> Like the option to request sockets instead of cores, which I posted
>>>>> in the last couple of days, you can use a similar approach and take
>>>>> the number of the granted GPU from the queue name.
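>>>>> For example, with one single-slot queue instance per device, named
>>>>> gpu0.q, gpu1.q, ... (names purely for illustration), the job script
>>>>> can derive the device number from the queue name:
>>>>>
>>>>>     # SGE exports the name of the granted queue in $QUEUE
>>>>>     dev=$(echo "$QUEUE" | sed 's/^gpu//; s/\.q$//')
>>>>>     export CUDA_VISIBLE_DEVICES=$dev
>>>>>
>>>>> A PE job can read the queue name granted on each host from
>>>>> $PE_HOSTFILE in the same way.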
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>>> On Mon, Apr 14, 2014 at 1:46 PM, Ian Kaufman <ikauf...@eng.ucsd.edu> 
>>>>>> wrote:
>>>>>>> I believe there already is support for GPUs: there is a GPU Load
>>>>>>> Sensor in Open Grid Engine. You may have to build it yourself; I
>>>>>>> haven't checked whether it comes pre-packaged.
>>>>>>>
>>>>>>> Univa has Phi support, and I believe OGE/OGS has it as well, or at
>>>>>>> least has been working on it.
>>>>>>>
>>>>>>> Ian
>>>>>>>
>>>>>>> On Mon, Apr 14, 2014 at 10:35 AM, Feng Zhang <prod.f...@gmail.com> 
>>>>>>> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Is there any plan to implement GPU resource management in SGE in
>>>>>>>> the near future, like Slurm or Torque have? There are some ways
>>>>>>>> to do this using scripts/programs, but I wonder whether SGE
>>>>>>>> itself could recognize and manage GPUs (and Phi). It doesn't need
>>>>>>>> to be complicated or powerful, just do the basic work.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>



-- 
Ian Kaufman
Research Systems Administrator
UC San Diego, Jacobs School of Engineering ikaufman AT ucsd DOT edu