On Wed, 14 Aug 2019 at 7:21am, Dj Merrill wrote:
To date in our HPC Grid running Son of Grid Engine 8.1.9, we've had single Nvidia GPU cards per compute node. We are contemplating the purchase of a single compute node that has multiple GPU cards in it, and want to ensure that running jobs only have access to the GPU resources they ask for, and don't take over all of the GPU cards in the system.
We use epilog and prolog scripts based on <https://github.com/kyamagu/sge-gpuprolog> to assign GPUs to jobs. It's (obviously) up to the users' scripts to honor the assignments, but it's been working for us so far.
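The lock-file approach those scripts use can be sketched roughly as below. This is a minimal, untested illustration, not the actual sge-gpuprolog code: the lock directory, GPU counts, and variable names are all placeholders, and a real prolog would read the job's gpu consumable request from SGE rather than an environment variable.

```shell
#!/bin/sh
# Hypothetical prolog sketch: claim $NGPUS free GPUs on this node via
# lock directories, then report the assignment for the job to honor.
LOCKDIR=${LOCKDIR:-$(mktemp -d)}   # per-node lock area (illustrative)
NUM_GPUS=${NUM_GPUS:-4}            # GPU cards physically in the node
NGPUS=${NGPUS:-1}                  # GPUs this job requested

mkdir -p "$LOCKDIR"
assigned=""
i=0
while [ "$i" -lt "$NUM_GPUS" ]; do
    # mkdir is atomic, so it doubles as a per-GPU lock; the matching
    # epilog would rmdir these to release the GPUs.
    if mkdir "$LOCKDIR/gpu$i" 2>/dev/null; then
        assigned="${assigned:+$assigned,}$i"
        n=$((${n:-0} + 1))
        [ "$n" -eq "$NGPUS" ] && break
    fi
    i=$((i + 1))
done

# The user's job script is expected to export this, e.g. as
# CUDA_VISIBLE_DEVICES, so the job only sees its assigned cards.
echo "CUDA_VISIBLE_DEVICES=$assigned"
```

The key point is that SGE itself only counts GPUs; the prolog's lock files turn that count into a concrete device assignment, and enforcement still depends on jobs respecting CUDA_VISIBLE_DEVICES.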
We define gpu as a resource. From "qconf -sc":

    #name  shortcut  type  relop  requestable  consumable  default  urgency
    gpu    gpu       INT   <=     YES          YES         0        0
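With that complex defined (and a gpu count set on each exec host), jobs request GPUs as a consumable. A quick sketch of the usage, with the script name purely illustrative:

```shell
# Request 2 GPUs; SGE decrements the node's gpu consumable by 2
# for the lifetime of the job.
qsub -l gpu=2 train.sh

# Show gpu availability across execution hosts.
qhost -F gpu
```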
We *used* to run this way until we ran into what seems like a bug in SoGE 8.1.9. See <http://gridengine.org/pipermail/users/2018-April/010116.html> and the ensuing thread for details, but the summary is that SGE would insist on trying to run a job on a particular node, even if there were free GPUs elsewhere. It was happening so often that we had to change our approach, and defined a queue on each GPU node with the same number of slots as GPUs. It's a far from perfect system, but it's working for now.
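That workaround can be sketched as follows; the queue and host names are illustrative, and the slot count has to be set by hand to match each node's card count:

```shell
# Per-GPU-node queue whose slot count equals the node's GPU count,
# so the scheduler can't start more jobs there than there are cards.
qconf -aq gpu4.q        # opens an editor; set at least:
#   hostlist   gpunode01
#   slots      4         # = number of GPU cards in gpunode01

# Jobs then target the queue instead of a gpu consumable:
qsub -q gpu4.q job.sh
```

The obvious cost is that one slot equals one GPU, so multi-GPU jobs and CPU-only jobs on those nodes need their own arrangements.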
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users