Here's how we handle this at our site:
Create a separate partition named debug that also contains that node.
Give the debug partition a very short time limit, say 30-60 minutes:
long enough for debugging, but too short to do any real work. Make the
priority of the debug partition much higher than that of the regular
partition.
With that set up, they may not get a GPU right away, but their job
should go to the head of the queue, so as soon as a GPU becomes
available, their job will get it.
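
As a rough illustration, a slurm.conf sketch along those lines might
look like the following (the node and partition names are made up, and
you'd adjust MaxTime and the PriorityTier values to suit your site):

NodeName=gpu01 CPUs=32 Gres=gpu:8
PartitionName=normal Nodes=gpu01 Default=YES MaxTime=INFINITE PriorityTier=1
PartitionName=debug  Nodes=gpu01 Default=NO  MaxTime=01:00:00 PriorityTier=10

Debug jobs would then be submitted with -p debug --gres=gpu:1, and
because they sit in the higher-priority partition, the scheduler should
consider them ahead of the backlog in the normal partition when the
next GPU frees up.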
--
Prentice
On 4/24/19 11:06 AM, Mike Cammilleri wrote:
Hi everyone,
We have a single node with 8 GPUs. Users often pile up lots of pending
jobs and use all 8 at the same time, so a user who just wants to run a
short debug job has to wait too long for a GPU to free up. Is there a
way with gres.conf or a QOS to limit the number of GPUs in concurrent
use for all users? Most jobs submitted are single jobs, so they request
a GPU with --gres=gpu:1 but submit many (no array). Our gres.conf looks
like the following:
Name=gpu File=/dev/nvidia0 #CPUs=0,1,2,3
Name=gpu File=/dev/nvidia1 #CPUs=4,5,6,7
Name=gpu File=/dev/nvidia2 #CPUs=8,9,10,11
Name=gpu File=/dev/nvidia3 #CPUs=12,13,14,15
Name=gpu File=/dev/nvidia4 #CPUs=16,17,18,19
Name=gpu File=/dev/nvidia5 #CPUs=20,21,22,23
Name=gpu File=/dev/nvidia6 #CPUs=24,25,26,27
Name=gpu File=/dev/nvidia7 #CPUs=28,29,30,31
I thought of insisting that they submit their jobs as an array and
throttle it with %7, but maybe there's a more elegant solution in the
config. Any tips appreciated.
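
(For concreteness, the throttle I have in mind is the job-array syntax
that caps how many array tasks run at once, e.g.

sbatch --array=0-99%7 --gres=gpu:1 run_job.sh

where run_job.sh is just a placeholder for the user's script. That
would hold any one array to 7 GPUs at a time and leave one free, but it
only works if users cooperate and bundle their jobs into arrays.)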
Mike Cammilleri
Systems Administrator
Department of Statistics | UW-Madison
1300 University Ave | Room 1280
608-263-6673 | mi...@stat.wisc.edu