Here's how we handle this here:

Create a separate partition named debug that also contains that node. Give the debug partition a very short time limit, say 30-60 minutes: long enough for debugging, but too short to do any real work. Make the priority of the debug partition much higher than that of the regular partition. With that set up, a debug job may not get a GPU right away, but it will go to the head of the queue, so as soon as a GPU frees up, that job gets it.
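
For reference, here's a minimal sketch of what that could look like in slurm.conf (the node and partition names are just placeholders for your own):

# Both partitions contain the GPU node; debug gets a 30-minute limit
# and a higher PriorityTier so its jobs are considered first.
PartitionName=gpu   Nodes=gpu-node01 Default=YES MaxTime=7-00:00:00 PriorityTier=1 State=UP
PartitionName=debug Nodes=gpu-node01 MaxTime=00:30:00 PriorityTier=10 State=UP

Users then submit their short runs with -p debug.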


--
Prentice


On 4/24/19 11:06 AM, Mike Cammilleri wrote:
Hi everyone,

We have a single node with 8 GPUs. Users often pile up lots of pending jobs and use all 8 at the same time, so a user who just wants to run a short debugging job and needs one of the GPUs has to wait too long for one to free up. Is there a way with gres.conf or QOS to limit the number of concurrent GPUs in use for all users? Most jobs are submitted individually, each requesting a GPU with --gres=gpu:1, but users submit many of them (no array). Our gres.conf looks like the following:

Name=gpu File=/dev/nvidia0 #CPUs=0,1,2,3
Name=gpu File=/dev/nvidia1 #CPUs=4,5,6,7
Name=gpu File=/dev/nvidia2 #CPUs=8,9,10,11
Name=gpu File=/dev/nvidia3 #CPUs=12,13,14,15
Name=gpu File=/dev/nvidia4 #CPUs=16,17,18,19
Name=gpu File=/dev/nvidia5 #CPUs=20,21,22,23
Name=gpu File=/dev/nvidia6 #CPUs=24,25,26,27
Name=gpu File=/dev/nvidia7 #CPUs=28,29,30,31

I thought of insisting that they submit the jobs as an array and throttle it with %7, but maybe there's a more elegant solution in the config.
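
(For example, something along these lines, where the array size and script name are just illustrative:

sbatch --gres=gpu:1 --array=1-100%7 run_job.sh

which would keep at most 7 of the array tasks running at once.)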

Any tips appreciated.

Mike Cammilleri

Systems Administrator

Department of Statistics | UW-Madison

1300 University Ave | Room 1280
608-263-6673 | mi...@stat.wisc.edu
