Here's how we handle this here:

Create a separate partition named debug that also contains that node. Give the debug partition a very short time limit, say 30-60 minutes: long enough for debugging, but too short to do any real work. Make the priority of the debug partition much higher than that of the regular partition. With that set up, a debug job may not get a GPU right away, but it will go to the head of the queue, so as soon as a GPU frees up, that job gets it.
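
For reference, here's a minimal sketch of what that could look like in slurm.conf (the node and partition names are just placeholders for your own):

# Both partitions contain the GPU node; debug gets a 30-minute limit
# and a higher PriorityTier so its jobs are considered first.
PartitionName=gpu   Nodes=gpu-node01 Default=YES MaxTime=7-00:00:00 PriorityTier=1 State=UP
PartitionName=debug Nodes=gpu-node01 MaxTime=00:30:00 PriorityTier=10 State=UP

Users then submit their short runs with -p debug.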


--
Prentice


On 4/24/19 11:06 AM, Mike Cammilleri wrote:
Hi everyone,

We have a single node with 8 GPUs. Users often pile up lots of pending jobs and use all 8 at the same time, so a user who just wants to run a short debugging job and needs one of the GPUs has to wait too long for one to free up. Is there a way with gres.conf or QOS to limit the number of concurrent GPUs in use for all users? Most jobs are submitted individually, each requesting a GPU with --gres=gpu:1, but users submit many of them (no array). Our gres.conf looks like the following:

Name=gpu File=/dev/nvidia0 #CPUs=0,1,2,3
Name=gpu File=/dev/nvidia1 #CPUs=4,5,6,7
Name=gpu File=/dev/nvidia2 #CPUs=8,9,10,11
Name=gpu File=/dev/nvidia3 #CPUs=12,13,14,15
Name=gpu File=/dev/nvidia4 #CPUs=16,17,18,19
Name=gpu File=/dev/nvidia5 #CPUs=20,21,22,23
Name=gpu File=/dev/nvidia6 #CPUs=24,25,26,27
Name=gpu File=/dev/nvidia7 #CPUs=28,29,30,31

I thought of insisting that they submit the jobs as an array and throttle it with %7, but maybe there's a more elegant solution in the config.
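
(For example, something along these lines, where the array size and script name are just illustrative:

sbatch --gres=gpu:1 --array=1-100%7 run_job.sh

which would keep at most 7 of the array tasks running at once.)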

Any tips appreciated.

Mike Cammilleri

Systems Administrator

Department of Statistics | UW-Madison

1300 University Ave | Room 1280
608-263-6673 | mi...@stat.wisc.edu
