Hi everyone,

We have a single node with 8 GPUs. Users often queue up many pending jobs and 
end up using all 8 at the same time, so a user who just wants to run a short 
debug job on one GPU has to wait too long for one to free up. Is there a way, 
with gres.conf or a QOS, to limit the number of GPUs in concurrent use for all 
users? Most submissions are individual jobs: each requests one GPU with 
--gres=gpu:1, but users submit many of them (not as an array). Our gres.conf 
looks like this:

Name=gpu File=/dev/nvidia0 #CPUs=0,1,2,3
Name=gpu File=/dev/nvidia1 #CPUs=4,5,6,7
Name=gpu File=/dev/nvidia2 #CPUs=8,9,10,11
Name=gpu File=/dev/nvidia3 #CPUs=12,13,14,15
Name=gpu File=/dev/nvidia4 #CPUs=16,17,18,19
Name=gpu File=/dev/nvidia5 #CPUs=20,21,22,23
Name=gpu File=/dev/nvidia6 #CPUs=24,25,26,27
Name=gpu File=/dev/nvidia7 #CPUs=28,29,30,31

I thought of insisting that they submit their jobs as an array and throttle it 
with %7, but maybe there's a more elegant solution using the config.
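
For reference, the array approach I have in mind would look roughly like the 
sketch below. The task range 0-99 and the echo placeholder are made up for 
illustration; only --gres=gpu:1 and the %7 throttle reflect our actual setup.

```shell
#!/bin/bash
#SBATCH --gres=gpu:1        # one GPU per array task, as users request today
#SBATCH --array=0-99%7      # hypothetical range; %7 caps this array at 7 running tasks

# Slurm sets SLURM_ARRAY_TASK_ID for each array task; default to 0 so the
# script can also be run by hand outside the scheduler.
task_id="${SLURM_ARRAY_TASK_ID:-0}"

# Placeholder for the user's real command (hypothetical).
echo "running array task ${task_id}"
```

Note that the %7 throttle applies per array, so a user who submits several 
arrays could still occupy all 8 GPUs.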

Any tips appreciated.


Mike Cammilleri

Systems Administrator

Department of Statistics | UW-Madison

1300 University Ave | Room 1280
608-263-6673 | mi...@stat.wisc.edu
