We put a ‘gpu’ QOS on all our GPU partitions and use MaxJobsPerUser to limit 
each user to 8 running jobs (our GPU capacity). Jobs beyond that limit stay 
pending, so other users' jobs can be scheduled ahead of them.

# sacctmgr show qos gpu format=name,maxjobspu
      Name MaxJobsPU
---------- ---------
       gpu         8
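
A minimal sketch of setting this up, assuming accounting is already enabled 
and that your GPU partition is named "gpu" in slurm.conf (both names here are 
illustrative, adjust to your site):

```shell
# Create the QOS and cap running jobs per user at 8 (our GPU count)
sacctmgr add qos gpu
sacctmgr modify qos gpu set MaxJobsPerUser=8

# Attach the QOS to the GPU partition in slurm.conf, e.g.:
#   PartitionName=gpu ... QOS=gpu
# then tell slurmctld to re-read the config:
scontrol reconfigure
```

With the QOS attached to the partition, the limit applies to every job 
submitted there without users having to request the QOS themselves.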

-- 
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601     / Tennessee Tech University

> On Apr 24, 2019, at 10:06 AM, Mike Cammilleri <mi...@stat.wisc.edu> wrote:
> 
> Hi everyone,
> 
> We have a single node with 8 gpus. Users often pile up lots of pending jobs 
> and use all 8 at the same time, so a user who just wants to run a short 
> debug job on one of the gpus has to wait too long for a gpu to free up. Is 
> there a way with gres.conf or qos to limit the 
> number of concurrent gpus in use for all users? Most jobs submitted are 
> single jobs, so they request a gpu with --gres=gpu:1 but submit many (no 
> array), and our gres.conf looks like the following
> 
> Name=gpu File=/dev/nvidia0 #CPUs=0,1,2,3
> Name=gpu File=/dev/nvidia1 #CPUs=4,5,6,7
> Name=gpu File=/dev/nvidia2 #CPUs=8,9,10,11
> Name=gpu File=/dev/nvidia3 #CPUs=12,13,14,15
> Name=gpu File=/dev/nvidia4 #CPUs=16,17,18,19
> Name=gpu File=/dev/nvidia5 #CPUs=20,21,22,23
> Name=gpu File=/dev/nvidia6 #CPUs=24,25,26,27
> Name=gpu File=/dev/nvidia7 #CPUs=28,29,30,31
> 
> I thought of insisting that they submit the jobs as an array and limit with 
> %7, but maybe there's a more elegant solution using the config.
> 
> Any tips appreciated.
> 
> Mike Cammilleri
> Systems Administrator
> Department of Statistics | UW-Madison
> 1300 University Ave | Room 1280
> 608-263-6673 | mi...@stat.wisc.edu
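
For reference, the array-throttle workaround Mike mentions would look like 
this (the script name and array size are illustrative):

```shell
# Submit 50 single-GPU tasks, but let at most 7 run at once,
# leaving one GPU free for short debug jobs
sbatch --array=0-49%7 --gres=gpu:1 train.sh
```

That works per submission, but it relies on every user remembering to throttle 
themselves, which is why a QOS limit enforced by the scheduler is preferable.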
