On Friday, 25 October 2024 22:49:16 CET Kevin M. Hildebrand via slurm-users wrote:
> We have a 'gpu' partition with 30 or so nodes, some with A100s, some with
> H100s, and a few others.
> It appears that when (for example) all of the A100 GPUs are in use, if
> there are additional jobs requesting A100 GPUs pending, and those jobs have
> the highest priority in the partition, then jobs submitted for H100s won't
> run even if there are idle H100s. This is a small subset of our present
> pending queue - the four bottom jobs should be running, but aren't. The top
> pending job shows reason 'Resources' while the rest all show 'Priority'.
> Any thoughts on why this might be happening?
>
> JOBID    PRIORITY  TRES_ALLOC
> 8317749  501490    cpu=4,mem=80000M,node=1,billing=48,gres/gpu=1,gres/gpu:a100=1
> 8317750  501490    cpu=4,mem=80000M,node=1,billing=48,gres/gpu=1,gres/gpu:a100=1
> 8317745  501490    cpu=4,mem=80000M,node=1,billing=48,gres/gpu=1,gres/gpu:a100=1
> 8317746  501490    cpu=4,mem=80000M,node=1,billing=48,gres/gpu=1,gres/gpu:a100=1
> 8338679  500060    cpu=4,mem=64G,node=1,billing=144,gres/gpu=1,gres/gpu:h100=1
> 8338678  500060    cpu=4,mem=64G,node=1,billing=144,gres/gpu=1,gres/gpu:h100=1
> 8338677  500060    cpu=4,mem=64G,node=1,billing=144,gres/gpu=1,gres/gpu:h100=1
> 8338676  500060    cpu=4,mem=64G,node=1,billing=144,gres/gpu=1,gres/gpu:h100=1
Do you have backfill scheduling configured with bf_continue?

regards
Markus Köberl
--
Markus Koeberl
Graz University of Technology
Signal Processing and Speech Communication Laboratory
E-mail: markus.koeb...@tugraz.at
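For reference, backfill is enabled via SchedulerType and tuned through SchedulerParameters in slurm.conf. A minimal sketch of the settings Markus is asking about (the parameter values here are illustrative, not the poster's actual configuration):

```
# slurm.conf -- illustrative backfill settings, values are examples only
SchedulerType=sched/backfill
# bf_continue lets the backfill scheduler resume scanning the queue after it
# yields its locks, so lower-priority jobs deep in the queue (like the H100
# jobs above) are still considered instead of the scan restarting from the top.
# bf_max_job_test raises the number of jobs examined per backfill cycle.
SchedulerParameters=bf_continue,bf_max_job_test=1000
```

Without bf_continue (or with a low bf_max_job_test), the backfill pass can stop before ever reaching jobs queued behind a block of high-priority pending work, which matches the symptom described: idle H100 nodes while lower-priority H100 jobs sit in 'Priority'.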