On Friday, 25 October 2024 22:49:16 CET Kevin M. Hildebrand via slurm-users 
wrote:
> We have a 'gpu' partition with 30 or so nodes, some with A100s, some with
> H100s, and a few others.
> It appears that when (for example) all of the A100 GPUs are in use, if
> there are additional jobs requesting A100 GPUs pending, and those jobs have
> the highest priority in the partition, then jobs submitted for H100s won't
> run even if there are idle H100s.  This is a small subset of our present
> pending queue: the four bottom jobs should be running, but aren't.  The top
> pending job shows reason 'Resources' while the rest all show 'Priority'.
> Any thoughts on why this might be happening?
> 
> JOBID     PRIORITY  TRES_ALLOC
> 8317749   501490    cpu=4,mem=80000M,node=1,billing=48,gres/gpu=1,gres/gpu:a100=1
> 8317750   501490    cpu=4,mem=80000M,node=1,billing=48,gres/gpu=1,gres/gpu:a100=1
> 8317745   501490    cpu=4,mem=80000M,node=1,billing=48,gres/gpu=1,gres/gpu:a100=1
> 8317746   501490    cpu=4,mem=80000M,node=1,billing=48,gres/gpu=1,gres/gpu:a100=1
> 8338679   500060    cpu=4,mem=64G,node=1,billing=144,gres/gpu=1,gres/gpu:h100=1
> 8338678   500060    cpu=4,mem=64G,node=1,billing=144,gres/gpu=1,gres/gpu:h100=1
> 8338677   500060    cpu=4,mem=64G,node=1,billing=144,gres/gpu=1,gres/gpu:h100=1
> 8338676   500060    cpu=4,mem=64G,node=1,billing=144,gres/gpu=1,gres/gpu:h100=1


Do you have Backfill Scheduling configured with bf_continue?
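
For reference, a minimal sketch of what that could look like in slurm.conf.
This assumes the backfill plugin is the scheduler in use; any other values
your site already has in SchedulerParameters are not shown here and would
need to be kept in the same comma-separated list:

    # slurm.conf (sketch only, not a complete scheduler configuration)
    SchedulerType=sched/backfill
    # bf_continue: let the backfill scheduler keep working through its
    # job list after it periodically releases locks
    SchedulerParameters=bf_continue

The values currently in effect should be visible with
"scontrol show config | grep -i sched".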


regards
Markus Köberl
-- 
Markus Koeberl
Graz University of Technology
Signal Processing and Speech Communication Laboratory
E-mail: markus.koeb...@tugraz.at

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com