Thanks for the quick response, Ryan!

Are there any recommended bf_* options from
https://slurm.schedmd.com/sched_config.html that could help with this?
bf_continue? Lowering bf_interval= below its default of 30 seconds?
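
If it helps to be concrete, this is roughly the slurm.conf change I had in
mind (the bf_interval value here is only an illustration, not a tested
recommendation):

SchedulerParameters=bf_continue,bf_interval=10

followed by "scontrol reconfigure" to pick up the change.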

On Tue, Jun 4, 2024 at 4:13 PM Ryan Novosielski <novos...@rutgers.edu>
wrote:

> This is relatively true of my system as well, and I believe it’s that the
> backfill scheduler is slower than the main scheduler.
>
> --
> #BlackLivesMatter
> ____
> || \\UTGERS,     |---------------------------*O*---------------------------
> ||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
>      `'
>
> On Jun 4, 2024, at 16:03, Robert Kudyba via slurm-users <
> slurm-users@lists.schedmd.com> wrote:
>
> At the moment we have 2 nodes where jobs are seeing long wait times.
> Generally this happens when the nodes are fully allocated. What other
> reasons would there be for a job to wait so long to start if there is
> still enough memory and CPU available? Slurm version is 23.02.4 via Bright
> Computing. Note the compute nodes have hyperthreading enabled, but that
> should be irrelevant. Is there a way to determine what else could be holding jobs up?
>
> srun --pty  -t 0-01:00:00 --nodelist=node001 --gres=gpu:1 -A ourts -p
> short /bin/bash
> srun: job 672204 queued and waiting for resources
>
>  scontrol show node node001
> NodeName=node001 Arch=x86_64 CoresPerSocket=48
>    CPUAlloc=24 CPUEfctv=192 CPUTot=192 CPULoad=20.37
>    AvailableFeatures=location=local
>    ActiveFeatures=location=local
>    Gres=gpu:A6000:8
>    NodeAddr=node001 NodeHostName=node001 Version=23.02.4
>    OS=Linux 5.14.0-70.13.1.el9_0.x86_64 #1 SMP PREEMPT Thu Apr 14 12:42:38
> EDT 2022
>    RealMemory=1031883 AllocMem=1028096 FreeMem=222528 Sockets=2 Boards=1
>    State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>    Partitions=ours,short
>    BootTime=2024-04-29T16:18:30 SlurmdStartTime=2024-05-18T16:48:11
>    LastBusyTime=2024-06-03T10:49:49 ResumeAfterTime=None
>    CfgTRES=cpu=192,mem=1031883M,billing=192,gres/gpu=8
>    AllocTRES=cpu=24,mem=1004G,gres/gpu=2,gres/gpu:a6000=2
>    CapWatts=n/a
>    CurrentWatts=0 AveWatts=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> grep 672204 /var/log/slurmctld
> [2024-06-04T15:50:35.627] sched: _slurm_rpc_allocate_resources
> JobId=672204 NodeList=(null) usec=852
>
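P.S. For reference, something along these lines should show the job's pending
reason and the backfill cycle statistics (using the job ID from the example
above):

scontrol show job 672204 | grep -i reason
sdiag | grep -i -A 15 backfill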
-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
