Thanks for the quick response, Ryan! Are there any recommendations for the bf_* options from https://slurm.schedmd.com/sched_config.html that could help here? Enabling bf_continue? Decreasing bf_interval= below its default of 30 seconds?
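For example, would something along these lines in slurm.conf be a reasonable starting point? (The values below are just a guess on my part, not something we have tested yet.)

SchedulerParameters=bf_continue,bf_interval=10,bf_max_job_test=1000

I was planning to compare the backfilling stats section of sdiag before and after the change (last/mean cycle time, total backfilled jobs) to see whether it actually helps.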
On Tue, Jun 4, 2024 at 4:13 PM Ryan Novosielski <novos...@rutgers.edu> wrote:

> This is relatively true of my system as well, and I believe it's that the
> backfill scheduler is slower than the main scheduler.
>
> --
> #BlackLivesMatter
> ____
> || \\UTGERS,      |---------------------------*O*---------------------------
> ||_// the State   | Ryan Novosielski - novos...@rutgers.edu
> || \\ University  | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\    of NJ   | Office of Advanced Research Computing - MSB A555B, Newark
>      `'
>
> On Jun 4, 2024, at 16:03, Robert Kudyba via slurm-users <slurm-users@lists.schedmd.com> wrote:
>
> At the moment we have 2 nodes that are having long wait times. Generally
> this happens when the nodes are fully allocated. If there is still enough
> memory and CPU available, what other reasons could cause a job to wait so
> long? Slurm version is 23.02.4 via Bright Computing. Note that the compute
> nodes have hyperthreading enabled, but that should be irrelevant. Is there
> a way to determine what else could be holding jobs up?
>
> srun --pty -t 0-01:00:00 --nodelist=node001 --gres=gpu:1 -A ourts -p short /bin/bash
> srun: job 672204 queued and waiting for resources
>
> scontrol show node node001
> NodeName=m001 Arch=x86_64 CoresPerSocket=48
>    CPUAlloc=24 CPUEfctv=192 CPUTot=192 CPULoad=20.37
>    AvailableFeatures=location=local
>    ActiveFeatures=location=local
>    Gres=gpu:A6000:8
>    NodeAddr=node001 NodeHostName=node001 Version=23.02.4
>    OS=Linux 5.14.0-70.13.1.el9_0.x86_64 #1 SMP PREEMPT Thu Apr 14 12:42:38 EDT 2022
>    RealMemory=1031883 AllocMem=1028096 FreeMem=222528 Sockets=2 Boards=1
>    State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>    Partitions=ours,short
>    BootTime=2024-04-29T16:18:30 SlurmdStartTime=2024-05-18T16:48:11
>    LastBusyTime=2024-06-03T10:49:49 ResumeAfterTime=None
>    CfgTRES=cpu=192,mem=1031883M,billing=192,gres/gpu=8
>    AllocTRES=cpu=24,mem=1004G,gres/gpu=2,gres/gpu:a6000=2
>    CapWatts=n/a
>    CurrentWatts=0 AveWatts=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> grep 672204 /var/log/slurmctld
> [2024-06-04T15:50:35.627] sched: _slurm_rpc_allocate_resources JobId=672204 NodeList=(null) usec=852
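PS for anyone else following the thread: the reason Slurm assigns to a waiting job can be pulled with something like the following (672204 is just the job from the example above):

squeue -j 672204 -o "%i %T %r"
scontrol show job 672204 | grep -E "JobState|Reason"

Whether the reason shows up as Resources, Priority, or a QOS/limit would help narrow down where the delay is coming from.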