> Your bf_window may be too small. From 'man slurm.conf':
>
>     bf_window=#
>         The number of minutes into the future to look when considering
>         jobs to schedule. Higher values result in more overhead and
>         less responsiveness. A value at least as long as the highest
>         allowed time limit is generally advisable to prevent job
>         starvation. In order to limit the amount of data managed by
>         the backfill scheduler, if the value of bf_window is increased,
>         then it is generally advisable to also increase bf_resolution.
>         This option applies only to SchedulerType=sched/backfill.
>         Default: 1440 (1 day), Min: 1, Max: 43200 (30 days).

So, since our longest allowed time limit is 5 days, should bf_window be set
to 7200? What should bf_resolution be set to then? But how does this
actually affect or improve wait times?
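In other words, is something like the following sketch roughly what is being
suggested for slurm.conf? The bf_resolution value below is only an
illustrative guess on my part, not a recommendation (7200 minutes = 5 days;
bf_resolution is in seconds):

    # Backfill scheduler settings -- sketch only, values still to be tuned
    SchedulerType=sched/backfill
    # Look 7200 minutes (5 days) ahead, i.e. at least as far as the longest
    # allowed time limit; bf_resolution=600 is an assumed value, raised from
    # the 60-second default to limit the data the backfill scheduler tracks
    SchedulerParameters=bf_window=7200,bf_resolution=600

I assume an 'scontrol reconfigure' would be enough to pick up a change to
SchedulerParameters without restarting slurmctld.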
> On Tue, Jun 4, 2024 at 4:13 PM Ryan Novosielski <novos...@rutgers.edu> wrote:
> >
> > This is relatively true of my system as well, and I believe it's that
> > the backfill scheduler is slower than the main scheduler.
> >
> > --
> > #BlackLivesMatter
> >     ____
> > || \\UTGERS,      |---------------------------*O*---------------------------
> > ||_// the State   |    Ryan Novosielski - novos...@rutgers.edu
> > || \\ University  | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> > ||  \\    of NJ   | Office of Advanced Research Computing - MSB A555B, Newark
> >      `'
> >
> > On Jun 4, 2024, at 16:03, Robert Kudyba via slurm-users
> > <slurm-users@lists.schedmd.com> wrote:
> >
> > At the moment we have 2 nodes that are having long wait times.
> > Generally this is when the nodes are fully allocated. What would be the
> > other reasons, if there is still enough memory and CPU available, that
> > a job would take so long? Slurm version is 23.02.4 via Bright
> > Computing. Note the compute nodes have hyperthreading enabled, but that
> > should be irrelevant. Is there a way to determine what else could be
> > holding jobs up?
> >
> > srun --pty -t 0-01:00:00 --nodelist=node001 --gres=gpu:1 -A ourts -p short /bin/bash
> > srun: job 672204 queued and waiting for resources
> >
> > scontrol show node node001
> > NodeName=m001 Arch=x86_64 CoresPerSocket=48
> >    CPUAlloc=24 CPUEfctv=192 CPUTot=192 CPULoad=20.37
> >    AvailableFeatures=location=local
> >    ActiveFeatures=location=local
> >    Gres=gpu:A6000:8
> >    NodeAddr=node001 NodeHostName=node001 Version=23.02.4
> >    OS=Linux 5.14.0-70.13.1.el9_0.x86_64 #1 SMP PREEMPT Thu Apr 14 12:42:38 EDT 2022
> >    RealMemory=1031883 AllocMem=1028096 FreeMem=222528 Sockets=2 Boards=1
> >    State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> >    Partitions=ours,short
> >    BootTime=2024-04-29T16:18:30 SlurmdStartTime=2024-05-18T16:48:11
> >    LastBusyTime=2024-06-03T10:49:49 ResumeAfterTime=None
> >    CfgTRES=cpu=192,mem=1031883M,billing=192,gres/gpu=8
> >    AllocTRES=cpu=24,mem=1004G,gres/gpu=2,gres/gpu:a6000=2
> >    CapWatts=n/a
> >    CurrentWatts=0 AveWatts=0
> >    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> >
> > grep 672204 /var/log/slurmctld
> > [2024-06-04T15:50:35.627] sched: _slurm_rpc_allocate_resources JobId=672204 NodeList=(null) usec=852
>
> --
> Dr. Loris Bennett (Herr/Mr)
> FUB-IT (ex-ZEDAT), Freie Universität Berlin
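As an aside, for the question in the quoted post about determining what is
holding jobs up, this is roughly what I would run to see why a particular
job is still pending and how the backfill scheduler is behaving. Job 672204
is just the example from the quoted output, and exact field names may differ
slightly between Slurm versions:

    # Pending reason and priority for the waiting job
    squeue -j 672204 -O jobid,statecompact,reason,prioritylong

    # Full job record, including Reason= and the requested TRES
    scontrol show job 672204 | grep -E 'Reason|Priority|TRES'

    # How the job's priority is composed (age, fairshare, partition, ...)
    sprio -j 672204

    # Backfill scheduler statistics: cycle times, depth, time of last cycle
    sdiag | grep -A 20 'Backfilling stats'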
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com