>
>
> Your bf_window may be too small.  From 'man slurm.conf':
>
>   bf_window=#
>
>          The number of minutes into the future to look when considering
>          jobs to schedule.  Higher values result in more overhead and
>          less responsiveness.  A value at least as long as the highest
>          allowed time limit is generally advisable to prevent job
>          starvation.  In order to limit the amount of data managed by
>          the backfill scheduler, if the value of bf_window is increased,
>          then it is generally advisable to also increase bf_resolution.
>          This option applies only to SchedulerType=sched/backfill.
>          Default: 1440 (1 day), Min: 1, Max: 43200 (30 days).
>

So, since our longest allowed time limit is 5 days, should bf_window=7200?
And what would bf_resolution then need to be increased to? Something along
the lines of the sketch below, perhaps?
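
A minimal sketch of what I have in mind for slurm.conf (bf_window=7200
matches a 5-day limit; the bf_resolution=600 value is just my guess, not
something the man page recommends):

  SchedulerType=sched/backfill
  SchedulerParameters=bf_window=7200,bf_resolution=600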

But how does increasing these values actually affect or improve wait times?
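
(I'm assuming the "Backfilling stats" section of sdiag is the place to watch
to see whether a change like this helps, e.g. something like:

  sdiag | grep -A 15 'Backfilling stats'

but please correct me if that's not the right metric.)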



>
> >  On Tue, Jun 4, 2024 at 4:13 PM Ryan Novosielski <novos...@rutgers.edu> wrote:
> >
> >  This is relatively true of my system as well, and I believe it’s that
> >  the backfill scheduler is slower than the main scheduler.
> >
> >  --
> >  #BlackLivesMatter
> >  ____
> >  || \\UTGERS,     |---------------------------*O*---------------------------
> >  ||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
> >  || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> >  ||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
> >       `'
> >
> >  On Jun 4, 2024, at 16:03, Robert Kudyba via slurm-users <slurm-users@lists.schedmd.com> wrote:
> >
> >  At the moment we have 2 nodes that are having long wait times. Generally
> >  this is when the nodes are fully allocated. What would be the other
> >  reasons, if there is still enough available memory and CPU available,
> >  that a job would take so long? Slurm version is 23.02.4 via Bright
> >  Computing. Note the compute nodes have hyperthreading enabled but that
> >  should be irrelevant. Is there a way to determine what else could be
> >  holding jobs up?
> >
> >  srun --pty -t 0-01:00:00 --nodelist=node001 --gres=gpu:1 -A ourts -p short /bin/bash
> >  srun: job 672204 queued and waiting for resources
> >
> >   scontrol show node node001
> >  NodeName=m001 Arch=x86_64 CoresPerSocket=48
> >     CPUAlloc=24 CPUEfctv=192 CPUTot=192 CPULoad=20.37
> >     AvailableFeatures=location=local
> >     ActiveFeatures=location=local
> >     Gres=gpu:A6000:8
> >     NodeAddr=node001 NodeHostName=node001 Version=23.02.4
> >     OS=Linux 5.14.0-70.13.1.el9_0.x86_64 #1 SMP PREEMPT Thu Apr 14 12:42:38 EDT 2022
> >     RealMemory=1031883 AllocMem=1028096 FreeMem=222528 Sockets=2 Boards=1
> >     State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> >     Partitions=ours,short
> >     BootTime=2024-04-29T16:18:30 SlurmdStartTime=2024-05-18T16:48:11
> >     LastBusyTime=2024-06-03T10:49:49 ResumeAfterTime=None
> >     CfgTRES=cpu=192,mem=1031883M,billing=192,gres/gpu=8
> >     AllocTRES=cpu=24,mem=1004G,gres/gpu=2,gres/gpu:a6000=2
> >     CapWatts=n/a
> >     CurrentWatts=0 AveWatts=0
> >     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> >
> >  grep 672204 /var/log/slurmctld
> >  [2024-06-04T15:50:35.627] sched: _slurm_rpc_allocate_resources JobId=672204 NodeList=(null) usec=852
> >
> >  --
> >  slurm-users mailing list -- slurm-users@lists.schedmd.com
> >  To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
> --
> Dr. Loris Bennett (Herr/Mr)
> FUB-IT (ex-ZEDAT), Freie Universität Berlin
>
-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
