We do have bf_continue set, and also bf_max_job_user=50, because we discovered 
that one user can submit so many jobs that the backfill scheduler hits its limit 
on the number of jobs it will consider and skips jobs it could otherwise run.
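
For reference, these all go on the SchedulerParameters line in slurm.conf; as a 
rough sketch (bf_continue and bf_max_job_user=50 are what I described above, the 
bf_interval shown is just the documented default, not necessarily what we run):

SchedulerParameters=bf_continue,bf_max_job_user=50,bf_interval=30

A change there should take effect with scontrol reconfigure, without restarting 
slurmctld.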

On Jun 4, 2024, at 16:20, Robert Kudyba <rkud...@fordham.edu> wrote:

Thanks for the quick response Ryan!

Are there any recommendations for bf_ options from 
https://slurm.schedmd.com/sched_config.html that could help with this? 
bf_continue? Decreasing bf_interval= to a value lower than 30?
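
As an aside, the backfill settings currently in effect on the controller should 
show up with:

scontrol show config | grep -i SchedulerParameters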

On Tue, Jun 4, 2024 at 4:13 PM Ryan Novosielski <novos...@rutgers.edu> wrote:
This is relatively true of my system as well, and I believe it’s because the 
backfill scheduler is slower than the main scheduler.
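
If it helps to compare the two on your system, sdiag reports cycle statistics 
for both schedulers:

sdiag

Comparing mean/last cycle time and cycle depth between the main scheduler 
section and the backfilling section should show how far behind backfill is 
running.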

--
#BlackLivesMatter
____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
     `'

On Jun 4, 2024, at 16:03, Robert Kudyba via slurm-users 
<slurm-users@lists.schedmd.com> wrote:

At the moment we have 2 nodes where jobs are seeing long wait times. Generally 
this happens when the nodes are fully allocated. What other reasons could there 
be for a job to wait so long when there is still enough memory and CPU 
available? The Slurm version is 23.02.4 via Bright Computing. Note the compute 
nodes have hyperthreading enabled, but that should be irrelevant. Is there a way 
to determine what else could be holding jobs up?

srun --pty -t 0-01:00:00 --nodelist=node001 --gres=gpu:1 -A ourts -p short /bin/bash
srun: job 672204 queued and waiting for resources
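
As a side note, while the job is still pending, whatever reason slurmctld has 
recorded for it can be checked with either of:

squeue -j 672204 -o "%i %T %r"
scontrol show job 672204 | grep -i reason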

 scontrol show node node001
NodeName=m001 Arch=x86_64 CoresPerSocket=48
   CPUAlloc=24 CPUEfctv=192 CPUTot=192 CPULoad=20.37
   AvailableFeatures=location=local
   ActiveFeatures=location=local
   Gres=gpu:A6000:8
   NodeAddr=node001 NodeHostName=node001 Version=23.02.4
   OS=Linux 5.14.0-70.13.1.el9_0.x86_64 #1 SMP PREEMPT Thu Apr 14 12:42:38 EDT 2022
   RealMemory=1031883 AllocMem=1028096 FreeMem=222528 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=ours,short
   BootTime=2024-04-29T16:18:30 SlurmdStartTime=2024-05-18T16:48:11
   LastBusyTime=2024-06-03T10:49:49 ResumeAfterTime=None
   CfgTRES=cpu=192,mem=1031883M,billing=192,gres/gpu=8
   AllocTRES=cpu=24,mem=1004G,gres/gpu=2,gres/gpu:a6000=2
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

grep 672204 /var/log/slurmctld
[2024-06-04T15:50:35.627] sched: _slurm_rpc_allocate_resources JobId=672204 NodeList=(null) usec=852
