Hi slurm experts -

I've gotten temporary access to a cluster with 1k nodes - so of course I set up
slurm on it (v20.11.8).  :)  Small jobs are fine and the nodes go back to idle
quickly.  Jobs that use all the nodes, however, linger in the COMPLETING state
for over a minute; some nodes take less time, but the delay is still noticeable.

Reading some older posts, I saw that the epilog is a typical cause for this, so
I removed it from the config file, and indeed the nodes go back to idle very
quickly after the job completes.  I then created an epilog on each node in
/tmp that contained just the bash header and 'exit 0', and changed my run to
simply: 'salloc -N 1024  sleep 10'.  Even with this trivial command and
epilog, the nodes exhibit the 'lingering' behavior before returning to idle.
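
For reference, the /tmp epilog is literally just:

    #!/bin/bash
    # do nothing - exit cleanly
    exit 0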

Looking in the slurmd log for one of the nodes that took >60s to go back to 
idle, I see this:

[2022-03-31T20:57:44.158] Warning: Note very large processing time from prep_epilog: usec=75087286 began=20:56:29.070
[2022-03-31T20:57:44.158] epilog for job 43226 ran for 75 seconds

I tried upping the debug level on the slurmd side but didn't see anything 
useful.
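
(Concretely, I raised SlurmdDebug in slurm.conf and pushed it out, roughly:

    SlurmdDebug=debug2    # in slurm.conf on all nodes
    scontrol reconfigure

and still nothing jumped out in the logs.)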

So, I guess I have a couple of questions:
- Has anyone seen this behavior before and know of a fix?  :)
- Might this issue be resolved in 21.08?  (I didn't see anything in the release
notes about the epilog.)
- Any thoughts on how to collect additional information on what might be
happening on the system to slow down the epilog?  (One idea I had is sketched
below.)
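
Here's a rough timing wrapper I was thinking of dropping in as the epilog, to
separate the script's own runtime from slurmd overhead (the log path is just a
placeholder):

    #!/bin/bash
    # hypothetical instrumented epilog: log wall-clock timestamps so the
    # script's runtime can be compared against slurmd's reported epilog time
    log=/tmp/epilog_timing.${SLURM_JOB_ID}.log
    echo "$(date +%H:%M:%S.%N) epilog start on $(hostname)" >> "$log"
    # (real epilog work would go here)
    echo "$(date +%H:%M:%S.%N) epilog end" >> "$log"
    exit 0

If the start/end stamps are tight but slurmd still reports 75 seconds, the time
is being lost outside the script itself.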

Thanks,

Brent

ClusterName=test
ControlMachine=test01
ControlAddr=172.23.0.1
SlurmctldPort=6819
SlurmdPort=6820
SlurmUser=slurm
AuthType=auth/munge
AccountingStorageType=accounting_storage/none
AcctGatherNodeFreq=0
AcctGatherEnergyType=acct_gather_energy/none
JobAcctGatherFrequency=Task=0,Energy=0,Network=0,Filesystem=0
JobAcctGatherType=jobacct_gather/none
ProcTrackType=proctrack/linuxproc
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
SlurmctldDebug=verbose
SlurmdDebug=verbose
#SlurmdLogFile=/var/log/slurmd.log
SlurmdTimeout=600
SlurmctldTimeout=120
StateSaveLocation=/var/lib/slurm_state
TaskPlugin=task/affinity,task/cgroup
Prolog=/tmp/prolog.sh
Epilog=/tmp/epilog.sh
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdLogFile=/var/log/slurmd.log
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
UnkillableStepTimeout=120
KillOnBadExit=1
TreeWidth=45

NodeName=n[1-1024] CPUs=256 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=2 RealMemory=514741
PartitionName=all Nodes=n[1-1024] Default=Yes OverSubscribe=EXCLUSIVE MaxTime=INFINITE State=UP
