To process the epilog, a Bash process must be created, so perhaps look at .bashrc. Try timing a run of the epilog yourself on a compute node. I presume it is owned by an account local to the compute nodes, not a directory service account?
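For example, something along these lines on one of the compute nodes might show where the time goes (a sketch; /tmp/epilog.sh stands in for whatever the test epilog is actually named, and SLURM_JOB_ID mimics one of the variables slurmd exports to the epilog):

    # Cost of starting a bare, non-interactive bash, i.e. what the epilog's
    # shebang pays before the script body runs:
    time bash -c 'exit 0'

    # Cost of running the test epilog itself by hand:
    time env SLURM_JOB_ID=43226 /tmp/epilog.sh

If both of these finish in milliseconds, the delay is more likely in how the epilog is being launched than in the script itself.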
William

On Fri, 1 Apr 2022, 17:25 Henderson, Brent, <brent.hender...@hpe.com> wrote:

> Hi slurm experts -
>
> I’ve gotten temporary access to a cluster with 1k nodes - so of course I
> set up Slurm on it (v20.11.8). :) Small jobs are fine and go back to idle
> rather quickly. Jobs that use all the nodes will have some ‘linger’ in the
> completing state for over a minute, while others may take less time - but
> it is still noticeable.
>
> Reading some older posts, I see that the epilog is a typical cause of
> this, so I removed it from the config file and indeed, nodes very quickly
> go back to the idle state after the job completes. I then created an
> epilog on each node in /tmp that just contained the bash header and
> exit 0, and changed my run to be just: ‘salloc -N 1024 sleep 10’. Even
> with this very simple command and epilog, the nodes exhibit the
> ‘lingering’ behavior before returning to idle.
>
> Looking in the slurmd log for one of the nodes that took >60s to go back
> to idle, I see this:
>
> [2022-03-31T20:57:44.158] Warning: Note very large processing time from
> prep_epilog: usec=75087286 began=20:56:29.070
> [2022-03-31T20:57:44.158] epilog for job 43226 ran for 75 seconds
>
> I tried upping the debug level on the slurmd side but didn’t see anything
> useful.
>
> So, I guess I have a couple of questions:
>
> - has anyone seen this behavior before and know a fix? :)
> - might this issue be resolved in 21.08? (I didn’t see anything in the
> release notes that talked about the epilog.)
> - thoughts on how to collect some additional information on what might be
> happening on the system to slow down the epilog?
>
> Thanks,
>
> Brent
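For reference, a minimal test epilog of the kind Brent describes, with timestamps added as one way to collect the extra information asked about in the last question (the script body and the /tmp/epilog-timing.log path are illustrative, not taken from the original post):

    #!/bin/bash
    # Minimal test epilog: just the bash header and exit 0, as described above,
    # plus a timestamp on entry and exit.  Comparing the "start" time against
    # the began= value slurmd logs for prep_epilog shows whether the ~75 seconds
    # are spent before bash ever starts or inside the script itself.
    echo "$(date '+%H:%M:%S.%N') start job=$SLURM_JOB_ID" >> /tmp/epilog-timing.log
    echo "$(date '+%H:%M:%S.%N') end   job=$SLURM_JOB_ID" >> /tmp/epilog-timing.log
    exit 0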