To process the epilog, a Bash process must be created, so perhaps look at .bashrc. Try timing a run of the epilog yourself on a compute node. I presume it is owned by an account local to the compute nodes, not a directory service account?
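For example, something along these lines on one of the compute nodes might show where the time goes (a sketch; /tmp/epilog.sh stands in for whatever the test epilog is actually named, and SLURM_JOB_ID mimics one of the variables slurmd exports to the epilog):

    # Cost of starting a bare, non-interactive bash, i.e. what the epilog's
    # shebang pays before the script body runs:
    time bash -c 'exit 0'

    # Cost of running the test epilog itself by hand:
    time env SLURM_JOB_ID=43226 /tmp/epilog.sh

If both of these finish in milliseconds, the delay is more likely in how the epilog is being launched than in the script itself.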
William

On Fri, 1 Apr 2022, 17:25 Henderson, Brent, <brent.hender...@hpe.com> wrote:

> Hi slurm experts -
>
> I’ve gotten temporary access to a cluster with 1k nodes - so of course I
> set up Slurm on it (v20.11.8). :) Small jobs are fine and go back to idle
> rather quickly. Jobs that use all the nodes will have some ‘linger’ in the
> completing state for over a minute, while others may take less time - but
> it is still noticeable.
>
> Reading some older posts, I see that the epilog is a typical cause of
> this, so I removed it from the config file and indeed, nodes very quickly
> go back to the idle state after the job completes. I then created an
> epilog on each node in /tmp that just contained the bash header and
> exit 0, and changed my run to be just: ‘salloc -N 1024 sleep 10’. Even
> with this very simple command and epilog, the nodes exhibit the
> ‘lingering’ behavior before returning to idle.
>
> Looking in the slurmd log for one of the nodes that took >60s to go back
> to idle, I see this:
>
> [2022-03-31T20:57:44.158] Warning: Note very large processing time from
> prep_epilog: usec=75087286 began=20:56:29.070
> [2022-03-31T20:57:44.158] epilog for job 43226 ran for 75 seconds
>
> I tried upping the debug level on the slurmd side but didn’t see anything
> useful.
>
> So, I guess I have a couple of questions:
>
> - has anyone seen this behavior before and know a fix? :)
> - might this issue be resolved in 21.08? (I didn’t see anything in the
> release notes that talked about the epilog.)
> - thoughts on how to collect some additional information on what might be
> happening on the system to slow down the epilog?
>
> Thanks,
>
> Brent
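For reference, a minimal test epilog of the kind Brent describes, with timestamps added as one way to collect the extra information asked about in the last question (the script body and the /tmp/epilog-timing.log path are illustrative, not taken from the original post):

    #!/bin/bash
    # Minimal test epilog: just the bash header and exit 0, as described above,
    # plus a timestamp on entry and exit.  Comparing the "start" time against
    # the began= value slurmd logs for prep_epilog shows whether the ~75 seconds
    # are spent before bash ever starts or inside the script itself.
    echo "$(date '+%H:%M:%S.%N') start job=$SLURM_JOB_ID" >> /tmp/epilog-timing.log
    echo "$(date '+%H:%M:%S.%N') end   job=$SLURM_JOB_ID" >> /tmp/epilog-timing.log
    exit 0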