Hi Paul,
IMHO, using logrotate is the most convenient method for making daily
database backup dumps and keeping a number of backup versions; see the notes in
https://wiki.fysik.dtu.dk/niflheim/Slurm_database#backup-script-with-logrotate
Using --single-transaction is recommended by SchedMD to avoid race
conditions when slurmdbd is being run while taking the MySQL dump, see
https://bugs.schedmd.com/show_bug.cgi?id=10295#c18
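For anyone not clicking through, the idea is roughly this (a minimal sketch;
the file names, database name and retention count below are my assumptions,
not the wiki's exact script): logrotate rotates yesterday's dump away, and
the postrotate step writes a fresh one, so you automatically keep N copies.
# Hypothetical /etc/logrotate.d/slurm_db_backup
/var/log/mariadb_backup/slurm_acct_db_backup.sql.gz {
    daily
    rotate 7
    nocompress
    # the dump file must exist once (run the dump by hand) so the
    # first rotation has something to rotate
    postrotate
        /usr/bin/mysqldump --single-transaction slurm_acct_db | gzip > /var/log/mariadb_backup/slurm_acct_db_backup.sql.gz
    endscript
}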
/Ole
On 9/20/22 15:17, Paul Raines wrote:
Further investigation found that I had set up logrotate to handle a mysql
dump:
mysqldump -R --single-transaction -B slurm_db | bzip2
which is what is taking 5 minutes. I think this is most likely locking
tables during that time, hanging calls into slurmdbd and causing the issue.
I will need to rework it.
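One way I may rework it (just a sketch; the time, output path and cron file
name here are placeholders, not something I have settled on): take the dump
out of the logrotate run entirely and drive it from its own cron job at a
quieter time, so a slow dump can no longer stretch the logrotate window.
# Hypothetical /etc/cron.d/slurm_db_dump -- same dump command, own schedule
15 1 * * * root /usr/bin/mysqldump -R --single-transaction -B slurm_db | bzip2 > /var/backups/slurm_db.sql.bz2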
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
On Mon, 19 Sep 2022 9:29am, Reed Dier wrote:
I’m not sure if this might be helpful, but my logrotate.d for slurm looks
a bit different: instead of a systemctl reload, I am sending a SIGUSR2
signal, which is supposedly intended specifically for log rotation in Slurm.
postrotate
pkill -x --signal SIGUSR2 slurmctld
pkill -x --signal SIGUSR2 slurmd
pkill -x --signal SIGUSR2 slurmdbd
exit 0
endscript
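For completeness, that postrotate sits inside an otherwise ordinary stanza;
mine is along these lines (the log path and rotation settings here are
illustrative, not copied verbatim from my file):
/var/log/slurm/*.log {
    weekly
    rotate 8
    compress
    missingok
    notifempty
    postrotate
        pkill -x --signal SIGUSR2 slurmctld
        pkill -x --signal SIGUSR2 slurmd
        pkill -x --signal SIGUSR2 slurmdbd
        exit 0
    endscript
}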
I would take a look here: https://slurm.schedmd.com/slurm.conf.html#lbAQ
Reed
On Sep 19, 2022, at 7:46 AM, Paul Raines <rai...@nmr.mgh.harvard.edu>
wrote:
I have had two nights where right at 3:35am a bunch of jobs were
killed early with TIMEOUT way before their normal TimeLimit.
The slurmctld log has lots of lines like this at 3:35am:
[2022-09-12T03:35:02.303] job_time_limit: inactivity time limit reached
for JobId=1636922
with jobs running on several different nodes.
The one curious thing is that right about this time, log rotation is
running from cron on the slurmctld master node:
Sep 12 03:30:02 mlsc-head run-parts[1719028]: (/etc/cron.daily)
starting logrotate
Sep 12 03:34:59 mlsc-head run-parts[1719028]: (/etc/cron.daily)
finished logrotate
The 5-minute runtime here is a big anomaly. On other machines, like
nodes just running slurmd or my web servers, this only takes a couple
of seconds.
In /etc/logrotate.d/slurmctl I have
postrotate
systemctl reload slurmdbd >/dev/null 2>/dev/null || true
/bin/sleep 1
systemctl reload slurmctld >/dev/null 2>/dev/null || true
endscript
Does it make sense that this could be causing the issue?
In slurm.conf I had InactiveLimit=60, which I guess is what triggered this,
but my reading of the docs on this setting was that it only affects the
startup of a job with srun/salloc and not a job that has been running
for days. Is it InactiveLimit that leads to the "inactivity time limit
reached" message?
Anyway, I have changed it to InactiveLimit=600 to see if that helps.
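For reference, that is just the one line in slurm.conf (600 is my guess at a
value comfortably longer than the slow logrotate run, not a value recommended
anywhere in the docs):
# slurm.conf: seconds an srun/salloc may be unresponsive before the job is purged
InactiveLimit=600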