Almost all the 5 min+ time was in the bzip2. The mysqldump by itself was
about 16 seconds. So I moved the bzip2 to its own separate line so
the tables are only locked for the ~16 seconds.
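The split described above might look like this as a cron-driven script (a sketch only; the dump path is an assumption, the mysqldump flags are the ones from the thread):

```shell
#!/bin/sh
# Dump first, so tables are held only for the ~16-second mysqldump;
# compress afterwards, so the slow bzip2 no longer extends that window.
# /var/backups/slurm is an assumed location, not from the thread.
set -e
DUMP=/var/backups/slurm/slurm_db.sql

mysqldump -R --single-transaction -B slurm_db > "$DUMP"
bzip2 -f "$DUMP"   # replaces the file with slurm_db.sql.bz2
```

With the pipe version, mysqldump can only write as fast as bzip2 consumes its output, so the whole dump runs at bzip2 speed; writing to a file first decouples the two steps.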
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
On Wed, 21 Sep 2022 3:49am, Ole Holm Nielsen wrote:
Hi Paul,
IMHO, using logrotate is the most convenient method for making daily database
backup dumps and keeping a number of backup versions; see the notes in
https://wiki.fysik.dtu.dk/niflheim/Slurm_database#backup-script-with-logrotate
Using --single-transaction is recommended by SchedMD to avoid race conditions
when slurmdbd is running while the MySQL dump is taken; see
https://bugs.schedmd.com/show_bug.cgi?id=10295#c18
/Ole
On 9/20/22 15:17, Paul Raines wrote:
Further investigation found that I had set up logrotate to handle a MySQL
dump:
mysqldump -R --single-transaction -B slurm_db | bzip2
which is what is taking 5 minutes. I think this is locking tables during
that time, most likely hanging calls to slurmdbd and causing the issue.
I will need to rework it.
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
On Mon, 19 Sep 2022 9:29am, Reed Dier wrote:
I’m not sure if this might be helpful, but my logrotate.d for slurm looks
a bit different: instead of a systemctl reload, I send a specific SIGUSR2
signal, which supposedly exists for the specific purpose of log rotation
in slurm.
postrotate
pkill -x --signal SIGUSR2 slurmctld
pkill -x --signal SIGUSR2 slurmd
pkill -x --signal SIGUSR2 slurmdbd
exit 0
endscript
I would take a look here: https://slurm.schedmd.com/slurm.conf.html#lbAQ
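For context, a fuller sketch of what such an /etc/logrotate.d/slurm file might look like (only the postrotate body is from my setup above; the log path and rotation directives here are assumptions to show where it fits):

```
/var/log/slurm/*.log {
    weekly
    rotate 8
    compress
    missingok
    notifempty
    sharedscripts
    postrotate
        pkill -x --signal SIGUSR2 slurmctld
        pkill -x --signal SIGUSR2 slurmd
        pkill -x --signal SIGUSR2 slurmdbd
        exit 0
    endscript
}
```

With SIGUSR2 the daemons simply reopen their log files, so no reload or restart of the services is involved.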
Reed
On Sep 19, 2022, at 7:46 AM, Paul Raines <rai...@nmr.mgh.harvard.edu>
wrote:
I have had two nights where right at 3:35am a bunch of jobs were
killed early with TIMEOUT way before their normal TimeLimit.
The slurmctld log has lots of lines like at 3:35am with
[2022-09-12T03:35:02.303] job_time_limit: inactivity time limit reached
for JobId=1636922
with jobs running on several different nodes.
The one curious thing is that right about this time, log rotation is
happening in cron on the slurmctld master node:
Sep 12 03:30:02 mlsc-head run-parts[1719028]: (/etc/cron.daily) starting
logrotate
Sep 12 03:34:59 mlsc-head run-parts[1719028]: (/etc/cron.daily) finished
logrotate
The 5 minute runtime here is a big anomaly. On other machines, like
nodes just running slurmd or my web servers, this only takes a couple of
seconds.
In /etc/logrotate.d/slurmctl I have
postrotate
systemctl reload slurmdbd >/dev/null 2>/dev/null || true
/bin/sleep 1
systemctl reload slurmctld >/dev/null 2>/dev/null || true
endscript
Does it make sense that this could be causing the issue?
In slurm.conf I had InactiveLimit=60, which I guess is what is happening,
but my reading of the docs on this setting was that it only affects the
starting of a job with srun/salloc and not a job that has been running
for days. Is it InactiveLimit that leads to the "inactivity time limit
reached" message?
Anyway, I have changed InactiveLimit=600 to see if that helps.
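For reference, the change as it might appear in slurm.conf (a sketch; the value is in seconds, and 0 disables the inactivity check entirely):

```
# slurm.conf on the slurmctld host
# Allow up to 10 minutes of srun/salloc unresponsiveness before
# the controller purges the job (was 60 seconds).
InactiveLimit=600
```

After editing, `scontrol reconfigure` typically picks up the new value without restarting slurmctld.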