I’m not sure if this is helpful, but my logrotate.d entry for Slurm looks a bit different: instead of a systemctl reload, I send each daemon SIGUSR2, which as far as I know is the signal Slurm intends for log rotation.

> postrotate
>     pkill -x --signal SIGUSR2 slurmctld
>     pkill -x --signal SIGUSR2 slurmd
>     pkill -x --signal SIGUSR2 slurmdbd
>     exit 0
> endscript

I would take a look here:
https://slurm.schedmd.com/slurm.conf.html#lbAQ

Reed

> On Sep 19, 2022, at 7:46 AM, Paul Raines <rai...@nmr.mgh.harvard.edu> wrote:
>
> I have had two nights where right at 3:35am a bunch of jobs were
> killed early with TIMEOUT, way before their normal TimeLimit.
> The slurmctld log has lots of lines like this at 3:35am:
>
> [2022-09-12T03:35:02.303] job_time_limit: inactivity time limit reached for JobId=1636922
>
> with jobs running on several different nodes.
>
> The one curious thing is that right about this time log rotation is
> happening in cron on the slurmctld master node:
>
> Sep 12 03:30:02 mlsc-head run-parts[1719028]: (/etc/cron.daily) starting logrotate
> Sep 12 03:34:59 mlsc-head run-parts[1719028]: (/etc/cron.daily) finished logrotate
>
> The 5 minute runtime here is a big anomaly. On other machines, like
> nodes just running slurmd or my web servers, this only takes a couple of
> seconds.
>
> In /etc/logrotate.d/slurmctl I have
>
>     postrotate
>         systemctl reload slurmdbd >/dev/null 2>/dev/null || true
>         /bin/sleep 1
>         systemctl reload slurmctld >/dev/null 2>/dev/null || true
>     endscript
>
> Does it make sense that this could be causing the issue?
>
> In slurm.conf I had InactiveLimit=60, which I guess is what is happening,
> but my reading of the docs on this setting was that it only affects the
> starting of a job with srun/salloc and not a job that has been running
> for days. Is it InactiveLimit that leads to the "inactivity time limit
> reached" message?
>
> Anyway, I have changed to InactiveLimit=600 to see if that helps.
>
> ---------------------------------------------------------------
> Paul Raines http://help.nmr.mgh.harvard.edu
> MGH/MIT/HMS Athinoula A.
Martinos Center for Biomedical Imaging
> 149 (2301) 13th Street Charlestown, MA 02129 USA
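
To check whether the early kills really line up with the logrotate window, it may help to count the "inactivity time limit reached" lines per minute and compare them with the cron timestamps. Below is a rough sketch, not a tested tool: the function name count_inactivity_kills is made up for illustration, the log path in the usage comment is only an assumption about where slurmctld logs on your system, and the timestamp format is taken from the log line quoted above.

```shell
#!/bin/sh
# Hypothetical helper (name is illustrative, not part of Slurm).
# Reads slurmctld log lines on stdin and prints "<YYYY-MM-DDTHH:MM> <count>"
# for every minute that contains "inactivity time limit reached" messages.
count_inactivity_kills() {
    awk '/inactivity time limit reached/ {
             # $1 looks like [2022-09-12T03:35:02.303]; keep date plus HH:MM
             minute = substr($1, 2, 16)
             count[minute]++
         }
         END { for (m in count) print m, count[m] }' | sort
}

# Usage (the path is an assumption; use the LogFile from your slurm.conf):
#   count_inactivity_kills < /var/log/slurm/slurmctld.log
```

If the busy minutes fall right at the end of the logrotate run, that would at least be consistent with the reload in postrotate being involved, though it would not prove it.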