Hi Paul,
Interesting observation on the execution time and the pipe! How do you
ensure that you have enough disk space for the uncompressed database dump?
Maybe using /dev/shm?
The lbzip2 mentioned in the link below is significantly faster than bzip2.
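Something along these lines is what I had in mind (untested, paths are only
examples, and it assumes the uncompressed dump fits in RAM):

  # dump to tmpfs first, then compress in parallel with lbzip2
  mysqldump -R --single-transaction -B slurm_db > /dev/shm/slurm_db.sql
  lbzip2 -c /dev/shm/slurm_db.sql > /var/backups/slurm_db.sql.bz2
  rm /dev/shm/slurm_db.sql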
Best regards,
Ole
On 9/21/22 14:38, Paul Raines wrote:
Almost all the 5 min+ time was in the bzip2. The mysqldump by itself was
about 16 seconds. So I moved the bzip2 to its own separate line so
the tables are only locked for the ~16 seconds.
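Roughly what the backup looks like now (paths are just examples, timings are
from the runs above):

  # step 1: dump only -- tables locked for ~16 seconds
  mysqldump -R --single-transaction -B slurm_db > /backup/slurm_db.sql
  # step 2: compress separately -- takes the remaining 5 min+, no locks held
  bzip2 -f /backup/slurm_db.sql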
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
On Wed, 21 Sep 2022 3:49am, Ole Holm Nielsen wrote:
Hi Paul,
IMHO, using logrotate is the most convenient method for making daily
database backup dumps and keeping a number of backup versions, see the notes in
https://wiki.fysik.dtu.dk/niflheim/Slurm_database#backup-script-with-logrotate
Using --single-transaction is recommended by SchedMD to avoid locking the
database tables during the dump.
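The general shape of the logrotate approach is roughly this (see the wiki
page above for the actual script; file name, database name and retention here
are only examples, and you need to create the first dump by hand once so
logrotate has something to rotate):

  /root/mysql_backups/slurm_acct_db.sql.bz2 {
      daily
      rotate 14
      nocompress
      postrotate
          mysqldump -R --single-transaction -B slurm_acct_db | bzip2 > /root/mysql_backups/slurm_acct_db.sql.bz2
      endscript
  }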
Further investigation found that I had set up logrotate to handle a mysql
dump
mysqldump -R --single-transaction -B slurm_db | bzip2
which is what is taking 5 minutes. I think this is locking tables during
that time, most likely hanging calls to slurmdbd and causing the issue.
I will need to rework how that dump is done.
On 19/9/22 05:46, Paul Raines wrote:
In slurm.conf I had InactiveLimit=60, which I guess is what is happening,
but my reading of the docs on this setting was that it only affects the
starting of a job with srun/salloc and not a job that has been running
for days. Is it InactiveLimit that leads to the "inactivity time limit
reached" kills?
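For reference, the relevant line in my slurm.conf (value is in seconds; 0
would disable the limit entirely):

  InactiveLimit=60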
Paul,
You are likely spot on with the InactiveLimit change. It may also be a
TMOUT environment variable (under bash) being set.
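A quick way to check for that (just an example):

  echo $TMOUT
  grep -r TMOUT /etc/profile /etc/profile.d/ 2>/dev/null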
Brian Andrus
On 9/19/2022 5:46 AM, Paul Raines wrote:
I have had two nights where right at 3:35am a bunch of jobs were
killed early with TIMEOUT way before their normal TimeLimit.
I'm not sure if this might be helpful, but my logrotate.d for slurm looks a bit
different: instead of a systemctl reload, I am sending a SIGUSR2 signal, which
is supposedly there specifically for log rotation in slurm.
> postrotate
>     pkill -x --signal SIGUSR2 slurmctld
> endscript
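For context, the whole file is roughly like this (log path and retention are
just examples, adjust for your site):

  /var/log/slurm/slurmctld.log {
      weekly
      rotate 8
      compress
      missingok
      notifempty
      postrotate
          pkill -x --signal SIGUSR2 slurmctld
      endscript
  }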
I have had two nights where right at 3:35am a bunch of jobs were
killed early with TIMEOUT way before their normal TimeLimit.
The slurmctld log has lots of lines like this at 3:35am:
[2022-09-12T03:35:02.303] job_time_limit: inactivity time limit reached
for JobId=1636922
with jobs running o