Hi Veronique,

You understand correctly. Try 365days instead of 12months: expressing the purge period in days makes the purge run nightly, so each run only removes a single day's worth of records instead of a whole month's worth at once.
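For example, the relevant lines in slurmdbd.conf would become something like the sketch below (the values are illustrative; 30days for steps is my assumption as a day-based equivalent of your 1months, adjust to your site's retention policy):

```
# ARCHIVES
# Day-based units => slurmdbd purges nightly, one day's records per run,
# instead of a large monthly purge that holds table locks for a long time.
PurgeEventAfter=365days
PurgeJobAfter=365days
PurgeResvAfter=365days
PurgeStepAfter=30days      # assumption: day-based equivalent of 1months
PurgeSuspendAfter=365days
```

After editing, restart slurmdbd so the new purge schedule takes effect.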
Regards,
Lyn

On Fri, Oct 20, 2017 at 5:25 AM, Véronique LEGRAND <veronique.legr...@pasteur.fr> wrote:
> Hello,
>
> For two months now we have been finding the slurmdbd daemon down on every
> 1st of the month. The error messages in the logs appear shortly after
> midnight. They say:
>
> 2017-10-01T00:02:42.468823+02:00 tars-acct slurmdbd[7762]: error:
> mysql_query failed: 1205 Lock wait timeout exceeded; try restarting
> transaction#012insert into "tars_step_table" (job_db_inx, id_step,
> time_start, step_name, state, tres_alloc, nodes_alloc, task_cnt, nodelist,
> node_inx, task_dist, req_cpufreq, req_cpufreq_min, req_cpufreq_gov) values
> (48088499, -2, 1506747882, 'batch', 1, '1=1,2=5000,4=1', 1, 1, 'tars-584',
> '252', 0, 0, 0, 0) on duplicate key update nodes_alloc=1, task_cnt=1,
> time_end=0, state=1, nodelist='tars-584', node_inx='252', task_dis
>
> 2017-10-01T00:02:42.468854+02:00 tars-acct slurmdbd[7762]: fatal: mysql
> gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart
> the calling program
>
> In slurmdbd.conf we have:
>
> # CONTROLLER
> AuthType=auth/munge
> DbdHost=tars-acct
> PidFile=/var/run/slurmdbd.pid
> SlurmUser=slurm
>
> # DEBUG
> DebugLevel=info
> #DebugLevel=debug2
> #DebugFlags=DB_ARCHIVE
>
> # DATABASE
> StorageType=accounting_storage/mysql
> StoragePass=slurmdbd
> StorageUser=slurmdbd
>
> # ARCHIVES
> ArchiveDir=/path/to/archives
> ArchiveEvents=yes
> ArchiveJobs=yes
> ArchiveResvs=yes
> ArchiveSteps=yes
> ArchiveSuspend=yes
> PurgeEventAfter=12months
> PurgeJobAfter=12months
> PurgeResvAfter=12months
> PurgeStepAfter=1months
> PurgeSuspendAfter=12months
>
> I have read in the slurmdbd.conf documentation: "The purge takes place at
> the start of each purge interval. For example, if the purge time is 2
> months, the purge would happen at the beginning of each month."
> So, I suppose that what happens, since jobs are running even at midnight,
> is:
>
> - slurmdbd tries to insert a record into the step table while the
>   database is locked for the purge.
> - Because the purge takes a long time, the insert request times out.
>
> We didn't have that problem before, but that was perhaps because we had
> fewer jobs (usage of our cluster is always increasing), so the purge took
> less time.
>
> If I understood the documentation correctly: at the beginning of the
> month and according to my configuration, Slurm purges all jobs and events
> that are older than one year, and all job steps that are older than one
> month. Can you confirm that I understood this correctly?
>
> If so, is there a way to have a shorter "purge interval"? I would like to
> see whether the problem still happens if we purged every week.
>
> Any feedback regarding this problem is welcome.
>
> Regards,
>
> Véronique
>
> --
> Véronique Legrand
> IT engineer – scientific calculation & software development
> https://research.pasteur.fr/en/member/veronique-legrand/
> Cluster and computing group
> IT department
> Institut Pasteur Paris
> Tel : 95 03