Hi Veronique,

You understand correctly. Try 365days instead of 12months; the purge will
then run every night and remove only a single day's worth of records at a
time.
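For reference, the purge lines in slurmdbd.conf would then look something
like the sketch below. Note that 30days for PurgeStepAfter is my own
extrapolation of the same idea (your current value is 1months); pick
whatever retention suits you, the point is simply to use day-based units.

```
# ARCHIVES -- day-based units make the purge run nightly,
# deleting only one day's worth of records at a time
PurgeEventAfter=365days
PurgeJobAfter=365days
PurgeResvAfter=365days
PurgeStepAfter=30days      # was 1months; 30days is my extrapolation
PurgeSuspendAfter=365days
```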

Regards,
Lyn

On Fri, Oct 20, 2017 at 5:25 AM, Véronique LEGRAND <
veronique.legr...@pasteur.fr> wrote:

> Hello,
>
> For two months now, we have been finding the slurmdbd daemon down on the
> 1st of every month.
>
> Error messages in the logs appear shortly after midnight.
>
> They say:
>
> 2017-10-01T00:02:42.468823+02:00 tars-acct slurmdbd[7762]: error:
> mysql_query failed: 1205 Lock wait timeout exceeded; try restarting
> transaction#012insert into "tars_step_table" (job_db_inx, id_step,
> time_start, step_name, state, tres_alloc, nodes_alloc, task_cnt, nodelist,
> node_inx, task_dist, req_cpufreq, req_cpufreq_min, req_cpufreq_gov) values
> (48088499, -2, 1506747882, 'batch', 1, '1=1,2=5000,4=1', 1, 1, 'tars-584',
> '252', 0, 0, 0, 0) on duplicate key update nodes_alloc=1, task_cnt=1,
> time_end=0, state=1, nodelist='tars-584', node_inx='252', task_dis
>
> 2017-10-01T00:02:42.468854+02:00 tars-acct slurmdbd[7762]: fatal: mysql
> gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart
> the calling program
>
>
>
> In the slurmdbd.conf we have:
>
> # CONTROLLER
> AuthType=auth/munge
> DbdHost=tars-acct
> PidFile=/var/run/slurmdbd.pid
> SlurmUser=slurm
>
> # DEBUG
> DebugLevel=info
> #DebugLevel=debug2
> #DebugFlags=DB_ARCHIVE
>
> # DATABASE
> StorageType=accounting_storage/mysql
> StoragePass=slurmdbd
> StorageUser=slurmdbd
>
> # ARCHIVES
> ArchiveDir=/path/to/archives
> ArchiveEvents=yes
> ArchiveJobs=yes
> ArchiveResvs=yes
> ArchiveSteps=yes
> ArchiveSuspend=yes
> PurgeEventAfter=12months
> PurgeJobAfter=12months
> PurgeResvAfter=12months
> PurgeStepAfter=1months
> PurgeSuspendAfter=12months
>
> I have read in the slurmdbd.conf documentation: "The purge takes place
> at the start of each purge interval. For example, if the purge time is 2
> months, the purge would happen at the beginning of each month."
>
> So, I suppose that what happens, since jobs are still running at
> midnight, is:
>
> - slurmdbd tries to insert a record into the step table while the
> tables are locked for the purge.
>
> - Because the purge takes a long time, the insert request times out.
>
>
>
> We didn't have this problem before, but that may be because we had fewer
> jobs (usage of our cluster keeps increasing), so the purge took less
> time...
>
>
>
> If I understood the documentation correctly: at the beginning of the
> month, with my configuration, Slurm purges all jobs and events older
> than one year, and all job steps older than one month.
>
> Can you confirm that I understood correctly?
>
>
>
> If so, is there a way to set a shorter "purge interval"? I would like to
> see whether the problem still occurs if we purge every week.
>
>
>
> Any feedback regarding this problem is welcome.
>
>
>
> Regards,
>
>
>
>
>
> Véronique
>
>
>
>
>
> --
>
> Véronique Legrand
>
> IT engineer – scientific calculation & software development
>
> https://research.pasteur.fr/en/member/veronique-legrand/
>
> Cluster and computing group
>
> IT department
>
> Institut Pasteur Paris
>
> Tel : 95 03
>
>
>
