Hello Veronique,
What is the value of innodb_buffer_pool_size in my.cnf (assuming
you're using MariaDB)?
Don't hesitate to set it to several GB, ideally a little more than the
size of your DB, if you have enough memory on the server. This improves
the overall performance of the database, especially as it grows.
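As a rough sketch, assuming the setting lives in the [mysqld] section of
my.cnf, it could look like this. The 4G value is only a placeholder to be
replaced by something slightly above your actual database size:

```ini
[mysqld]
# Placeholder value (assumption): pick a size a little larger than the
# slurm accounting database, provided the server has that much free RAM.
innodb_buffer_pool_size=4G
```

After editing the file, restart the database server so the new value is
picked up.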
Regards.
On 10/20/2017 07:59 PM, Lyn Gerner wrote:
Re: [slurm-dev] slurm database purge
Hi Veronique,
You understand correctly. Try 365days instead of 12months; the purge
will then run every night and remove a single day's worth of records
each time.
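For illustration, the purge lines of the slurmdbd.conf quoted below might
become something like the following. This is only a sketch: the 365days
value is the one suggested above, while 31days for steps is an assumption
standing in for the current 1months:

```ini
# Purge settings expressed in days, so the purge interval is one day
PurgeEventAfter=365days
PurgeJobAfter=365days
PurgeResvAfter=365days
# Assumption: 31days chosen as a stand-in for the original 1months
PurgeStepAfter=31days
PurgeSuspendAfter=365days
```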
Regards,
Lyn
On Fri, Oct 20, 2017 at 5:25 AM, Véronique LEGRAND
<veronique.legr...@pasteur.fr> wrote:
Hello,
For two months now we have been finding the slurmdbd daemon down on
the 1st of every month.
Error messages in the logs appear shortly after midnight.
They say:
2017-10-01T00:02:42.468823+02:00 tars-acct slurmdbd[7762]: error:
mysql_query failed: 1205 Lock wait timeout exceeded; try
restarting transaction#012insert into "tars_step_table"
(job_db_inx, id_step, time_start, step_name, state, tres_alloc,
nodes_alloc, task_cnt, nodelist, node_inx, task_dist, req_cpufreq,
req_cpufreq_min, req_cpufreq_gov) values (48088499, -2,
1506747882, 'batch', 1, '1=1,2=5000,4=1', 1, 1, 'tars-584', '252',
0, 0, 0, 0) on duplicate key update nodes_alloc=1, task_cnt=1,
time_end=0, state=1, nodelist='tars-584', node_inx='252', task_dis
2017-10-01T00:02:42.468854+02:00 tars-acct slurmdbd[7762]: fatal:
mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix
this is restart the calling program
In the slurmdbd.conf we have:
# CONTROLLER
AuthType=auth/munge
DbdHost=tars-acct
PidFile=/var/run/slurmdbd.pid
SlurmUser=slurm
# DEBUG
DebugLevel=info
#DebugLevel=debug2
#DebugFlags=DB_ARCHIVE
# DATABASE
StorageType=accounting_storage/mysql
StoragePass=slurmdbd
StorageUser=slurmdbd
# ARCHIVES
ArchiveDir=/path/to/archives
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes
ArchiveSteps=yes
ArchiveSuspend=yes
PurgeEventAfter=12months
PurgeJobAfter=12months
PurgeResvAfter=12months
PurgeStepAfter=1months
PurgeSuspendAfter=12months
I have read in the slurmdbd.conf documentation: "The purge takes
place at the start of the each purge interval. For example, if
the purge time is 2 months, the purge would happen
at the beginning of each month."
So I suppose that, since jobs are running even at midnight, what
happens is:
- slurmdbd tries to insert a record into tars_step_table while
the database is locked for the purge.
- As the purge takes a long time, the insert request times out.
We didn't have that problem before, but this may be because we had
fewer jobs (usage of our cluster is always increasing), so the
purge took less time.
If I understood the documentation correctly, at the beginning of
each month and according to my configuration, slurm purges all
jobs and events that are older than 1 year and all job steps that
are older than 1 month.
Can you confirm that I understood correctly?
If so, is there a way to have a shorter "purge interval"?
I would like to see whether the problem still occurs if we purge
every week.
Any feedback regarding this problem is welcome.
Regards,
Véronique
--
Véronique Legrand
IT engineer – scientific calculation & software development
https://research.pasteur.fr/en/member/veronique-legrand/
Cluster and computing group
IT department
Institut Pasteur Paris
Tel : 95 03
--
---
Mehdi Denou
Bull/Atos international HPC support
+336 45 57 66 56