Hello Veronique,

What is the value of innodb_buffer_pool_size in my.cnf? (assuming you're using MariaDB)

Don't hesitate to set it to a few GB, ideally a little more than the size of your DB, if you have enough memory on the server. This improves the overall performance of the database, especially as it grows.
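For example, a minimal sketch of the change (the 4G figure is purely illustrative; size it to your own database and available RAM, and the file location may differ on your distribution):

    # /etc/my.cnf (or a drop-in file under /etc/my.cnf.d/)
    [mysqld]
    # Size the buffer pool a little larger than the on-disk size of the
    # accounting database; 4G is just a placeholder value.
    innodb_buffer_pool_size = 4G

You can check the current value and the on-disk size of each database from the MySQL client:

    -- current buffer pool size, in bytes
    SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

    -- approximate on-disk size of each database, in MB
    SELECT table_schema,
           ROUND(SUM(data_length + index_length) / 1024 / 1024) AS size_mb
    FROM information_schema.tables
    GROUP BY table_schema;

Note that changing the value in my.cnf requires a restart of the mysqld/mariadb service to take effect.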

Regards.


On 10/20/2017 07:59 PM, Lyn Gerner wrote:
Hi Veronique,

You understand correctly. Try 365days instead of 12months: with the retention expressed in days, the purge runs every night and only has a single day of records to delete each time. For example:
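A sketch of the corresponding slurmdbd.conf change, assuming you want to keep roughly the same retention windows (30days stands in for your current 1months):

    # Retention expressed in days, so the purge fires nightly in small
    # batches instead of monthly in one large batch.
    PurgeEventAfter=365days
    PurgeJobAfter=365days
    PurgeResvAfter=365days
    PurgeStepAfter=30days
    PurgeSuspendAfter=365days

Restart slurmdbd after editing the file so the new intervals take effect.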

Regards,
Lyn

On Fri, Oct 20, 2017 at 5:25 AM, Véronique LEGRAND <veronique.legr...@pasteur.fr <mailto:veronique.legr...@pasteur.fr>> wrote:

    Hello,

    For two months now, we have been finding the slurmdbd daemon down
    on the 1st of every month.

    Error messages in the logs appear shortly after midnight.

    They say:

    2017-10-01T00:02:42.468823+02:00 tars-acct slurmdbd[7762]: error:
    mysql_query failed: 1205 Lock wait timeout exceeded; try
    restarting transaction#012insert into "tars_step_table"
    (job_db_inx, id_step, time_start, step_name, state, tres_alloc,
    nodes_alloc, task_cnt, nodelist, node_inx, task_dist, req_cpufreq,
    req_cpufreq_min, req_cpufreq_gov) values (48088499, -2,
    1506747882, 'batch', 1, '1=1,2=5000,4=1', 1, 1, 'tars-584', '252',
    0, 0, 0, 0) on duplicate key update nodes_alloc=1, task_cnt=1,
    time_end=0, state=1, nodelist='tars-584', node_inx='252', task_dis

    2017-10-01T00:02:42.468854+02:00 tars-acct slurmdbd[7762]: fatal:
    mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix
    this is restart the calling program

    In the slurmdbd.conf we have:

    # CONTROLLER
    AuthType=auth/munge
    DbdHost=tars-acct
    PidFile=/var/run/slurmdbd.pid
    SlurmUser=slurm

    # DEBUG
    DebugLevel=info
    #DebugLevel=debug2
    #DebugFlags=DB_ARCHIVE

    # DATABASE
    StorageType=accounting_storage/mysql
    StoragePass=slurmdbd
    StorageUser=slurmdbd

    # ARCHIVES
    ArchiveDir=/path/to/archives
    ArchiveEvents=yes
    ArchiveJobs=yes
    ArchiveResvs=yes
    ArchiveSteps=yes
    ArchiveSuspend=yes
    PurgeEventAfter=12months
    PurgeJobAfter=12months
    PurgeResvAfter=12months
    PurgeStepAfter=1months
    PurgeSuspendAfter=12months

    I have read in the slurmdbd.conf documentation: "The purge takes
    place at the start of the each purge interval. For example, if the
    purge time is 2 months, the purge would happen at the beginning of
    each month."

    So I suppose that what happens, since jobs are running even at
    midnight, is:

    - slurmdbd tries to insert a record into the step table while the
    database is locked for the purge.

    - As the purge takes a long time, the insert request times out.
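
    The timeout being exceeded should be InnoDB's innodb_lock_wait_timeout
    (50 seconds by default); the current value can be checked from the
    MySQL client:

        -- how long a transaction waits for a row lock before failing
        -- with error 1205
        SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';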

    We didn't have this problem before, but that may be because we had
    fewer jobs (usage of our cluster keeps increasing), so the purge
    took less time...

    If I understood the documentation correctly: at the beginning of
    the month, according to my configuration, Slurm purges all jobs and
    events older than 1 year, and all job steps older than 1 month.

    Can you confirm that I have understood correctly?

    If so, is there a way to set a shorter "purge interval"? I would
    like to see whether the problem still occurs if we purge every week.

    Any feedback regarding this problem is welcome.

    Regards,

    Véronique

    --

    Véronique Legrand

    IT engineer – scientific calculation & software development

    https://research.pasteur.fr/en/member/veronique-legrand/

    Cluster and computing group

    IT department

    Institut Pasteur Paris

    Tel: 95 03



--
Mehdi Denou
Bull/Atos international HPC support
+336 45 57 66 56
