Hi Julien,

Did you optimize the MySQL database, in particular InnoDB?

I have collected some documentation in my Wiki page https://wiki.fysik.dtu.dk/niflheim/Slurm_database#mysql-configuration
and I also discuss database purging.

Please note that we run Slurm 17.11 (and recently 18.08) on CentOS 7.6 systems which come with a MariaDB 5.5 database. We have no problems with the database or the daily purging operations (see https://wiki.fysik.dtu.dk/niflheim/Slurm_database#setting-database-purge-parameters).

If you are upgrading your Slurm version (or planning to), I also recommend that you read the [slurm-users] thread "Extreme long db upgrade 16.05.6 -> 17.11.3" from the last few days.

Best regards,
Ole

On 4/5/19 4:28 PM, Julien Rey wrote:
The failure occurs after a few minutes (~10).

And we are running out of space on the slurm controller. The mysql daemon is at 100% CPU usage all the time. This issue is becoming critical.

On 05/04/2019 16:10, Paul Edmon wrote:
Did it just time out, or did that failure happen immediately? If it was immediate, you may be hitting a bug. It "should" be safe to upgrade to a later version of 15.08.*, and there may be fixes in there related to that. I would look at the changelog, though, just to see whether any database work was done.

-Paul Edmon-

On 4/5/19 9:05 AM, Julien Rey wrote:
Hi Paul, thanks for your advice. Actually, I already tried what you suggested. No matter what value I put after PurgeJobAfter, I always end up with the same error:

sacctmgr archive dump Directory=/home/joule/archives/ PurgeJobAfter=1days
sacctmgr: error: slurmdbd: Getting response to message type 1459
sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
 Problem dumping archive: Unspecified error

sacctmgr archive dump Directory=/home/joule/archives/ PurgeJobAfter=48months
sacctmgr: error: slurmdbd: Getting response to message type 1459
sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
 Problem dumping archive: Unspecified error

Has anyone tried to truncate tables by hand, directly from the mysql command line?

On 04/04/2019 16:13, Paul Edmon wrote:
We ran into this problem in the past.  I know that fixes were put in to deal with large purges as a result of our problems but I don't recall what version they ended up in, likely newer than 15.08.0.

A solution that can work is to walk up the time so that instead of one large purge you do several smaller purges. That at least worked for us in the past.
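
The stepwise approach could be sketched roughly as follows. This is only an illustration: the starting age of 60 months and the 6-month step are assumed values, not taken from this thread, and the script only prints the commands rather than running them. The command form is the one quoted elsewhere in this thread.

```shell
#!/bin/sh
# Sketch: replace one large purge with several smaller ones by lowering
# PurgeJobAfter step by step, oldest records first.
# ARCHIVE_DIR defaults to the directory used in this thread.
ARCHIVE_DIR="${ARCHIVE_DIR:-/home/joule/archives/}"

# purge_commands START END STEP
# Prints one sacctmgr command per step, from START months down to END months.
# START, END and STEP are hypothetical parameters chosen for illustration.
purge_commands() {
    [ "$3" -gt 0 ] || return 1      # a zero or negative step would never terminate
    m="$1"
    while [ "$m" -ge "$2" ]; do
        echo "sacctmgr archive dump Directory=${ARCHIVE_DIR} PurgeJobAfter=${m}months"
        m=$((m - $3))
    done
}

# Dry run: print the commands only.
purge_commands 60 18 6
```

Once the printed commands look right, they can be run one at a time, checking that each one finishes before starting the next, so that a failure shows which step was too large.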

-Paul Edmon-

On 4/4/19 9:38 AM, Julien Rey wrote:
Hello,

Our slurm accounting database keeps growing (it is now more than 100 GB) and is never purged. We are running slurm 15.08.0-0pre1. I would like to upgrade to a more recent version of slurmdbd, but I fear that the database update might break everything.

Here is our slurmdbd.conf:

AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2
DbdHost=localhost
DebugLevel=6
StorageHost=localhost
StorageLoc=slurm_acct_db
StoragePass=shazaam
StorageType=accounting_storage/mysql
StorageUser=slurm
LogFile=/var/log/slurm-llnl/slurmdbd.log
PidFile=/var/run/slurm-llnl/slurmdbd.pid
SlurmUser=slurm
ArchiveDir=/home/joule/archives
PurgeEventAfter=18
PurgeJobAfter=18
PurgeResvAfter=1
PurgeStepAfter=1
PurgeSuspendAfter=1
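
A side note on the Purge* values above: as far as I know, slurmdbd reads a bare number as months unless a unit such as "days" or "months" is appended, so PurgeStepAfter=1 would mean one month. A small sketch for spotting such unitless settings is below; the function name and the config path in the usage comment are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: list Purge*After settings in a slurmdbd.conf that have no explicit unit.
# unitless_purge_settings is a hypothetical helper name.
unitless_purge_settings() {
    # Match lines such as "PurgeJobAfter=18" but not "PurgeJobAfter=18months".
    grep -E '^[[:space:]]*Purge[A-Za-z]+After=[0-9]+[[:space:]]*$' "$1"
}

# Usage (the path is an assumption; use the location of your own file):
#   unitless_purge_settings /etc/slurm-llnl/slurmdbd.conf
```

Writing the unit out explicitly (for example PurgeJobAfter=18months) removes any doubt about what the daemon will do.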

I tried to purge it manually using this command but the slurmdbd daemon ends up crashing and it doesn't remove anything:

sacctmgr archive dump Directory=/home/joule/archives/ PurgeJobAfter=365days

sacctmgr: error: slurmdbd: Getting response to message type 1459
sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
 Problem dumping archive: Unspecified error

Sometimes I have to restart the mysql daemon (we are running mysql 5.5.39-1). The /var/log/slurm-llnl/slurmdbd.log shows nothing, and the mysql logs are empty.

I tried increasing these values in my.cnf, but without success so far:

innodb_buffer_pool_size        = 32G
innodb_lock_wait_timeout    = 3600
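
Since my.cnf is read when the server starts, it may be worth confirming that the running daemon is actually using these values. A minimal sketch follows: the helper name is hypothetical, the mysql credentials in the usage comment are an assumption, and innodb_log_file_size is included only as a further setting one might want to inspect.

```shell
#!/bin/sh
# Sketch: print SQL that shows the InnoDB values the running server is using.
# innodb_check_sql is a hypothetical helper name.
innodb_check_sql() {
    cat <<'SQL'
SHOW VARIABLES WHERE Variable_name IN
  ('innodb_buffer_pool_size', 'innodb_lock_wait_timeout', 'innodb_log_file_size');
SQL
}

# Usage (user and authentication are assumptions):
#   innodb_check_sql | mysql -u root -p
```

If the reported values differ from those in my.cnf, the file being edited is probably not the one the server reads, or the daemon has not been restarted since the change.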

Is there any way to solve this issue? Otherwise, what would be the procedure for deleting the database records altogether and starting over with a fresh database?
