On 4/5/19 4:28 PM, Julien Rey wrote:
The failure occurs after a few minutes (~10).

And we are running out of space on the slurm controller. The mysql daemon is at 100% CPU usage all the time. This issue is becoming critical.
...
Our slurm accounting database is growing bigger and bigger (more than 100Gb) and is never being purged. We are running slurm 15.08.0-0pre1. I would like to upgrade to a more recent version of the slurmdbd, but my fear is that it may break everything during the update of the database.

Here is our slurmdbd.conf :

AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2
DbdHost=localhost
DebugLevel=6
StorageHost=localhost
StorageLoc=slurm_acct_db
StoragePass=shazaam
StorageType=accounting_storage/mysql
StorageUser=slurm
LogFile=/var/log/slurm-llnl/slurmdbd.log
PidFile=/var/run/slurm-llnl/slurmdbd.pid
SlurmUser=slurm
ArchiveDir=/home/joule/archives
PurgeEventAfter=18
PurgeJobAfter=18
PurgeResvAfter=1
PurgeStepAfter=1
PurgeSuspendAfter=1

I tried to purge it manually using this command but the slurmdbd daemon ends up crashing and it doesn't remove anything:

One more observation: Your purge values carry no unit, so they are interpreted in the default unit of months (18 means 18 months). A monthly purge operation can be a huge amount of work for a database of your size, so you certainly want to cut down the amount of work required in each purge.

It is probably a good idea to try out a series of daily purges starting with:

PurgeEventAfter=2000days
PurgeJobAfter=2000days
PurgeResvAfter=2000days
PurgeStepAfter=2000days
PurgeSuspendAfter=2000days

If this works well over a few days, decrease the 2000days purge interval little by little and try again (1800, 1500, etc.). After many iterations you will come down to the desired final purge intervals.
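To make the step-down procedure concrete, here is a minimal Python sketch that prints the sequence of purge settings to try, one set per iteration. The helper names (purge_schedule, render_purge_lines), the 300-day step and the 540-day target (roughly 18 months) are assumptions chosen for illustration, not values from Slurm or from your site. For simplicity it uses the same value for all five parameters, whereas your final targets differ per parameter. It only prints text; you would still edit slurmdbd.conf and have slurmdbd pick up each new value yourself.

```python
# Hypothetical helper: generate a decreasing series of purge intervals.
PURGE_KEYS = [
    "PurgeEventAfter",
    "PurgeJobAfter",
    "PurgeResvAfter",
    "PurgeStepAfter",
    "PurgeSuspendAfter",
]


def purge_schedule(start_days, target_days, step_days):
    """Return decreasing retention values in days, ending exactly at target_days."""
    if step_days <= 0:
        raise ValueError("step_days must be positive")
    if target_days > start_days:
        raise ValueError("target_days must not exceed start_days")
    # range() stops before reaching target_days, so append the target explicitly.
    values = list(range(start_days, target_days, -step_days))
    values.append(target_days)
    return values


def render_purge_lines(days):
    """Format the five purge parameters for one iteration, using the 'days' unit."""
    return "\n".join(f"{key}={days}days" for key in PURGE_KEYS)


if __name__ == "__main__":
    # Assumed values: start at 2000 days, step down by 300, stop at 540 days.
    for days in purge_schedule(2000, 540, 300):
        print(f"# iteration with a retention of {days} days")
        print(render_purge_lines(days))
        print()
```

Each printed group corresponds to one iteration of the procedure: apply it, let the daily purges run for a few days, and move on to the next group only if slurmdbd stays healthy.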

See some further details in https://wiki.fysik.dtu.dk/niflheim/Slurm_database#setting-database-purge-parameters

Best regards,
Ole
