On 4/5/19 4:28 PM, Julien Rey wrote:
The failure occurs after a few minutes (~10).
And we are running out of space on the slurm controller. The mysql
daemon is at 100% CPU usage all the time. This issue is becoming critical.
...
Our slurm accounting database is growing bigger and bigger (more
than 100 GB) and is never being purged. We are running slurm
15.08.0-0pre1. I would like to upgrade to a more recent version of
the slurmdbd, but my fear is that it may break everything during
the update of the database.
Here is our slurmdbd.conf:
AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2
DbdHost=localhost
DebugLevel=6
StorageHost=localhost
StorageLoc=slurm_acct_db
StoragePass=shazaam
StorageType=accounting_storage/mysql
StorageUser=slurm
LogFile=/var/log/slurm-llnl/slurmdbd.log
PidFile=/var/run/slurm-llnl/slurmdbd.pid
SlurmUser=slurm
ArchiveDir=/home/joule/archives
PurgeEventAfter=18
PurgeJobAfter=18
PurgeResvAfter=1
PurgeStepAfter=1
PurgeSuspendAfter=1
I tried to purge it manually using this command but the slurmdbd
daemon ends up crashing and it doesn't remove anything:
One more observation: You are using the default monthly intervals (18
means 18 months). A monthly purge operation can be a huge amount of work
for a database of your size, and you certainly want to cut down the
amount of work required during each purge.
It is probably a good idea to try out a series of daily purges starting
with:
PurgeEventAfter=2000days
PurgeJobAfter=2000days
PurgeResvAfter=2000days
PurgeStepAfter=2000days
PurgeSuspendAfter=2000days
If this works well over a few days, decrease the purge interval from
2000days little by little and try again (1800, 1500, etc.) until, after
many iterations, you come down to the desired final purge intervals.
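
To make the step-down concrete, here is a minimal Python sketch that
lists the successive retention values and prints the matching
slurmdbd.conf lines for each round. The helper names, the 200-day step
and the 540-day target (roughly 18 months) are my own assumptions for
illustration; they are not part of Slurm, so pick values that suit your
site. The script only prints text and does not touch the database.

```python
# Hypothetical helper: plan a gradual reduction of the slurmdbd purge
# retention. Nothing here talks to Slurm; it only prints config lines.

PURGE_KEYS = (
    "PurgeEventAfter",
    "PurgeJobAfter",
    "PurgeResvAfter",
    "PurgeStepAfter",
    "PurgeSuspendAfter",
)


def purge_schedule(start_days, target_days, step_days):
    """Return decreasing retention values, ending exactly at target_days."""
    if step_days <= 0:
        raise ValueError("step_days must be positive")
    if start_days < target_days:
        raise ValueError("start_days must not be below target_days")
    # Walk down from start_days in fixed steps, stopping above the target.
    values = list(range(start_days, target_days, -step_days))
    # Always finish on the target itself, even if the step does not divide evenly.
    values.append(target_days)
    return values


def render_purge_lines(days):
    """Format the five Purge*After settings for one retention value."""
    return "\n".join(f"{key}={days}days" for key in PURGE_KEYS)


if __name__ == "__main__":
    # Assumed numbers: start at 2000 days, end at about 18 months (540 days).
    for round_no, days in enumerate(purge_schedule(2000, 540, 200), start=1):
        print(f"# Round {round_no}")
        print(render_purge_lines(days))
        print()
```

Apply one round at a time, and move on to the next value only after
the previous purge has completed without problems.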
See some further details in
https://wiki.fysik.dtu.dk/niflheim/Slurm_database#setting-database-purge-parameters
Best regards,
Ole