On 7/17/19 12:26 AM, Chris Samuel wrote:
On 16/7/19 11:43 am, Will Dennis wrote:
[2019-07-16T09:36:51.464] error: slurmdbd: agent queue is full (20140),
discarding DBD_STEP_START:1442 request
So it looks like your slurmdbd cannot keep up with the rate of these incoming
steps and is having to throw away messages.
[2019-07-16T09:40:27.515] error: slurmdbd: agent queue filling (20140),
RESTART SLURMDBD NOW
Have you tried doing what it told you to?
You may want to look at the performance of your MySQL server to see if it's
failing to keep up with what slurmdbd is asking it to do.
All the best,
Chris
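
To gauge how often slurmdbd is falling behind, the two error lines quoted
above can be tallied from the slurmctld log. This is only a rough sketch: the
regular expression is modelled on those two lines alone (with each wrapped
line rejoined into one), and the function name summarize_agent_queue is made
up for illustration, so other Slurm versions may word the messages differently.

```python
import re
from collections import Counter

# Pattern inferred from the two log lines quoted in this thread; it is an
# assumption that other "agent queue" messages follow the same layout.
AGENT_QUEUE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] error: slurmdbd: agent queue "
    r"(?P<state>is full|filling) \((?P<size>\d+)\)"
    r"(?:, discarding (?P<msg>\w+):\d+ request)?"
)


def summarize_agent_queue(lines):
    """Return (largest queue size seen, Counter of discarded message types)."""
    discarded = Counter()
    max_size = 0
    for line in lines:
        match = AGENT_QUEUE_RE.match(line.strip())
        if match is None:
            continue  # not an agent-queue error line
        max_size = max(max_size, int(match.group("size")))
        if match.group("msg"):
            # Only "is full" lines name a message that was thrown away.
            discarded[match.group("msg")] += 1
    return max_size, discarded
```

Feeding it the lines of the slurmctld log (for example via open(path)) gives a
quick count of how many step or job records were dropped, which is roughly how
many accounting entries may need cleaning up afterwards.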
Once you have the database performance issues addressed, sacctmgr can clean up
the entries for completed jobs listed as running.
'sacctmgr list/show runawayjobs'
RunawayJobs
Used only with the list or show command to report current jobs
that have been orphaned on the local cluster and are now runaway. If there are
jobs in this state it will also give you an option to "fix" them.
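
Before running the runaway-jobs cleanup, it may be worth confirming that the
backlog between slurmctld and slurmdbd has actually drained. The sketch below
assumes that sdiag prints a line of the form "DBD Agent queue size: N"; please
check that against the output on your own version. The helper names are
hypothetical, not part of Slurm.

```python
import re
import subprocess


def dbd_agent_queue_size(sdiag_text):
    """Extract the DBD agent queue size from sdiag output, or None if absent.

    Assumes a line like "DBD Agent queue size: 0" appears in the text.
    """
    match = re.search(r"DBD Agent queue size:\s*(\d+)", sdiag_text)
    return int(match.group(1)) if match else None


def read_sdiag():
    """Run sdiag and return its standard output (needs Slurm installed)."""
    result = subprocess.run(
        ["sdiag"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling dbd_agent_queue_size(read_sdiag()) on the controller host should return
a small number once the database is keeping up; at that point
'sacctmgr show runawayjobs' is a reasonable next step.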