Hello,

I'm sure this issue has been answered before, but I'm trying to clean up runaway jobs with:


sacctmgr -vvvv show runawayjobs


I get a very (very) long list of records and after a while the command crashes with the following error message:


sacctmgr: error: _conn_readable: persistent connection for fd 3 experienced error[104]: Connection reset by peer
sacctmgr: debug2: _slurm_connect: failed to connect to 127.0.1.1:6819: Connection refused
sacctmgr: debug2: Error connecting slurm stream socket at 127.0.1.1:6819: Connection refused
sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:master1:6819: Connection refused
sacctmgr: error: Getting response to message type: MsgType=1488
sacctmgr: error: Failed to fix runaway job: Unspecified error


The slurmdbd daemon also crashes (maybe I should increase its debug log level):


Dec 21 15:53:18 master1 systemd[1]: slurmdbd.service: main process exited, code=exited, status=1/FAILURE
Dec 21 15:53:18 master1 systemd[1]: Unit slurmdbd.service entered failed state.
Dec 21 15:53:18 master1 systemd[1]: slurmdbd.service failed.


I'm running slurm 21.08.8-2.


Not sure if this is related, but I tried increasing innodb_buffer_pool_size to 32G in the MySQL configuration, without success.


Any help would be greatly appreciated.


--
Julien Rey

Plate-forme RPBS
Unité BFA - CMPLI
Université de Paris
tel: 01 57 27 83 95

