Hi, happy new year!
I ran into these messages while diagnosing a cgroup bug in kernel 2.6.32-431.29.2.el6, where cancelling a large number of jobs at once caused the system to crash. Anyhoo, after updating the kernel, the node stays stable during a mass job cancel. However, I noticed these messages appearing during a job cancel:

Jan 8 09:03:41 cn6 slurmstepd[45357]: done with job
Jan 8 09:03:41 cn6 slurmstepd[45049]: done with job
Jan 8 09:03:42 cn6 slurmstepd[45115]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
Jan 8 09:03:42 cn6 slurmstepd[45115]: done with job
Jan 8 09:03:42 cn6 slurmstepd[45704]: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
Jan 8 09:03:42 cn6 slurmstepd[45704]: done with job
Jan 8 09:03:42 cn6 slurmstepd[45593]: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
Jan 8 09:03:42 cn6 slurmstepd[45593]: done with job
Jan 8 09:03:42 cn6 slurmstepd[45153]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
Jan 8 09:03:42 cn6 slurmstepd[45153]: done with job
Jan 8 09:03:42 cn6 slurmstepd[45183]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
Jan 8 09:03:42 cn6 slurmstepd[45183]: done with job
Jan 8 09:03:42 cn6 slurmstepd[45798]: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
Jan 8 09:03:42 cn6 slurmstepd[45798]: done with job
Jan 8 09:03:42 cn6 slurmstepd[45233]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
Jan 8 09:03:42 cn6 slurmstepd[45233]: done with job
Jan 8 09:03:42 cn6 slurmstepd[45642]: error: Failed to send MESSAGE_TASK_EXIT: Connection refused

What are the "Connection refused" messages about? Is this normal? Otherwise the node seems fine.

I also see the following message now and then. It doesn't make sense to me, because job details are successfully being written to the /var/spool/slurm/slurmd directory on the client:

Jan 8 09:14:27 cn6 slurmd[2799]: error: _step_connect: connect() failed dir /var/spool/slurm/slurmd node cn6 job 874918 step -2 No such file or directory

This is with slurm-14.11.2-1.

Thanks!
Chris

--
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167