Hi,

Happy new year!

I ran into these messages while diagnosing a cgroup bug with kernel
2.6.32-431.29.2.el6, where cancelling a bunch of jobs at once caused the
system to crash.  After updating the kernel, the node stays stable during
a mass job cancel.  But I noticed these messages, which occur during a
job cancel:

Jan  8 09:03:41 cn6 slurmstepd[45357]: done with job
Jan  8 09:03:41 cn6 slurmstepd[45049]: done with job
Jan  8 09:03:42 cn6 slurmstepd[45115]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
Jan  8 09:03:42 cn6 slurmstepd[45115]: done with job
Jan  8 09:03:42 cn6 slurmstepd[45704]: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
Jan  8 09:03:42 cn6 slurmstepd[45704]: done with job
Jan  8 09:03:42 cn6 slurmstepd[45593]: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
Jan  8 09:03:42 cn6 slurmstepd[45593]: done with job
Jan  8 09:03:42 cn6 slurmstepd[45153]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
Jan  8 09:03:42 cn6 slurmstepd[45153]: done with job
Jan  8 09:03:42 cn6 slurmstepd[45183]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
Jan  8 09:03:42 cn6 slurmstepd[45183]: done with job
Jan  8 09:03:42 cn6 slurmstepd[45798]: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
Jan  8 09:03:42 cn6 slurmstepd[45798]: done with job
Jan  8 09:03:42 cn6 slurmstepd[45233]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
Jan  8 09:03:42 cn6 slurmstepd[45233]: done with job
Jan  8 09:03:42 cn6 slurmstepd[45642]: error: Failed to send MESSAGE_TASK_EXIT: Connection refused


What are the "Connection refused" messages about?  Are they normal?
Otherwise the node seems fine.

I also see the following error now and then, which doesn't make sense to
me: job details are being written successfully to the
/var/spool/slurm/slurmd directory on the client.

Jan  8 09:14:27 cn6 slurmd[2799]: error: _step_connect: connect() failed dir /var/spool/slurm/slurmd node cn6 job 874918 step -2 No such file or directory


This is with slurm-14.11.2-1.

Thanks!

Chris

--
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167

