In my case I tested the firewall. But I'm wondering: do the login nodes need to appear in slurm.conf, and does slurmd need to be running on the login nodes for them to act as submit hosts? Either or both could be my issue.
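A quick sanity-check sketch for the submit-host side (assumptions: the usual client-only setup of slurm.conf plus munge on the login node, config under /etc/slurm, and node2801 being the compute node from the thread):

  $ scontrol ping                              # can the login node reach slurmctld at all?
  $ md5sum /etc/slurm/slurm.conf               # same checksum as on the head/compute nodes?
  $ ssh node2801 md5sum /etc/slurm/slurm.conf
  $ munge -n | ssh node2801 unmunge            # munge credentials accepted in both directions?
  $ ssh node2801 munge -n | unmunge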
On Fri, Mar 9, 2018 at 12:58 PM, Nicholas McCollum <nmccol...@asc.edu> wrote:
> Connection refused makes me think a firewall issue.
>
> Assuming this is a test environment, could you try on the compute node:
>
> # iptables-save > iptables.bak
> # iptables -F && iptables -X
>
> Then test to see if it works. To restore the firewall use:
>
> # iptables-restore < iptables.bak
>
> You may have to use...
>
> # systemctl stop firewalld
> # systemctl start firewalld
>
> If you use firewalld.
>
> ---
>
> Nicholas McCollum - HPC Systems Expert
> Alabama Supercomputer Authority - CSRA
>
>
> On 03/09/2018 02:45 PM, Andy Georges wrote:
>
>> Hi all,
>>
>> Cranked up the debug level a bit
>>
>> Job was not started when using:
>>
>> vsc40075@test2802 (banette) ~> /bin/salloc -N1 -n1 /bin/srun --pty bash -i
>> salloc: Granted job allocation 42
>> salloc: Waiting for resource configuration
>> salloc: Nodes node2801 are ready for job
>>
>> For comparison purposes, running this on the master (head?) node:
>>
>> vsc40075@master23 () ~> /bin/salloc -N1 -n1 /bin/srun --pty bash -i
>> salloc: Granted job allocation 43
>> salloc: Waiting for resource configuration
>> salloc: Nodes node2801 are ready for job
>> vsc40075@node2801 () ~>
>>
>>
>> Below some more debug output from the hanging job.
>>
>> Kind regards,
>> — Andy
>>
>> [2018-03-09T21:27:52.251] [42.0] debug: _oom_event_monitor: started.
>> [2018-03-09T21:27:52.251] [42.0] debug4: eio: handling events for 1 objects
>> [2018-03-09T21:27:52.251] [42.0] debug3: Called _msg_socket_readable
>> [2018-03-09T21:27:52.251] [42.0] debug4: eio: handling events for 1 objects
>> [2018-03-09T21:27:52.251] [42.0] debug3: Called _msg_socket_readable
>> [2018-03-09T21:27:52.251] [42.0] debug2: Entering _setup_normal_io
>> [2018-03-09T21:27:52.251] [42.0] debug: stdin uses a pty object
>> [2018-03-09T21:27:52.251] [42.0] debug: init pty size 23:119
>> [2018-03-09T21:27:52.251] [42.0] debug4: eio: handling events for 1 objects
>> [2018-03-09T21:27:52.251] [42.0] debug3: Called _msg_socket_readable
>> [2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
>> [2018-03-09T21:27:52.252] [42.0] debug2: Error connecting slurm stream socket at 10.141.21.202:33698: Connection refused
>> [2018-03-09T21:27:52.252] [42.0] error: slurm_open_msg_conn(pty_conn) 10.141.21.202,33698: Connection refused
>> [2018-03-09T21:27:52.252] [42.0] debug4: adding IO connection (logical node rank 0)
>> [2018-03-09T21:27:52.252] [42.0] debug4: connecting IO back to 10.141.21.202:33759
>> [2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
>> [2018-03-09T21:27:52.252] [42.0] debug3: Error connecting, picking new stream port
>> [2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
>> [2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
>> [2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
>> [2018-03-09T21:27:52.252] [42.0] debug2: Error connecting slurm stream socket at 10.141.21.202:33759: Connection refused
>> [2018-03-09T21:27:52.252] [42.0] error: connect io: Connection refused
>> [2018-03-09T21:27:52.252] [42.0] debug2: Leaving _setup_normal_io
>> [2018-03-09T21:27:52.253] [42.0] error: IO setup failed: Connection refused
>> [2018-03-09T21:27:52.253] [42.0] debug3: xcgroup_set_param: parameter 'freezer.state' set to 'THAWED' for '/sys/fs/cgroup/freezer/slurm/uid_2540075/job_42/step_0'
>> [2018-03-09T21:27:52.253] [42.0] debug3: xcgroup_set_uint32_param: parameter 'cgroup.procs' set to '6414' for '/sys/fs/cgroup/freezer'
>> [2018-03-09T21:27:52.253] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/freezer/slurm/uid_2540075/job_42): Device or resource busy
>> [2018-03-09T21:27:52.253] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/freezer/slurm/uid_2540075): Device or resource busy
>> [2018-03-09T21:27:52.253] [42.0] debug: step_terminate_monitor_stop signaling condition
>> [2018-03-09T21:27:52.253] [42.0] debug4: eio: handling events for 1 objects
>> [2018-03-09T21:27:52.253] [42.0] debug3: Called _msg_socket_readable
>> [2018-03-09T21:27:52.253] [42.0] debug2: step_terminate_monitor will run for 60 secs
>> [2018-03-09T21:27:52.253] [42.0] debug2: step_terminate_monitor is stopping
>> [2018-03-09T21:27:52.253] [42.0] debug2: Sending SIGKILL to pgid 6414
>> [2018-03-09T21:27:52.253] [42.0] debug3: xcgroup_set_uint32_param: parameter 'cgroup.procs' set to '6414' for '/sys/fs/cgroup/cpuset'
>> [2018-03-09T21:27:52.265] [42.0] debug3: Took 1038 checks before stepd pid was removed from the step cgroup.
>> [2018-03-09T21:27:52.265] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuset/slurm/uid_2540075/job_42): Device or resource busy
>> [2018-03-09T21:27:52.265] [42.0] debug2: task/cgroup: not removing job cpuset : Device or resource busy
>> [2018-03-09T21:27:52.265] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuset/slurm/uid_2540075): Device or resource busy
>> [2018-03-09T21:27:52.265] [42.0] debug2: task/cgroup: not removing user cpuset : Device or resource busy
>> [2018-03-09T21:27:52.315] [42.0] debug3: _oom_event_monitor: res: 1
>> [2018-03-09T21:27:52.315] [42.0] _oom_event_monitor: oom-kill event count: 1
>> [2018-03-09T21:27:52.315] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm/uid_2540075/job_42): Device or resource busy
>> [2018-03-09T21:27:52.315] [42.0] debug2: task/cgroup: not removing job memcg : Device or resource busy
>> [2018-03-09T21:27:52.315] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm/uid_2540075): Device or resource busy
>> [2018-03-09T21:27:52.315] [42.0] debug2: task/cgroup: not removing user memcg : Device or resource busy
>> [2018-03-09T21:27:52.315] [42.0] debug2: Before call to spank_fini()
>> [2018-03-09T21:27:52.315] [42.0] debug2: After call to spank_fini()
>> [2018-03-09T21:27:52.315] [42.0] error: job_manager exiting abnormally, rc = 4021
>> [2018-03-09T21:27:52.315] [42.0] debug: Sending launch resp rc=4021
>> [2018-03-09T21:27:52.315] [42.0] debug2: slurm_connect failed: Connection refused
>> [2018-03-09T21:27:52.315] [42.0] debug2: Error connecting slurm stream socket at 10.141.21.202:37053: Connection refused
>> [2018-03-09T21:27:52.315] [42.0] error: _send_launch_resp: Failed to send RESPONSE_LAUNCH_TASKS: Connection refused
>> [2018-03-09T21:27:52.315] [42.0] debug2: Rank 0 has no children slurmstepd
>> [2018-03-09T21:27:52.315] [42.0] debug2: _one_step_complete_msg: first=0, last=0
>> [2018-03-09T21:27:52.315] [42.0] debug3: Rank 0 sending complete to slurmctld, range 0 to 0
>> [2018-03-09T21:27:52.317] [42.0] debug4: eio: handling events for 1 objects
>> [2018-03-09T21:27:52.317] [42.0] debug3: Called _msg_socket_readable
>> [2018-03-09T21:27:52.317] [42.0] debug2: false, shutdown
>> [2018-03-09T21:27:52.317] [42.0] debug: Message thread exited
>> [2018-03-09T21:27:52.317] [42.0] done with job
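Reading the log above: every refused connection is from node2801 back to 10.141.21.202 (presumably the submitting login node) on high, per-job ports, which fits the firewall theory. A less invasive sketch than flushing iptables, assuming firewalld and otherwise default Slurm ports (6817 for slurmctld, 6818 for slurmd), is to pin srun's callback ports to a fixed range with SrunPortRange in slurm.conf and open only that range on the login node; the 60001-63000 range here is arbitrary:

  # in slurm.conf (same file on every node):
  SrunPortRange=60001-63000

  # on the login node:
  # firewall-cmd --permanent --add-port=60001-63000/tcp
  # firewall-cmd --permanent --add-port=6817-6818/tcp
  # firewall-cmd --reload

After distributing the updated slurm.conf, an scontrol reconfigure (or a daemon restart) keeps everything reading the same range.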
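Before changing any rules, a throw-away listener on the login node plus a probe from the compute node would confirm whether high ports get through at all. 33759 is just one of the ports from the log; the real ports vary per job:

  # on the login node (10.141.21.202):
  $ nc -l 33759                     # use: nc -l -p 33759 with traditional netcat

  # on node2801:
  $ nc -zv 10.141.21.202 33759      # or telnet 10.141.21.202 33759 if your nc lacks -z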