OK, I'm eating my words now. I may well have had multiple issues before, but right now stopping the firewall is what allows salloc to work. Can anyone suggest iptables rules specific to Slurm, or a way to restrict Slurm communications to the right network?
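To be concrete about what I'm picturing, here is a minimal sketch only, assuming the default slurmctld/slurmd ports 6817 and 6818 and that the cluster-internal network is 10.141.0.0/16 as in the log below (both are assumptions, adjust to your site):

# iptables -A INPUT -p tcp -s 10.141.0.0/16 --dport 6817:6818 -j ACCEPT   # slurmctld + slurmd; source subnet is an assumption

That only covers the daemons, though. The "Connection refused" lines further down are slurmstepd on the compute node trying to connect back to ephemeral ports that srun/salloc opened on the submit host, and those are random by default. They can be pinned to a fixed range with SrunPortRange in slurm.conf (the 60001-63000 below is just an example range) and then opened the same way on the login nodes:

SrunPortRange=60001-63000

# iptables -A INPUT -p tcp -s 10.141.0.0/16 --dport 60001:63000 -j ACCEPT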
On Fri, Mar 9, 2018 at 1:10 PM, Mark M <plak...@gmail.com> wrote:
>
> In my case I tested firewall. But I'm wondering if the login nodes need to
> appear in the slurm.conf, and also if slurmd needs to be running on the
> login nodes in order for them to be a submit host? Either or both could be
> my issue.
>
> On Fri, Mar 9, 2018 at 12:58 PM, Nicholas McCollum <nmccol...@asc.edu>
> wrote:
>
>> Connection refused makes me think a firewall issue.
>>
>> Assuming this is a test environment, could you try on the compute node:
>>
>> # iptables-save > iptables.bak
>> # iptables -F && iptables -X
>>
>> Then test to see if it works. To restore the firewall use:
>>
>> # iptables-restore < iptables.bak
>>
>> You may have to use...
>>
>> # systemctl stop firewalld
>> # systemctl start firewalld
>>
>> If you use firewalld.
>>
>> ---
>>
>> Nicholas McCollum - HPC Systems Expert
>> Alabama Supercomputer Authority - CSRA
>>
>>
>> On 03/09/2018 02:45 PM, Andy Georges wrote:
>>
>>> Hi all,
>>>
>>> Cranked up the debug level a bit
>>>
>>> Job was not started when using:
>>>
>>> vsc40075@test2802 (banette) ~> /bin/salloc -N1 -n1 /bin/srun --pty bash -i
>>> salloc: Granted job allocation 42
>>> salloc: Waiting for resource configuration
>>> salloc: Nodes node2801 are ready for job
>>>
>>> For comparison purposes, running this on the master (head?) node:
>>>
>>> vsc40075@master23 () ~> /bin/salloc -N1 -n1 /bin/srun --pty bash -i
>>> salloc: Granted job allocation 43
>>> salloc: Waiting for resource configuration
>>> salloc: Nodes node2801 are ready for job
>>> vsc40075@node2801 () ~>
>>>
>>>
>>> Below some more debug output from the hanging job.
>>>
>>> Kind regards,
>>> — Andy
>>>
>>> [2018-03-09T21:27:52.251] [42.0] debug: _oom_event_monitor: started.
>>> [2018-03-09T21:27:52.251] [42.0] debug4: eio: handling events for 1 objects
>>> [2018-03-09T21:27:52.251] [42.0] debug3: Called _msg_socket_readable
>>> [2018-03-09T21:27:52.251] [42.0] debug4: eio: handling events for 1 objects
>>> [2018-03-09T21:27:52.251] [42.0] debug3: Called _msg_socket_readable
>>> [2018-03-09T21:27:52.251] [42.0] debug2: Entering _setup_normal_io
>>> [2018-03-09T21:27:52.251] [42.0] debug: stdin uses a pty object
>>> [2018-03-09T21:27:52.251] [42.0] debug: init pty size 23:119
>>> [2018-03-09T21:27:52.251] [42.0] debug4: eio: handling events for 1 objects
>>> [2018-03-09T21:27:52.251] [42.0] debug3: Called _msg_socket_readable
>>> [2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
>>> [2018-03-09T21:27:52.252] [42.0] debug2: Error connecting slurm stream socket at 10.141.21.202:33698: Connection refused
>>> [2018-03-09T21:27:52.252] [42.0] error: slurm_open_msg_conn(pty_conn) 10.141.21.202,33698: Connection refused
>>> [2018-03-09T21:27:52.252] [42.0] debug4: adding IO connection (logical node rank 0)
>>> [2018-03-09T21:27:52.252] [42.0] debug4: connecting IO back to 10.141.21.202:33759
>>> [2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
>>> [2018-03-09T21:27:52.252] [42.0] debug3: Error connecting, picking new stream port
>>> [2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
>>> [2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
>>> [2018-03-09T21:27:52.252] [42.0] debug2: slurm_connect failed: Connection refused
>>> [2018-03-09T21:27:52.252] [42.0] debug2: Error connecting slurm stream socket at 10.141.21.202:33759: Connection refused
>>> [2018-03-09T21:27:52.252] [42.0] error: connect io: Connection refused
>>> [2018-03-09T21:27:52.252] [42.0] debug2: Leaving _setup_normal_io
>>> [2018-03-09T21:27:52.253] [42.0] error: IO setup failed: Connection refused
>>> [2018-03-09T21:27:52.253] [42.0] debug3: xcgroup_set_param: parameter 'freezer.state' set to 'THAWED' for '/sys/fs/cgroup/freezer/slurm/uid_2540075/job_42/step_0'
>>> [2018-03-09T21:27:52.253] [42.0] debug3: xcgroup_set_uint32_param: parameter 'cgroup.procs' set to '6414' for '/sys/fs/cgroup/freezer'
>>> [2018-03-09T21:27:52.253] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/freezer/slurm/uid_2540075/job_42): Device or resource busy
>>> [2018-03-09T21:27:52.253] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/freezer/slurm/uid_2540075): Device or resource busy
>>> [2018-03-09T21:27:52.253] [42.0] debug: step_terminate_monitor_stop signaling condition
>>> [2018-03-09T21:27:52.253] [42.0] debug4: eio: handling events for 1 objects
>>> [2018-03-09T21:27:52.253] [42.0] debug3: Called _msg_socket_readable
>>> [2018-03-09T21:27:52.253] [42.0] debug2: step_terminate_monitor will run for 60 secs
>>> [2018-03-09T21:27:52.253] [42.0] debug2: step_terminate_monitor is stopping
>>> [2018-03-09T21:27:52.253] [42.0] debug2: Sending SIGKILL to pgid 6414
>>> [2018-03-09T21:27:52.253] [42.0] debug3: xcgroup_set_uint32_param: parameter 'cgroup.procs' set to '6414' for '/sys/fs/cgroup/cpuset'
>>> [2018-03-09T21:27:52.265] [42.0] debug3: Took 1038 checks before stepd pid was removed from the step cgroup.
>>> [2018-03-09T21:27:52.265] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuset/slurm/uid_2540075/job_42): Device or resource busy
>>> [2018-03-09T21:27:52.265] [42.0] debug2: task/cgroup: not removing job cpuset : Device or resource busy
>>> [2018-03-09T21:27:52.265] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuset/slurm/uid_2540075): Device or resource busy
>>> [2018-03-09T21:27:52.265] [42.0] debug2: task/cgroup: not removing user cpuset : Device or resource busy
>>> [2018-03-09T21:27:52.315] [42.0] debug3: _oom_event_monitor: res: 1
>>> [2018-03-09T21:27:52.315] [42.0] _oom_event_monitor: oom-kill event count: 1
>>> [2018-03-09T21:27:52.315] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm/uid_2540075/job_42): Device or resource busy
>>> [2018-03-09T21:27:52.315] [42.0] debug2: task/cgroup: not removing job memcg : Device or resource busy
>>> [2018-03-09T21:27:52.315] [42.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm/uid_2540075): Device or resource busy
>>> [2018-03-09T21:27:52.315] [42.0] debug2: task/cgroup: not removing user memcg : Device or resource busy
>>> [2018-03-09T21:27:52.315] [42.0] debug2: Before call to spank_fini()
>>> [2018-03-09T21:27:52.315] [42.0] debug2: After call to spank_fini()
>>> [2018-03-09T21:27:52.315] [42.0] error: job_manager exiting abnormally, rc = 4021
>>> [2018-03-09T21:27:52.315] [42.0] debug: Sending launch resp rc=4021
>>> [2018-03-09T21:27:52.315] [42.0] debug2: slurm_connect failed: Connection refused
>>> [2018-03-09T21:27:52.315] [42.0] debug2: Error connecting slurm stream socket at 10.141.21.202:37053: Connection refused
>>> [2018-03-09T21:27:52.315] [42.0] error: _send_launch_resp: Failed to send RESPONSE_LAUNCH_TASKS: Connection refused
>>> [2018-03-09T21:27:52.315] [42.0] debug2: Rank 0 has no children slurmstepd
>>> [2018-03-09T21:27:52.315] [42.0] debug2: _one_step_complete_msg: first=0, last=0
>>> [2018-03-09T21:27:52.315] [42.0] debug3: Rank 0 sending complete to slurmctld, range 0 to 0
>>> [2018-03-09T21:27:52.317] [42.0] debug4: eio: handling events for 1 objects
>>> [2018-03-09T21:27:52.317] [42.0] debug3: Called _msg_socket_readable
>>> [2018-03-09T21:27:52.317] [42.0] debug2: false, shutdown
>>> [2018-03-09T21:27:52.317] [42.0] debug: Message thread exited
>>> [2018-03-09T21:27:52.317] [42.0] done with job
>>>
>>
>
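One thing worth noting in Andy's log: every "Connection refused" is slurmstepd on node2801 failing to connect back to 10.141.21.202, which looks like the submit host, so whatever rules are chosen have to be applied on the login/submit nodes as well, not only on the compute nodes. If firewalld is managing the rules rather than plain iptables, a blunt but simple alternative is to put the cluster-internal interface into the trusted zone; a sketch, with eth1 standing in for whatever the internal interface is actually called:

# firewall-cmd --permanent --zone=trusted --change-interface=eth1   # eth1 is a placeholder
# firewall-cmd --reload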