Hello all,

(This _might_ be related to https://svn.open-mpi.org/trac/ompi/ticket/1505)
I just compiled and installed 1.3.3 in a CentOS 5 environment, and we noticed the processes would deadlock as soon as they started using TCP communications. The test program has been running on other clusters for years with no problems. Furthermore, using only local cores doesn't deadlock the processes, whereas forcing inter-node communications (-bynode scheduling) immediately causes the problem.

Symptoms:
- Processes don't crash or die; they use 100% CPU in system space (as opposed to user space).
- stracing one of the processes shows it is freewheeling in a polling loop.
- Executing with --mca btl_base_verbose 30 shows weird port assignments: either they are wrong, or they should be interpreted as an offset from the default btl_tcp_port_min_v4 (1024).
- The error "mca_btl_tcp_endpoint_complete_connect] connect() to <IP ADDR> failed: No route to host (113)" _may_ be seen. We noticed it only showed up if we had vmnet interfaces up and running on certain nodes.

Note that setting

  oob_tcp_listen_mode=listen_thread
  oob_tcp_if_include=eth0
  btl_tcp_if_include=eth0

was one of our first reactions to this, to no avail.

Workaround we found: while keeping the above-mentioned MCA parameters, we added btl_tcp_port_min_v4=2000 because of some firewall rules (which we had obviously disabled as part of the troubleshooting process) and noticed everything started working correctly from then on. This seems to work, but I can find no logical explanation, as the code seems to be clean in that respect.
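A possible reading of the odd port numbers in the verbose output pasted below, offered as an assumption and not something confirmed in the Open MPI source: if the debug message prints the port in network byte order (that is, without an ntohs() conversion) on a little-endian host, then swapping the two bytes of each logged value should give the real port. The short Python sketch below does only that arithmetic on the logged values; the helper name swap16 is made up for illustration.

```python
def swap16(port: int) -> int:
    """Swap the two bytes of a 16-bit value (the effect of ntohs on a little-endian host)."""
    return ((port & 0xFF) << 8) | ((port >> 8) & 0xFF)


# Distinct port values taken from the btl_base_verbose output in this post.
logged_ports = [2052, 3076, 260, 3588, 1540, 516, 1028, 2564, 4]

for p in logged_ports:
    print(f"logged {p:5d} -> byte-swapped {swap16(p)}")

# Range covered by the byte-swapped values.
swapped = [swap16(p) for p in logged_ports]
print("min:", min(swapped), "max:", max(swapped))
```

Every byte-swapped value falls between 1024 and 1038, which is consistent with ports being handed out upward from the default btl_tcp_port_min_v4 of 1024. If that reading is right, the numbers in the log are only a display quirk; it does not by itself explain the deadlock or why raising btl_tcp_port_min_v4 to 2000 helps.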
Some pasting for people searching frantically for a solution:

[cluster-srv2:20379] btl: tcp: attempting to connect() to address 10.194.32.113 on port 2052
[cluster-srv2:20381] btl: tcp: attempting to connect() to address 10.194.32.113 on port 3076
[cluster-srv2:20377] btl: tcp: attempting to connect() to address 10.194.32.113 on port 260
[cluster-srv2:20383] btl: tcp: attempting to connect() to address 10.194.32.113 on port 3588
[cluster-srv1:19900] btl: tcp: attempting to connect() to address 10.194.32.117 on port 1540
[cluster-srv2:20377] btl: tcp: attempting to connect() to address 10.194.32.117 on port 2052
[cluster-srv2:20383] btl: tcp: attempting to connect() to address 10.194.32.117 on port 3076
[cluster-srv1:19894] btl: tcp: attempting to connect() to address 10.194.32.117 on port 516
[cluster-srv2:20379] btl: tcp: attempting to connect() to address 10.194.32.117 on port 3588
[cluster-srv1:19898] btl: tcp: attempting to connect() to address 10.194.32.117 on port 1028
[cluster-srv2:20381] btl: tcp: attempting to connect() to address 10.194.32.117 on port 2564
[cluster-srv1:19896] btl: tcp: attempting to connect() to address 10.194.32.117 on port 4
[cluster-srv3:13665] btl: tcp: attempting to connect() to address 10.194.32.115 on port 1028
[cluster-srv3:13663] btl: tcp: attempting to connect() to address 10.194.32.115 on port 4
[cluster-srv2][[44096,1],9][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
[cluster-srv2][[44096,1],13][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
connect() to 10.194.32.117 failed: No route to host (113)
connect() to 10.194.32.117 failed: No route to host (113)
[cluster-srv3][[44096,1],20][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
connect() to 10.194.32.115 failed: No route to host (113)

Cheers!

Eric Thiboedau