Hello all,

   (this _might_ be related to https://svn.open-mpi.org/trac/ompi/ticket/1505)

   I just compiled and installed 1.3.3 in a CentOS 5 environment and we noticed
that the processes deadlock as soon as they start using TCP communications. The
test program is one that has been running on other clusters for years with no
problems. Furthermore, using only local cores doesn't deadlock the processes,
whereas forcing inter-node communications (-bynode scheduling) immediately
causes the problem.

Symptoms:
- Processes don't crash or die; they use 100% CPU in system space (as opposed to
user space).
- stracing one of the processes shows it freewheeling in a polling loop.
- Executing with --mca btl_base_verbose 30 shows weird port assignments: either
they are wrong, or they should be interpreted as offsets from the default
btl_tcp_port_min_v4 (1024).
- The error "mca_btl_tcp_endpoint_complete_connect] connect() to <IP ADDR>
failed: No route to host (113)" _may_ be seen. We noticed it only showed up if
we had vmnet interfaces up and running on certain nodes. Note that setting

 oob_tcp_listen_mode=listen_thread
 oob_tcp_if_include=eth0
 btl_tcp_if_include=eth0

was one of our first reactions to this, to no avail.

Workaround we found:

While keeping the above-mentioned MCA parameters, we added
btl_tcp_port_min_v4=2000 because of some firewall rules (which we had obviously
disabled as part of the troubleshooting process) and noticed that everything
seemed to start working correctly from then on.

This seems to work, but I can find no logical explanation, as the code seems to
be clean in that respect.
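
For anyone who wants to reproduce the workaround from the command line, here is
a minimal sketch in Python that only assembles the mpirun arguments. The MCA
names and values are the ones quoted above, and --mca and -bynode are the
options already mentioned in this message. The program name (./mpi_test) and
the process count (8) are placeholders I made up for illustration, not our real
job.

```python
def mpirun_argv(program, nprocs, mca_params):
    """Build an mpirun argument list with one '--mca name value' pair per parameter."""
    argv = ["mpirun", "-np", str(nprocs), "-bynode"]
    for name, value in mca_params.items():
        argv += ["--mca", name, str(value)]
    argv.append(program)
    return argv


# MCA parameters from this message; the last one is the workaround.
workaround = {
    "oob_tcp_listen_mode": "listen_thread",
    "oob_tcp_if_include": "eth0",
    "btl_tcp_if_include": "eth0",
    "btl_tcp_port_min_v4": 2000,
}

# "./mpi_test" and 8 are hypothetical placeholders.
print(" ".join(mpirun_argv("./mpi_test", 8, workaround)))
```

The resulting list can be passed to subprocess.run() as is, or the printed line
can be pasted into a shell.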

Some pasting for people searching frantically for a solution:

[cluster-srv2:20379] btl: tcp: attempting to connect() to address 10.194.32.113 on port 2052
[cluster-srv2:20381] btl: tcp: attempting to connect() to address 10.194.32.113 on port 3076
[cluster-srv2:20377] btl: tcp: attempting to connect() to address 10.194.32.113 on port 260
[cluster-srv2:20383] btl: tcp: attempting to connect() to address 10.194.32.113 on port 3588
[cluster-srv1:19900] btl: tcp: attempting to connect() to address 10.194.32.117 on port 1540
[cluster-srv2:20377] btl: tcp: attempting to connect() to address 10.194.32.117 on port 2052
[cluster-srv2:20383] btl: tcp: attempting to connect() to address 10.194.32.117 on port 3076
[cluster-srv1:19894] btl: tcp: attempting to connect() to address 10.194.32.117 on port 516
[cluster-srv2:20379] btl: tcp: attempting to connect() to address 10.194.32.117 on port 3588
[cluster-srv1:19898] btl: tcp: attempting to connect() to address 10.194.32.117 on port 1028
[cluster-srv2:20381] btl: tcp: attempting to connect() to address 10.194.32.117 on port 2564
[cluster-srv1:19896] btl: tcp: attempting to connect() to address 10.194.32.117 on port 4
[cluster-srv3:13665] btl: tcp: attempting to connect() to address 10.194.32.115 on port 1028
[cluster-srv3:13663] btl: tcp: attempting to connect() to address 10.194.32.115 on port 4
[cluster-srv2][[44096,1],9][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
[cluster-srv2][[44096,1],13][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
connect() to 10.194.32.117 failed: No route to host (113)
connect() to 10.194.32.117 failed: No route to host (113)
[cluster-srv3][[44096,1],20][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
connect() to 10.194.32.115 failed: No route to host (113)
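
One possible reading of the port numbers in the paste, offered as an assumption
and not as a confirmed diagnosis: if the verbose message prints the port in
network byte order (that is, without converting it back to host order on our
little-endian machines), then swapping the two bytes of each logged value
should give the real port. The small Python sketch below does just that for the
distinct ports seen in the paste. The helper name swap16 is mine, and I have
not verified this against the 1.3.3 source.

```python
import struct


def swap16(port):
    """Swap the two bytes of a 16-bit port number."""
    return struct.unpack("<H", struct.pack(">H", port))[0]


# Distinct port values copied from the verbose output.
logged = [2052, 3076, 260, 3588, 1540, 516, 1028, 2564, 4]

swapped = sorted(swap16(p) for p in logged)
print(swapped)
# -> [1024, 1025, 1026, 1028, 1030, 1032, 1034, 1036, 1038]
```

Every swapped value lands at or just above 1024, the default
btl_tcp_port_min_v4, so the "weird" ports may only be a display issue in the
verbose message. This does not by itself explain the deadlock or why moving the
minimum port to 2000 helps.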

Cheers!

Eric Thiboedau
