Hi folks:
This is a deeper dive into the code that was giving me fits over the
last two weeks.
It uses MPI_Startall and MPI_Waitsome to start requests and monitor their
progress.
More on that in a moment.
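
For those who don't use that pair of calls often, the pattern in the code is
roughly the one sketched below: persistent requests created once, started
together with MPI_Startall, and drained with MPI_Waitsome. This is a minimal,
made-up example; the buffer sizes, tags, and ring communication pattern are
illustrative only and are not taken from the real application.

      program startall_waitsome_sketch
c     Minimal sketch (not the actual application code) of the
c     MPI_STARTALL / MPI_WAITSOME pattern described above: each rank
c     keeps a persistent send and receive to its ring neighbours,
c     starts them together, and drains the completions.
      implicit none
      include 'mpif.h'
      integer nreq, buflen
      parameter (nreq = 2, buflen = 1024)
      integer ierr, rank, nprocs, left, right, i, ndone, nleft
      integer reqs(nreq), indices(nreq)
      integer statuses(MPI_STATUS_SIZE, nreq)
      double precision sendbuf(buflen), recvbuf(buflen)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      left  = mod(rank - 1 + nprocs, nprocs)
      right = mod(rank + 1, nprocs)
      do i = 1, buflen
         sendbuf(i) = dble(rank)
      end do

c     persistent requests, created once
      call MPI_SEND_INIT(sendbuf, buflen, MPI_DOUBLE_PRECISION,
     &     right, 99, MPI_COMM_WORLD, reqs(1), ierr)
      call MPI_RECV_INIT(recvbuf, buflen, MPI_DOUBLE_PRECISION,
     &     left, 99, MPI_COMM_WORLD, reqs(2), ierr)

c     start everything at once, then drain completions in batches
      call MPI_STARTALL(nreq, reqs, ierr)
      nleft = nreq
      do while (nleft .gt. 0)
c        in the real code this is the call that appears to spin
c        forever (tight poll loop) on the larger runs
         call MPI_WAITSOME(nreq, reqs, ndone, indices, statuses, ierr)
         if (ndone .eq. MPI_UNDEFINED) exit
         nleft = nleft - ndone
      end do

      do i = 1, nreq
         call MPI_REQUEST_FREE(reqs(i), ierr)
      end do
      call MPI_FINALIZE(ierr)
      end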
The testing I have done to date on this platform suggests that
OpenMPI is working fine, though I don't normally exercise these two API
functions. Other MPI codes run without problem. The gigabit and IB
networks are operational, with no issues that I can spot.
The symptoms:
1) Smaller test cases *sometimes* work and sometimes hang. When they hang,
strace shows a tight loop around poll(). Changing the btl from the default
to tcp,self seems to help: the small test jobs then run repeatedly to
completion. The same binary on a larger case with more CPUs (64 vs. 4) does
not work, regardless of the btl settings.
2) This happens with OpenMPI 1.2.2, 1.2.6, and 1.2.7. I will check other
stacks as well, but my hope is to stick with OpenMPI because of its nice
(sane) integration with SGE.
3) Using the btl parameter to turn off sm and openib (forcing TCP; see the
example mpirun lines after the log excerpt below) generates lots of these
messages:
[c1-8][0,1,4][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
[c1-8][0,1,5][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
[c1-8][0,1,6][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
[c1-5][0,1,24][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
[c1-5][0,1,28][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
[c1-11][0,1,41][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
[c1-11][0,1,45][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
The FAQ reports that this is a TCP error, and errno 113 corresponds to
"No route to host" (EHOSTUNREACH).
This is wrong: all the nodes are visible from all the other nodes on the
private subnet. For example:
scalable:~ # pdsh "ping -c 1 c1-8 | grep '64 bytes'"
c1-1: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.126 ms
c1-12: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.067 ms
c1-13: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.127 ms
c1-11: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.084 ms
c1-4: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.090 ms
c1-16: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.116 ms
c1-14: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.076 ms
c1-2: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.113 ms
c1-3: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.065 ms
c1-5: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.127 ms
c1-17: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.046 ms
c1-6: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.073 ms
c1-8: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.020 ms
c1-15: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.109 ms
c1-7: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.075 ms
c1-9: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.098 ms
Basically, the problem appears to be that MPI_Waitsome loops forever because
it never sees the posted completions over IB. Over TCP it appears to have
other problems, which do not show up with our other tests.
I am not sure whether this is a bug in the implementation of MPI_Waitsome,
but the odd behavioral differences between the transports, together with the
scaling observation, suggest some sort of buffer-size issue. Are there any
specific knobs for tweaking internal OpenMPI buffer sizes so we can
experiment with this? Should I rebuild OpenMPI with -O0? Should I use the
Intel compiler to build OpenMPI (I am using gcc 4.1.2 right now)? The main
code is in Fortran and we are using Intel 10.1.015. Are there any TCP stack
issues I should be thinking about to deal with the 113 error? (The user
would be OK with TCP while we get IB ironed out.)
Please advise.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: land...@scalableinformatics.com
web : http://www.scalableinformatics.com
http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 866 888 3112
cell : +1 734 612 4615