Hi folks:

This is a deeper dive into the code that was giving me fits over the last two weeks.

It uses MPI_Startall and MPI_Waitsome to launch requests and monitor their progress. More on that in a moment.
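
To make the pattern concrete, here is a minimal sketch (in C for brevity; the real code is Fortran) of how I understand the persistent requests are driven. The names, buffer sizes, tags, and the ring-style exchange are placeholders of mine, not the application's actual communication pattern:

/* Sketch of persistent requests started with MPI_Startall and drained
 * with MPI_Waitsome.  Ranks form a simple ring; sizes/tags are made up. */
#include <mpi.h>

#define BUFLEN 4096
#define NREQ   2

int main(int argc, char **argv)
{
    double sendbuf[BUFLEN], recvbuf[BUFLEN];
    MPI_Request reqs[NREQ];
    MPI_Status  stats[NREQ];
    int indices[NREQ], outcount, completed = 0;
    int rank, nprocs, next, prev;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    next = (rank + 1) % nprocs;
    prev = (rank + nprocs - 1) % nprocs;

    /* Set up persistent requests once ... */
    MPI_Send_init(sendbuf, BUFLEN, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Recv_init(recvbuf, BUFLEN, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... start them all at once ... */
    MPI_Startall(NREQ, reqs);

    /* ... and reap completions.  This is the loop that appears to spin
     * forever on the larger runs. */
    while (completed < NREQ) {
        MPI_Waitsome(NREQ, reqs, &outcount, indices, stats);
        if (outcount == MPI_UNDEFINED)
            break;                 /* no active requests left */
        completed += outcount;
    }

    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);
    MPI_Finalize();
    return 0;
}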

The testing I have done to date on this platform suggests that OpenMPI is working fine, though I don't normally exercise these two API functions. Other MPI codes run without problem. The gigabit and IB networks are operational, with no issues that I can spot.

The symptoms:

1) Smaller test cases *sometimes* work and sometimes hang. When they hang, strace shows a tight loop around poll(). Changing from the default btls to "tcp,self" seems to help a little, and the small test jobs then run repeatedly to completion (example launch lines are shown after the ping output below). The same binary, larger case, more CPUs (64 vs. 4), does not work, regardless of the btl settings.

2) This happens with OpenMPI 1.2.2, 1.2.6, and 1.2.7. I will check other stacks as well, but my hope is to stay with OpenMPI due to its nice (sane) integration with SGE.

3) Using the btl parameter to turn off sm and openib generates lots of these messages:

[c1-8][0,1,4][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[c1-8][0,1,5][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[c1-8][0,1,6][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[c1-5][0,1,24][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[c1-5][0,1,28][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[c1-11][0,1,41][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[c1-11][0,1,45][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113

The FAQ reports that this is a TCP error, and that error number 113 (EHOSTUNREACH) corresponds to

No route to host

This is wrong: all of the nodes are reachable from all of the other nodes on the private subnet. For example:

scalable:~ # pdsh "ping -c 1 c1-8 | grep '64 bytes'"
c1-1: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.126 ms
c1-12: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.067 ms
c1-13: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.127 ms
c1-11: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.084 ms
c1-4: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.090 ms
c1-16: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.116 ms
c1-14: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.076 ms
c1-2: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.113 ms
c1-3: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.065 ms
c1-5: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.127 ms
c1-17: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.046 ms
c1-6: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.073 ms
c1-8: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.020 ms
c1-15: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.109 ms
c1-7: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.075 ms
c1-9: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64 time=0.098 ms
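
For reference, the btl settings mentioned in (1) and (3) are selected on the mpirun command line, roughly along these lines (illustrative only; the real jobs are submitted through SGE, and the binary name here is a placeholder):

mpirun -np 4  --mca btl tcp,self ./solver     (small test case)
mpirun -np 64 --mca btl tcp,self ./solver     (larger 64-process case)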

Basically, the problem appears to be that MPI_Waitsome loops forever because it never sees the posted completions over IB. Over TCP it appears to have other issues, which are also problematic, though they don't show up in other tests.

I am not sure whether this is a bug in the MPI_Waitsome implementation, though the odd behavioral differences between the transports, together with the scaling observation, suggest some sort of buffer-size issue. Are there specific knobs we can turn to tweak OpenMPI's internal buffer sizes and experiment with this? Should I rebuild OpenMPI with -O0? Should I build OpenMPI with the Intel compiler (we are using gcc 4.1.2 right now)? The main code is in Fortran, compiled with Intel 10.1.015. Are there any TCP stack issues I should be thinking about to deal with the errno 113 error? The user would be OK with TCP while we get IB ironed out.
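
In case it helps frame the question, the kind of experiment I had in mind looks roughly like the following. I am guessing at the relevant parameter names (eager limit, TCP socket buffers, interface selection), and the interface name and numeric values are made up, so please correct me if these are the wrong knobs:

mpirun -np 64 --mca btl tcp,self \
       --mca btl_tcp_if_include eth1 \
       --mca btl_tcp_eager_limit 131072 \
       --mca btl_tcp_sndbuf 1048576 \
       --mca btl_tcp_rcvbuf 1048576 \
       ./solver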

Please advise.

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: land...@scalableinformatics.com
web  : http://www.scalableinformatics.com
       http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615
