On Apr 16, 2006, at 1:29 PM, Lee D. Peterson wrote:
Thanks for your help. The hanging problem came back again a day ago. However, I can now run only if I use either "-mca btl_tcp_if_include en0" or "-mca btl_tcp_if_include en1". Using btl_tcp_if_exclude on either en0 or en1 doesn't work.
That's very strange. What happens if you run with "-mca btl_tcp_if_include en0,en1", which will use both devices? The fact that the exclude option doesn't work makes me wonder whether there is another device that appears active somewhere in the cluster. The most likely suspect on an OS X cluster is a FireWire device that has somehow sprouted an address and gotten marked as active. You might want to run "ifconfig -a" on all your nodes and make sure the output is mostly the same.
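For example (the process count and executable name below are just placeholders - substitute your own), something like:

  mpirun -np 4 -mca btl_tcp_if_include en0,en1 ./xhpl

and then on each node:

  ifconfig -a

On OS X, keep an eye out for an fw0 (FireWire) entry that has an inet address and the UP flag set - that's the kind of phantom interface that can throw off the exclude logic.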
Regarding the TCP performance, I ran the HPL benchmark again and typically see 85% to 90% of the LAM-MPI speed, provided the problem size isn't too small.
That would make sense - in certain specific situations, LAM/MPI can exhibit much better latency than Open MPI over TCP (on other interconnects, Open MPI is much faster). We're working on optimizing our TCP stack, but up until now the high-speed interconnects have been the main concern.
Brian

--
Brian Barrett
Open MPI developer
http://www.open-mpi.org/