On Apr 16, 2006, at 1:29 PM, Lee D. Peterson wrote:
Thanks for your help. The hanging problem came back again a day ago. However, I can now run only if I use either "-mca btl_tcp_if_include en0" or "-mca btl_tcp_if_include en1". Using btl_tcp_if_exclude on either en0 or en1 doesn't work.
That's very strange. What happens if you run with "-mca btl_tcp_if_include en0,en1", which will use both devices? The fact that the exclude option doesn't work makes me wonder whether there is another device that appears active somewhere in the cluster. The most likely suspect on an OS X cluster is a FireWire device that has somehow sprouted an address and gotten marked as active. You might want to run "ifconfig -a" on all your nodes and make sure the output is mostly the same.
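For example (the process count and executable name below are just placeholders - substitute your own), something like:

  mpirun -np 4 -mca btl_tcp_if_include en0,en1 ./xhpl

and then on each node:

  ifconfig -a

On OS X, keep an eye out for an fw0 (FireWire) entry that has an inet address and the UP flag set - that's the kind of phantom interface that can throw off the exclude logic.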
Regarding the TCP performance, I ran the HPL benchmark again and typically see 85% to 90% of the LAM-MPI speed, provided the problem size isn't too small.
That would make sense - in certain specific situations, LAM/MPI can exhibit much better latency than Open MPI over TCP (on other interconnects, Open MPI is much faster). We're working on optimizing our TCP stack, but up until now the high-speed interconnects have been the main concern.
Brian

--
Brian Barrett
Open MPI developer
http://www.open-mpi.org/