update 2: (it's like I am talking to myself ... :) must start using decaf ...)

Joe Landman wrote:
Joe Landman wrote:

[...]

OK, fixed this. It turns out we had IPoIB running, and one adapter needed to be brought down and back up. Now the TCP version appears to run, though I still get the strange hangs after a random (never the same) number of iterations.

OK, I turned off IPoIB (OFED 1.2 on this cluster) and disabled ib0 as a TCP port. Now the --mca btl ^openib,sm setting results in working code.
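For anyone following along, that BTL exclusion is just passed on the mpirun command line; a minimal sketch (the binary name and process count here are placeholders, not our actual job):

```
# exclude the openib and shared-memory (sm) BTLs; Open MPI then
# falls back to the tcp and self BTLs for all communication
mpirun --mca btl ^openib,sm -np 8 ./your_app
```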

That said, we have had no issues in the past with other codes on this cluster running under Open MPI over InfiniBand, IPoIB, TCP, or shared memory. It appears that this code's use of MPI_Waitsome simply fails when using openib. When we use the same thing with two TCP ports (IPoIB and gigabit), it fails at random iterations. Yet when we turn off IPoIB, it works (as long as we turn off openib as well).

I am not sure why we have to turn off sm (shared memory) as well, but without doing so the code also fails.

FWIW:  I stuck a few simple

        time_mpi        = MPI_WTIME()

calls in before the MPI_Waitsome calls, to see if this was some sort of timing issue that I could play with.
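Spelled out, the instrumentation looked roughly like this (a sketch only; everything besides time_mpi and the MPI calls is my placeholder naming, not the application's actual variables):

```fortran
! time the MPI_Waitsome call with wall-clock timestamps
time_mpi = MPI_WTIME()
call MPI_WAITSOME(nreq, requests, outcount, indices, statuses, ierr)
time_mpi = MPI_WTIME() - time_mpi
write(*,*) 'MPI_Waitsome elapsed: ', time_mpi, ' s'
```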

We don't need ipoib up. It was simply a convenient way to test the IB network without working hard. So I have turned it off for the moment.

Other MPI codes (with simple send/receives) work fine over openib and other btls.



Joe

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: land...@scalableinformatics.com
web  : http://www.scalableinformatics.com
       http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615
