update 2: (it's like I am talking to myself ... :) must start using
decaf ...)
Joe Landman wrote:
[...]
OK, fixed this. It turns out we had IPoIB going, and one adapter needed
to be brought down and back up. Now the TCP version appears to be
running, though I do get the strange hangs after a random (never the
same) number of iterations.
OK, I turned off IPoIB (OFED 1.2 on this cluster) and disabled ib0 as a
TCP port. Now the --mca btl ^openib,sm setting results in working
code.
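(For reference, the working launch line looks roughly like this; the
process count and binary name here are placeholders, not our actual
invocation:

    mpirun --mca btl ^openib,sm -np 16 ./code

i.e., every BTL except openib and sm, which on this cluster should
leave tcp and self.)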
That said, we have had no issues in the past running other codes on
this cluster with Open MPI over InfiniBand, over IPoIB, over TCP, or
over shared memory. It appears that this code's use of MPI_Waitsome
simply fails when using the openib BTL. When we run the same thing over
two TCP ports (IPoIB and gigabit), it fails at a random iteration. Yet
when we turn off IPoIB, it works (as long as we turn off openib as well).
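For anyone who wants to poke at this, here is a minimal sketch of the
pattern in question (my own toy reconstruction, not our actual code;
the ring exchange and all the names are made up):

    program waitsome_ring
      use mpi
      implicit none
      integer :: ierr, rank, nprocs, left, right
      integer :: reqs(2), indices(2), outcount, ndone
      integer :: statuses(MPI_STATUS_SIZE, 2)
      double precision :: sendval, recvval

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      ! neighbours on a ring
      left  = mod(rank - 1 + nprocs, nprocs)
      right = mod(rank + 1, nprocs)
      sendval = dble(rank)

      ! post a nonblocking receive and send
      call MPI_Irecv(recvval, 1, MPI_DOUBLE_PRECISION, left, 0, &
                     MPI_COMM_WORLD, reqs(1), ierr)
      call MPI_Isend(sendval, 1, MPI_DOUBLE_PRECISION, right, 0, &
                     MPI_COMM_WORLD, reqs(2), ierr)

      ! drain the requests with MPI_Waitsome -- the call that
      ! hangs for us over the openib BTL
      ndone = 0
      do while (ndone < 2)
         call MPI_Waitsome(2, reqs, outcount, indices, statuses, ierr)
         if (outcount == MPI_UNDEFINED) exit
         ndone = ndone + outcount
      end do

      call MPI_Finalize(ierr)
    end program waitsome_ring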
I am not sure why we also have to turn off sm (shared memory), but
with it enabled the code fails as well.
FWIW: I stuck a few simple

    time_mpi = MPI_WTIME()

calls in before the MPI_Waitsome calls, to see if this was some sort of
timing issue that I could play with.
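Concretely, the instrumentation is just a bracketing pair like this (a
sketch of what I added inside the iteration loop; nreq, requests, and
the rest stand in for the real arguments):

    time_start = MPI_WTIME()
    call MPI_WAITSOME(nreq, requests, outcount, indices, statuses, ierr)
    time_mpi   = MPI_WTIME() - time_start
    ! flag any unusually slow completion
    if (time_mpi > 1.0d0) write(*,*) 'slow MPI_Waitsome: ', time_mpi, ' s'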
We don't need IPoIB up; it was simply a convenient way to test the IB
network without working hard, so I have turned it off for the moment.
Other MPI codes (with simple sends/receives) work fine over openib and
the other BTLs.
Joe
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: land...@scalableinformatics.com
web : http://www.scalableinformatics.com
http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 866 888 3112
cell : +1 734 612 4615