An update: I recoded the mpi_waitall as a loop over the requests with
mpi_test and a 30-second timeout. The timeout fires unpredictably,
sometimes after 10 minutes of run time, other times after 15 minutes, for
the exact same case.
After 30 seconds, I print out the status of all outstanding rec
Problem solved. I did not configure with --with-mxm=/opt/mellanox/mxm, and
this location was not auto-detected. Once I rebuilt with this option,
everything worked fine. It scaled better than MVAPICH out to 800. The MVAPICH
configure log showed that it had found this component of the OFED stack.
Ed
> If