I'm trying to test some new nodes with ConnectX adaptors, and so far IMB
is the one program I can't get to run on them.

The same binary runs on the same cluster using TCP, or using PSM on some
other IB nodes. A rebuilt PMB and various existing binaries work with
openib on the ConnectX nodes when run in exactly the same way as IMB, so
this seems to be something specific to IMB with openib.

It seems rather bizarre. A web search turned up no hints, and I have no
idea how to debug why it fails even to attempt the openib BTL in this
case. I can't get any openib-related information using the obvious MCA
verbosity flags. Can anyone make suggestions?
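For concreteness, the kind of debugging run meant here can be put
together as in the sketch below. It is only an illustration: the process
count, the hostfile name `hosts` and the binary path `./IMB-MPI1` are
placeholders rather than the values from the failing job, and the
verbosity level 30 is an arbitrary choice. The helper name
`debug_command` is made up for this example.

```python
import shlex


def debug_command(np=24, hostfile="hosts", binary="./IMB-MPI1"):
    """Build an mpirun argument list that requests openib and raises BTL verbosity.

    np, hostfile and binary are placeholder assumptions, not the real job's values.
    """
    return [
        "mpirun", "-np", str(np), "--hostfile", hostfile,
        # Same BTL selection as in the failing run.
        "--mca", "btl", "openib,sm,self",
        # Ask the BTL framework to report what it does during component selection.
        "--mca", "btl_base_verbose", "30",
        # Show every help/error message instead of the aggregated summary.
        "--mca", "orte_base_help_aggregate", "0",
        binary,
    ]


# Print the command as a single shell-quoted line so it can be pasted into a terminal.
print(shlex.join(debug_command()))
```

Printing the command rather than executing it keeps the sketch harmless
on a machine without Open MPI installed.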
I'm using gcc-compiled OMPI 1.4.3 and the current RedHat 5 OFED with IMB
3.2.2, specifying `btl openib,sm,self' (or `mtl psm' on the Qlogic
nodes). I'm not sure what else might be relevant. The output from
trying to run IMB follows, for what it's worth.
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[25307,1],2]) is on host: lvgig116
Process 2 ([[25307,1],12]) is on host: lvgig117
BTLs attempted: self sm
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
PML add procs failed
--> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** The MPI_Init_thread() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[lvgig116:8052] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
*** The MPI_Init_thread() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
...
[lvgig116:07931] 19 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
[lvgig116:07931] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[lvgig116:07931] 19 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
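
The telling line in the output above is "BTLs attempted: self sm":
openib was requested but never appears. As a rough sketch, the check
below pulls that line out of a saved log and reports which requested
BTLs are missing. The function name `missing_btls` is made up for
illustration, and the sample text is just the relevant lines copied
from the output above.

```python
import re


def missing_btls(log_text, requested):
    """Return the requested BTLs that never appear on a 'BTLs attempted:' line."""
    attempted = set()
    for match in re.finditer(r"BTLs attempted:\s*(.*)", log_text):
        # The rest of the line is a space-separated list of component names.
        attempted.update(match.group(1).split())
    return [name for name in requested.split(",") if name not in attempted]


sample = """\
Process 1 ([[25307,1],2]) is on host: lvgig116
Process 2 ([[25307,1],12]) is on host: lvgig117
BTLs attempted: self sm
"""

print(missing_btls(sample, "openib,sm,self"))  # prints ['openib']
```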