I'm trying to test some new nodes with ConnectX adaptors, and failing to get IMB (so far, just IMB) to run on them.
The binary runs on the same cluster using TCP, or using PSM on some other IB nodes. A rebuilt PMB and various existing binaries work with openib on the ConnectX nodes running it exactly the same way as IMB. I.e. this seems to be something specific to IMB and openib.

It seems rather bizarre, and I have no idea how to debug it in the absence of hints from a web search, i.e. why has it failed to attempt the openib BTL in this case. I can't get any openib-related information using obvious MCA verbosity flags. Can anyone make suggestions?

I'm using gcc-compiled OMPI 1.4.3 and the current RedHat 5 OFED with IMB 3.2.2, specifying `btl openib,sm,self' (or `mtl psm' on the Qlogic nodes). I'm not sure what else might be relevant. The output from trying to run IMB follows, for what it's worth.

--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[25307,1],2]) is on host: lvgig116
  Process 2 ([[25307,1],12]) is on host: lvgig117
  BTLs attempted: self sm

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** The MPI_Init_thread() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[lvgig116:8052] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
*** The MPI_Init_thread() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
...
[lvgig116:07931] 19 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
[lvgig116:07931] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[lvgig116:07931] 19 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure