I'm trying to test some new nodes with ConnectX adaptors, and failing to
get IMB (so far only IMB) to run on them.

The same binary runs on the same cluster using TCP, or using PSM on some
other IB nodes.  A rebuilt PMB and various existing binaries work with
openib on the ConnectX nodes when run in exactly the same way as IMB.
So this seems to be something specific to the combination of IMB and
openib.

It seems rather bizarre, and a web search turned up no hints, so I have
no idea how to debug it, i.e. how to find out why it didn't even attempt
the openib BTL in this case.  I can't get any openib-related information
using the obvious MCA verbosity flags.  Can anyone make suggestions?

I'm using gcc-compiled OMPI 1.4.3 and the current RedHat 5 OFED with IMB
3.2.2, specifying `btl openib,sm,self' (or `mtl psm' on the QLogic
nodes).  I'm not sure what else might be relevant.  The output from
trying to run IMB follows, for what it's worth.

  --------------------------------------------------------------------------
  At least one pair of MPI processes are unable to reach each other for
  MPI communications.  This means that no Open MPI device has indicated
  that it can be used to communicate between these processes.  This is
  an error; Open MPI requires that all MPI processes be able to reach
  each other.  This error can sometimes be the result of forgetting to
  specify the "self" BTL.

    Process 1 ([[25307,1],2]) is on host: lvgig116
    Process 2 ([[25307,1],12]) is on host: lvgig117
    BTLs attempted: self sm

  Your MPI job is now going to abort; sorry.
  --------------------------------------------------------------------------
  --------------------------------------------------------------------------
  It looks like MPI_INIT failed for some reason; your parallel process is
  likely to abort.  There are many reasons that a parallel process can
  fail during MPI_INIT; some of which are due to configuration or environment
  problems.  This failure appears to be an internal failure; here's some
  additional information (which may only be relevant to an Open MPI
  developer):

    PML add procs failed
    --> Returned "Unreachable" (-12) instead of "Success" (0)
  --------------------------------------------------------------------------
  *** The MPI_Init_thread() function was called before MPI_INIT was invoked.
  *** This is disallowed by the MPI standard.
  *** Your MPI job will now abort.
  [lvgig116:8052] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
  *** The MPI_Init_thread() function was called before MPI_INIT was invoked.
  *** This is disallowed by the MPI standard.
  *** Your MPI job will now abort.

  ...

  [lvgig116:07931] 19 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
  [lvgig116:07931] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
  [lvgig116:07931] 19 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
