On Monday, March 21, 2011 12:25:37 pm Dave Love wrote:
> I'm trying to test some new nodes with ConnectX adaptors, and failing to
> get (so far just) IMB to run on them.
...
> I'm using gcc-compiled OMPI 1.4.3 and the current RedHat 5 OFED with IMB
> 3.2.2, specifying `btl openib,sm,self' (or `mtl psm' on the Qlogic
> nodes).  I'm not sure what else might be relevant.  The output from
> trying to run IMB follows, for what it's worth.
>
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for MPI
> communications.  This means that no Open MPI device has indicated that it
> can be used to communicate between these processes.  This is an error;
> Open MPI requires that all MPI processes be able to reach each other.
> This error can sometimes be the result of forgetting to specify the "self"
> BTL.
>
>   Process 1 ([[25307,1],2]) is on host: lvgig116
>   Process 2 ([[25307,1],12]) is on host: lvgig117
>   BTLs attempted: self sm
Are you sure you launched it correctly and that you have (re)built OpenMPI
against your Redhat-5 ib stack?

> Your MPI job is now going to abort; sorry.
...
> [lvgig116:07931] 19 more processes have sent help message
> help-mca-bml-r2.txt / unreachable proc
> [lvgig116:07931] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
> [lvgig116:07931] 19 more processes have sent help message help-mpi-runtime
> / mpi_init:startup:internal-failure

Seems to me that OpenMPI gave up because it didn't succeed in initializing
any inter-node btl/mtl. I'd suggest you try (roughly in order):

1) ibstat on all nodes to verify that your ib interfaces are up
2) try a verbs level test (like ib_write_bw) to verify data can flow
3) make sure your OpenMPI was built with the redhat libibverbs-devel
   present (=> a suitable openib btl is built).

/Peter
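For reference, a launch along the lines Dave describes might look roughly
like the following (the hostfile name, process count and IMB binary path
here are only placeholders, not taken from the original report):

  # explicitly select the openib, sm and self BTLs on the ConnectX nodes
  mpirun --mca btl openib,sm,self -np 16 -hostfile ./hosts ./IMB-MPI1

  # on the QLogic nodes, use the PSM MTL instead
  mpirun --mca mtl psm -np 16 -hostfile ./hosts ./IMB-MPI1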
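And a rough sketch of the three checks suggested above, using the standard
OFED/perftest and Open MPI tools (host names are just the ones from the
error output; adjust as needed):

  # 1) on every node: ports should show State: Active, Physical state: LinkUp
  ibstat

  # 2) verbs-level bandwidth test between two nodes
  ib_write_bw                # on the server node, e.g. lvgig117
  ib_write_bw lvgig117       # on the client node, pointing at the server

  # 3) check that this OpenMPI build actually contains the openib btl
  ompi_info | grep openib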