On Wed, May 06, 2009 at 12:15:19PM -0400, Ken Cain wrote:
> I am trying to run NetPIPE-3.7.1 NPmpi using Open MPI version 1.3.2 with
> the openib btl in an OFED-1.4 environment. The system environment is two
> Linux (2.6.27) ppc64 blades, each with one Chelsio RNIC device,
> interconnected by a 10GbE switch. The problem is that I cannot (using
> Open MPI) establish connections between the two MPI ranks.
>
> I have already read the OMPI FAQ entries and searched for similar
> problems reported to this email list without success. I do have a
> compressed config.log that I can provide separately (it is 80KB in size
> so I'll spare everyone here). I also have the output of ompi_info --all
> that I can share.
>
> I can successfully run small diagnostic programs such as rping,
> ib_rdma_bw, ib_rdma_lat, etc. between the same two blades. I can also
> run NPmpi using another MPI library (MVAPICH2) and the Chelsio/iWARP
> interface.
>
> Here is one example mpirun command line I used:
> mpirun --mca orte_base_help_aggregate 0 --mca btl openib,self --hostfile
> ~/1usrv_ompi_machfile -np 2 ./NPmpi -p0 -l 1 -u 1024 > outfile1 2>&1
>
> and its output:
>> --------------------------------------------------------------------------
>> No OpenFabrics connection schemes reported that they were able to be
>> used on a specific port. As such, the openib BTL (OpenFabrics
>> support) will be disabled for this port.
>>
>> Local host: aae1
>> Local device: cxgb3_0
>> CPCs attempted: oob, xoob, rdmacm
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> No OpenFabrics connection schemes reported that they were able to be
>> used on a specific port. As such, the openib BTL (OpenFabrics
>> support) will be disabled for this port.
>>
>> Local host: aae4
>> Local device: cxgb3_0
>> CPCs attempted: oob, xoob, rdmacm
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications. This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes. This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other. This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>>
>> Process 1 ([[3115,1],0]) is on host: aae4
>> Process 2 ([[3115,1],1]) is on host: aae1
>> BTLs attempted: self
>>
>> Your MPI job is now going to abort; sorry.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications. This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes. This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other. This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>>
>> Process 1 ([[3115,1],1]) is on host: aae1
>> Process 2 ([[3115,1],0]) is on host: aae4
>> BTLs attempted: self
>>
>> Your MPI job is now going to abort; sorry.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems. This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>> PML add procs failed
>> --> Returned "Unreachable" (-12) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems. This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>> PML add procs failed
>> --> Returned "Unreachable" (-12) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init
>> *** before MPI was initialized
>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> *** before MPI was initialized
>> [aae1:6598] Abort before MPI_INIT completed successfully; not able to
>> guarantee that all other processes were killed!
>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> [aae4:19434] Abort before MPI_INIT completed successfully; not able to
>> guarantee that all other processes were killed!
>> --------------------------------------------------------------------------
>> mpirun has exited due to process rank 0 with PID 19434 on
>> node aae4 exiting without calling "finalize". This may
>> have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>> --------------------------------------------------------------------------
>
>
> Here is another mpirun command I used (adding verbosity and more
> specific btl parameters):
>
> mpirun --mca orte_base_help_aggregate 0 --mca btl openib,self,sm --mca
> btl_base_verbose 10 --mca btl_openib_verbose 10 --mca
> btl_openib_if_include cxgb3_0:1 --mca btl_openib_cpc_include rdmacm
> --mca btl_openib_device_type iwarp --mca btl_openib_max_btls 1 --mca
> mpi_leave_pinned 1 --hostfile ~/1usrv_ompi_machfile -np 2 ./NPmpi -p0 -l
> 1 -u 1024 > ~/outfile2 2>&1
It looks like you are only using one port on the Chelsio RNIC, and based on the messages above it may be the wrong one. Is there a reason why you are excluding the other port? Also, you might try the TCP BTL and verify that it works correctly in this test case, as a point of reference. Example commands for both checks are sketched at the end of this message, below the quoted output.

Thanks,
Jon

>
> and its output:
>> [aae4:19426] mca: base: components_open: Looking for btl components
>> [aae4:19426] mca: base: components_open: opening btl components
>> [aae4:19426] mca: base: components_open: found loaded component openib
>> [aae4:19426] mca: base: components_open: component openib has no register
>> function
>> [aae4:19426] mca: base: components_open: component openib open function
>> successful
>> [aae4:19426] mca: base: components_open: found loaded component self
>> [aae4:19426] mca: base: components_open: component self has no register
>> function
>> [aae4:19426] mca: base: components_open: component self open function
>> successful
>> [aae4:19426] mca: base: components_open: found loaded component sm
>> [aae4:19426] mca: base: components_open: component sm has no register
>> function
>> [aae4:19426] mca: base: components_open: component sm open function
>> successful
>> [aae1:06503] mca: base: components_open: Looking for btl components
>> [aae1:06503] mca: base: components_open: opening btl components
>> [aae1:06503] mca: base: components_open: found loaded component openib
>> [aae1:06503] mca: base: components_open: component openib has no register
>> function
>> [aae1:06503] mca: base: components_open: component openib open function
>> successful
>> [aae1:06503] mca: base: components_open: found loaded component self
>> [aae1:06503] mca: base: components_open: component self has no register
>> function
>> [aae1:06503] mca: base: components_open: component self open function
>> successful
>> [aae1:06503] mca: base: components_open: found loaded component sm
>> [aae1:06503] mca: base: components_open: component sm has no register
>> function
>> [aae1:06503] mca: base: components_open: component sm open function
>> successful
>> [aae4:19426] select: initializing btl component openib
>> [aae4][[3107,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying
>> INI files for vendor 0x1425, part ID 49
>> [aae4][[3107,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found
>> corresponding INI values: Chelsio T3
>> [aae4][[3107,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying
>> INI files for vendor 0x0000, part ID 0
>> [aae4][[3107,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found
>> corresponding INI values: default
>> [aae4:19426] openib BTL: rdmacm CPC available for use on cxgb3_0
>> [aae4:19426] select: init of component openib returned success
>> [aae4:19426] select: initializing btl component self
>> [aae4:19426] select: init of component self returned success
>> [aae4:19426] select: initializing btl component sm
>> [aae4:19426] select: init of component sm returned success
>> [aae1:06503] select: initializing btl component openib
>> [aae1][[3107,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying
>> INI files for vendor 0x1425, part ID 49
>> [aae1][[3107,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found
>> corresponding INI values: Chelsio T3
>> [aae1][[3107,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying
>> INI files for vendor 0x0000, part ID 0
>> [aae1][[3107,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found
>> corresponding INI values: default
>> [aae1:06503] openib BTL: rdmacm CPC available for use on cxgb3_0
>> [aae1:06503] select: init of component openib returned success
>> [aae1:06503] select: initializing btl component self
>> [aae1:06503] select: init of component self returned success
>> [aae1:06503] select: initializing btl component sm
>> [aae1:06503] select: init of component sm returned success
>> --------------------------------------------------------------------------
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications. This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes. This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other. This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>>
>> Process 1 ([[3107,1],0]) is on host: aae4
>> Process 2 ([[3107,1],1]) is on host: aae1
>> BTLs attempted: openib self sm
>>
>> Your MPI job is now going to abort; sorry.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications. This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes. This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other. This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>>
>> Process 1 ([[3107,1],1]) is on host: aae1
>> Process 2 ([[3107,1],0]) is on host: aae4
>> BTLs attempted: openib self sm
>>
>> Your MPI job is now going to abort; sorry.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems. This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>> PML add procs failed
>> --> Returned "Unreachable" (-12) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init
>> *** before MPI was initialized
>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems. This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>> PML add procs failed
>> --> Returned "Unreachable" (-12) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init
>> *** before MPI was initialized
>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> [aae1:6503] Abort before MPI_INIT completed successfully; not able to
>> guarantee that all other processes were killed!
>> [aae4:19426] Abort before MPI_INIT completed successfully; not able to
>> guarantee that all other processes were killed!
>> --------------------------------------------------------------------------
>> mpirun has exited due to process rank 0 with PID 19426 on
>> node aae4 exiting without calling "finalize". This may
>> have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>> --------------------------------------------------------------------------
>
>
> Thanks for any advice/help you can offer.
>
>
> -Ken
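
As a point of reference for the TCP suggestion above, here is a minimal sketch of the cross-check run. It reuses the hostfile and NetPIPE arguments from the commands quoted earlier and only swaps the BTL list; the output filename is just illustrative:

  # Same NetPIPE run, but over the TCP BTL instead of openib
  mpirun --mca orte_base_help_aggregate 0 --mca btl tcp,self,sm \
      --hostfile ~/1usrv_ompi_machfile -np 2 ./NPmpi -p0 -l 1 -u 1024 > outfile_tcp 2>&1

If this completes, basic IP connectivity and the Open MPI installation are working between aae1 and aae4, which narrows the problem to the openib/iWARP side.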
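
And a hedged sketch of the port question: ibv_devinfo (a standard OFED utility) should show which cxgb3_0 port is in state PORT_ACTIVE, and dropping the ":1" qualifier from btl_openib_if_include lets the openib BTL consider every active port on the device (alternatively, name the other port explicitly, e.g. cxgb3_0:2, if the RNIC exposes a second port). Again, the output filename is only illustrative:

  # On each blade: confirm which port(s) of the RNIC are up (look for PORT_ACTIVE)
  ibv_devinfo -d cxgb3_0

  # Retry without pinning the openib BTL to port 1 of the device
  mpirun --mca orte_base_help_aggregate 0 --mca btl openib,self,sm \
      --mca btl_openib_cpc_include rdmacm \
      --mca btl_openib_if_include cxgb3_0 \
      --hostfile ~/1usrv_ompi_machfile -np 2 ./NPmpi -p0 -l 1 -u 1024 > outfile_allports 2>&1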