On Wed, May 06, 2009 at 12:15:19PM -0400, Ken Cain wrote:
> I am trying to run NetPIPE-3.7.1 NPmpi using Open MPI version 1.3.2 with  
> the openib btl in an OFED-1.4 environment. The system environment is two  
> Linux (2.6.27) ppc64 blades, each with one Chelsio RNIC device,  
> interconnected by a 10GbE switch. The problem is that I cannot (using  
> Open MPI) establish connections between the two MPI ranks.
>
> I have already read the OMPI FAQ entries and searched for similar  
> problems reported to this email list without success. I do have a  
> compressed config.log that I can provide separately (it is 80KB in size  
> so I'll spare everyone here). I also have the output of ompi_info --all  
> that I can share.
>
> I can successfully run small diagnostic programs such as rping,  
> ib_rdma_bw, ib_rdma_lat, etc. between the same two blades. I can also  
> run NPmpi using another MPI library (MVAPICH2) and the Chelsio/iWARP  
> interface.
>
> Here is one example mpirun command line I used:
> mpirun --mca orte_base_help_aggregate 0 --mca btl openib,self --hostfile  
> ~/1usrv_ompi_machfile -np 2 ./NPmpi -p0 -l 1 -u 1024 > outfile1 2>&1
>
> and its output:
>> --------------------------------------------------------------------------
>> No OpenFabrics connection schemes reported that they were able to be
>> used on a specific port.  As such, the openib BTL (OpenFabrics
>> support) will be disabled for this port.
>>
>>   Local host:           aae1
>>   Local device:         cxgb3_0
>>   CPCs attempted:       oob, xoob, rdmacm
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> No OpenFabrics connection schemes reported that they were able to be
>> used on a specific port.  As such, the openib BTL (OpenFabrics
>> support) will be disabled for this port.
>>
>>   Local host:           aae4
>>   Local device:         cxgb3_0
>>   CPCs attempted:       oob, xoob, rdmacm
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications.  This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes.  This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other.  This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>>
>>   Process 1 ([[3115,1],0]) is on host: aae4
>>   Process 2 ([[3115,1],1]) is on host: aae1
>>   BTLs attempted: self
>>
>> Your MPI job is now going to abort; sorry.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications.  This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes.  This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other.  This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>>
>>   Process 1 ([[3115,1],1]) is on host: aae1
>>   Process 2 ([[3115,1],0]) is on host: aae4
>>   BTLs attempted: self
>>
>> Your MPI job is now going to abort; sorry.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems.  This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>>   PML add procs failed
>>   --> Returned "Unreachable" (-12) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems.  This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>>   PML add procs failed
>>   --> Returned "Unreachable" (-12) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init
>> *** before MPI was initialized
>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> *** before MPI was initialized
>> [aae1:6598] Abort before MPI_INIT completed successfully; not able to 
>> guarantee that all other processes were killed!
>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> [aae4:19434] Abort before MPI_INIT completed successfully; not able to 
>> guarantee that all other processes were killed!
>> --------------------------------------------------------------------------
>> mpirun has exited due to process rank 0 with PID 19434 on
>> node aae4 exiting without calling "finalize". This may
>> have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>> --------------------------------------------------------------------------
>
>
>
> Here is another mpirun command I used (adding verbosity and more
> specific btl parameters):
>
> mpirun --mca orte_base_help_aggregate 0 --mca btl openib,self,sm --mca  
> btl_base_verbose 10 --mca btl_openib_verbose 10 --mca  
> btl_openib_if_include cxgb3_0:1 --mca btl_openib_cpc_include rdmacm  
> --mca btl_openib_device_type iwarp --mca btl_openib_max_btls 1 --mca  
> mpi_leave_pinned 1 --hostfile ~/1usrv_ompi_machfile -np 2 ./NPmpi -p0 -l  
> 1 -u 1024 > ~/outfile2 2>&1

It looks like you are only using one port on the Chelsio RNIC.  Based on
the messages above, it looks like it might be the wrong port.  Is there
a reason why you are excluding the other port?  Also, you might try the
TCP btl and verify that it works correctly in this test case (as a point
of reference).
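
For example (just a sketch -- I'm guessing at the IP interface name
here, so substitute whatever interface actually carries traffic between
the blades):

  mpirun --mca btl tcp,self,sm --mca btl_tcp_if_include eth0 \
      --hostfile ~/1usrv_ompi_machfile -np 2 ./NPmpi -p0 -l 1 -u 1024

If that works, it at least confirms basic reachability between the two
nodes and narrows things down to the openib BTL / CPC selection.  You
can also check which port on the RNIC is actually active with:

  ibv_devinfo -d cxgb3_0

and, if both ports show up, try including the whole device rather than
a single port (e.g. --mca btl_openib_if_include cxgb3_0).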

Thanks,
Jon

>
> and its output:
>> [aae4:19426] mca: base: components_open: Looking for btl components
>> [aae4:19426] mca: base: components_open: opening btl components
>> [aae4:19426] mca: base: components_open: found loaded component openib
>> [aae4:19426] mca: base: components_open: component openib has no register 
>> function
>> [aae4:19426] mca: base: components_open: component openib open function 
>> successful
>> [aae4:19426] mca: base: components_open: found loaded component self
>> [aae4:19426] mca: base: components_open: component self has no register 
>> function
>> [aae4:19426] mca: base: components_open: component self open function 
>> successful
>> [aae4:19426] mca: base: components_open: found loaded component sm
>> [aae4:19426] mca: base: components_open: component sm has no register 
>> function
>> [aae4:19426] mca: base: components_open: component sm open function 
>> successful
>> [aae1:06503] mca: base: components_open: Looking for btl components
>> [aae1:06503] mca: base: components_open: opening btl components
>> [aae1:06503] mca: base: components_open: found loaded component openib
>> [aae1:06503] mca: base: components_open: component openib has no register 
>> function
>> [aae1:06503] mca: base: components_open: component openib open function 
>> successful
>> [aae1:06503] mca: base: components_open: found loaded component self
>> [aae1:06503] mca: base: components_open: component self has no register 
>> function
>> [aae1:06503] mca: base: components_open: component self open function 
>> successful
>> [aae1:06503] mca: base: components_open: found loaded component sm
>> [aae1:06503] mca: base: components_open: component sm has no register 
>> function
>> [aae1:06503] mca: base: components_open: component sm open function 
>> successful
>> [aae4:19426] select: initializing btl component openib
>> [aae4][[3107,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying 
>> INI files for vendor 0x1425, part ID 49
>> [aae4][[3107,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found 
>> corresponding INI values: Chelsio T3
>> [aae4][[3107,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying 
>> INI files for vendor 0x0000, part ID 0
>> [aae4][[3107,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found 
>> corresponding INI values: default
>> [aae4:19426] openib BTL: rdmacm CPC available for use on cxgb3_0
>> [aae4:19426] select: init of component openib returned success
>> [aae4:19426] select: initializing btl component self
>> [aae4:19426] select: init of component self returned success
>> [aae4:19426] select: initializing btl component sm
>> [aae4:19426] select: init of component sm returned success
>> [aae1:06503] select: initializing btl component openib
>> [aae1][[3107,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying 
>> INI files for vendor 0x1425, part ID 49
>> [aae1][[3107,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found 
>> corresponding INI values: Chelsio T3
>> [aae1][[3107,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying 
>> INI files for vendor 0x0000, part ID 0
>> [aae1][[3107,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found 
>> corresponding INI values: default
>> [aae1:06503] openib BTL: rdmacm CPC available for use on cxgb3_0
>> [aae1:06503] select: init of component openib returned success
>> [aae1:06503] select: initializing btl component self
>> [aae1:06503] select: init of component self returned success
>> [aae1:06503] select: initializing btl component sm
>> [aae1:06503] select: init of component sm returned success
>> --------------------------------------------------------------------------
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications.  This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes.  This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other.  This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>>
>>   Process 1 ([[3107,1],0]) is on host: aae4
>>   Process 2 ([[3107,1],1]) is on host: aae1
>>   BTLs attempted: openib self sm
>>
>> Your MPI job is now going to abort; sorry.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications.  This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes.  This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other.  This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>>
>>   Process 1 ([[3107,1],1]) is on host: aae1
>>   Process 2 ([[3107,1],0]) is on host: aae4
>>   BTLs attempted: openib self sm
>>
>> Your MPI job is now going to abort; sorry.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems.  This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>>   PML add procs failed
>>   --> Returned "Unreachable" (-12) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init
>> *** before MPI was initialized
>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems.  This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>>   PML add procs failed
>>   --> Returned "Unreachable" (-12) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init
>> *** before MPI was initialized
>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> [aae1:6503] Abort before MPI_INIT completed successfully; not able to 
>> guarantee that all other processes were killed!
>> [aae4:19426] Abort before MPI_INIT completed successfully; not able to 
>> guarantee that all other processes were killed!
>> --------------------------------------------------------------------------
>> mpirun has exited due to process rank 0 with PID 19426 on
>> node aae4 exiting without calling "finalize". This may
>> have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>> --------------------------------------------------------------------------
>
>
>
> Thanks for any advice/help you can offer.
>
>
> -Ken
>
> This message is intended only for the designated recipient(s) and may
> contain confidential or proprietary information of Mercury Computer
> Systems, Inc. This message is solely intended to facilitate business
> discussions and does not constitute an express or implied offer to sell
> or purchase any products, services, or support. Any commitments must be
> made in writing and signed by duly authorized representatives of each
> party.
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
