On Mon, Dec 03, 2007 at 02:44:37PM -0600, Jon Mason wrote:
> I'm seeing a crash in the openib btl on ompi-trunk when running any
> tests (whether running my own programs or generic ones).  For example,
> when running IMB pingpong I get the following:
> 
> $ mpirun --n 2 --host vic12,vic20 -mca btl openib,self
> # /usr/mpi/gcc/openmpi-trunk/tests/IMB-2.3/IMB-MPI1 pingpong
> --------------------------------------------------------------------------
> WARNING: No HCA parameters were found for the HCA that Open MPI
> detected:
> 
>     Hostname:           vic20
>     HCA vendor ID:      0x1425
>     HCA vendor part ID: 48
> 
> Default HCA parameters will be used, which may result in lower
> performance.  You can edit any of the files specified by the
> btl_openib_hca_param_files MCA parameter to set values for your HCA.
> 
> NOTE: You can turn off this warning by setting the MCA parameter
>       btl_openib_warn_no_hca_params_found to 0.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> WARNING: No HCA parameters were found for the HCA that Open MPI
> detected:
> 
>     Hostname:           vic12
>     HCA vendor ID:      0x1425
>     HCA vendor part ID: 48
> 
> Default HCA parameters will be used, which may result in lower
> performance.  You can edit any of the files specified by the
> btl_openib_hca_param_files MCA parameter to set values for your HCA.
> 
> NOTE: You can turn off this warning by setting the MCA parameter
>       btl_openib_warn_no_hca_params_found to 0.
> --------------------------------------------------------------------------
> [vic20:04339] *** Process received signal ***
> [vic12:04539] *** Process received signal ***
> [vic12:04539] Signal: Segmentation fault (11)
> [vic12:04539] Signal code: Address not mapped (1)
> [vic12:04539] Failing at address: 0xffffffffffffffea
> [vic20:04339] Signal: Segmentation fault (11)
> [vic20:04339] Signal code: Address not mapped (1)
> [vic20:04339] Failing at address: 0xffffffffffffffea
> [vic20:04339] [ 0] /lib64/libpthread.so.0 [0x35db80dd40]
> [vic20:04339] [ 1] /usr/lib64/libibverbs.so.1(ibv_create_srq+0x3e)
> [0x32b7e083be]
> [vic20:04339] [ 2]
> /usr/mpi/gcc/openmpi-trunk/lib/openmpi/mca_btl_openib.so
> [0x2aaaaf0bdc27]
> [vic20:04339] [ 3]
> /usr/mpi/gcc/openmpi-trunk/lib/openmpi/mca_btl_openib.so
> [0x2aaaaf0be07e]
> [vic20:04339] [ 4]
> /usr/mpi/gcc/openmpi-trunk/lib/openmpi/mca_btl_openib.so(mca_btl_openib_add_procs+0x857)
> [0x2aaaaf0bd97c]
> [vic20:04339] [ 5]
> /usr/mpi/gcc/openmpi-trunk/lib/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x37d)
> [0x2aaaaeeb399e]
> [vic20:04339] [ 6]
> /usr/mpi/gcc/openmpi-trunk/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0x15c)
> [0x2aaaaec9036b]
> [vic20:04339] [ 7]
> /usr/mpi/gcc/openmpi-trunk/lib64/libmpi.so.0(ompi_mpi_init+0xb2b)
> [0x2aaaaab03817]
> [vic20:04339] [ 8]
> /usr/mpi/gcc/openmpi-trunk/lib64/libmpi.so.0(MPI_Init+0x15d)
> [0x2aaaaab44dc9]
> [vic20:04339] [ 9]
> /usr/mpi/gcc/openmpi-trunk/tests/IMB-2.3/IMB-MPI1(main+0x29) [0x402df9]
> [vic20:04339] [10] /lib64/libc.so.6(__libc_start_main+0xf4)
> [0x35dac1d8a4]
> [vic20:04339] [11] /usr/mpi/gcc/openmpi-trunk/tests/IMB-2.3/IMB-MPI1
> [0x402d39]
> [vic20:04339] *** End of error message ***
> [vic12:04539] [ 0] /lib64/libpthread.so.0 [0x3a7dc0dd40]
> [vic12:04539] [ 1] /usr/lib64/libibverbs.so.1(ibv_create_srq+0x3e)
> [0x3e82e083be]
> [vic12:04539] [ 2]
> /usr/mpi/gcc/openmpi-trunk/lib/openmpi/mca_btl_openib.so
> [0x2aaaaf0bdc27]
> [vic12:04539] [ 3]
> /usr/mpi/gcc/openmpi-trunk/lib/openmpi/mca_btl_openib.so
> [0x2aaaaf0be07e]
> [vic12:04539] [ 4]
> /usr/mpi/gcc/openmpi-trunk/lib/openmpi/mca_btl_openib.so(mca_btl_openib_add_procs+0x857)
> [0x2aaaaf0bd97c]
> [vic12:04539] [ 5]
> /usr/mpi/gcc/openmpi-trunk/lib/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x37d)
> [0x2aaaaeeb399e]
> [vic12:04539] [ 6]
> /usr/mpi/gcc/openmpi-trunk/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0x15c)
> [0x2aaaaec9036b]
> [vic12:04539] [ 7]
> /usr/mpi/gcc/openmpi-trunk/lib64/libmpi.so.0(ompi_mpi_init+0xb2b)
> [0x2aaaaab03817]
> [vic12:04539] [ 8]
> /usr/mpi/gcc/openmpi-trunk/lib64/libmpi.so.0(MPI_Init+0x15d)
> [0x2aaaaab44dc9]
> [vic12:04539] [ 9]
> /usr/mpi/gcc/openmpi-trunk/tests/IMB-2.3/IMB-MPI1(main+0x29) [0x402df9]
> [vic12:04539] [10] /lib64/libc.so.6(__libc_start_main+0xf4)
> [0x3a7d01d8a4]
> [vic12:04539] [11] /usr/mpi/gcc/openmpi-trunk/tests/IMB-2.3/IMB-MPI1
> [0x402d39]
> [vic12:04539] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 1 with PID 4339 on
> node vic20 calling "abort". This will have caused other processes
> in the application to be terminated by signals sent by mpirun
> (as reported here).
> --------------------------------------------------------------------------
> 
> I am not having any problems running this test with the openib btl on
> the ompi-1.2 branch, and I can run this test successfully with the udapl
> and tcp btls on ompi-trunk.  Is anyone else seeing this problem?

To answer my own question (with help from Jeff): the problem is that OMPI
trunk now tries to use the iWARP interfaces, which do not currently work
with the openib BTL.  Previous versions only tried the IB interfaces, which
do work.  Once I restricted the run to the IB interface only, the test
completed successfully.
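
In case it is useful to anyone else hitting this: before picking a name to
pass to btl_openib_if_include, you can list the RDMA devices a node exposes
(and their transport types, IB vs. iWARP) with ibv_devinfo from the
libibverbs utilities.  The device names below are only illustrative
examples, not output from my machines:

$ ibv_devinfo | grep -E 'hca_id|transport'
hca_id: mthca0
        transport:                      InfiniBand (0)
hca_id: cxgb3_0
        transport:                      iWARP (1)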

For the lazy,

# mpirun --n 2 --host vic12,vic20 -mca btl openib,self \
    --mca btl_openib_if_include mthca0 \
    /usr/mpi/gcc/openmpi-trunk/tests/IMB-2.3/IMB-MPI1 pingpong
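
If you don't want to type the extra parameter on every run, the same
settings can also go into an MCA parameter file.  A minimal sketch, assuming
the per-user $HOME/.openmpi/mca-params.conf location that Open MPI reads at
startup:

# $HOME/.openmpi/mca-params.conf
btl = openib,self
btl_openib_if_include = mthca0

With that in place, a plain "mpirun --n 2 --host vic12,vic20 ..." picks up
the same interface restriction without any extra command-line flags.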

Much thanks to Jeff,
Jon

> 
> Thanks,
> Jon
