On Mon, Dec 03, 2007 at 02:44:37PM -0600, Jon Mason wrote:
> I'm seeing a crash in the openib btl on ompi-trunk when running any
> tests (whether running my own programs or generic ones). For example,
> when running IMB pingpong I get the following:
>
> $ mpirun --n 2 --host vic12,vic20 -mca btl openib,self
> # /usr/mpi/gcc/openmpi-trunk/tests/IMB-2.3/IMB-MPI1 pingpong
> --------------------------------------------------------------------------
> WARNING: No HCA parameters were found for the HCA that Open MPI
> detected:
>
> Hostname: vic20
> HCA vendor ID: 0x1425
> HCA vendor part ID: 48
>
> Default HCA parameters will be used, which may result in lower
> performance. You can edit any of the files specified by the
> btl_openib_hca_param_files MCA parameter to set values for your HCA.
>
> NOTE: You can turn off this warning by setting the MCA parameter
> btl_openib_warn_no_hca_params_found to 0.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> WARNING: No HCA parameters were found for the HCA that Open MPI
> detected:
>
> Hostname: vic12
> HCA vendor ID: 0x1425
> HCA vendor part ID: 48
>
> Default HCA parameters will be used, which may result in lower
> performance. You can edit any of the files specified by the
> btl_openib_hca_param_files MCA parameter to set values for your HCA.
>
> NOTE: You can turn off this warning by setting the MCA parameter
> btl_openib_warn_no_hca_params_found to 0.
> --------------------------------------------------------------------------
> [vic20:04339] *** Process received signal ***
> [vic12:04539] *** Process received signal ***
> [vic12:04539] Signal: Segmentation fault (11)
> [vic12:04539] Signal code: Address not mapped (1)
> [vic12:04539] Failing at address: 0xffffffffffffffea
> [vic20:04339] Signal: Segmentation fault (11)
> [vic20:04339] Signal code: Address not mapped (1)
> [vic20:04339] Failing at address: 0xffffffffffffffea
> [vic20:04339] [ 0] /lib64/libpthread.so.0 [0x35db80dd40]
> [vic20:04339] [ 1] /usr/lib64/libibverbs.so.1(ibv_create_srq+0x3e) [0x32b7e083be]
> [vic20:04339] [ 2] /usr/mpi/gcc/openmpi-trunk/lib/openmpi/mca_btl_openib.so [0x2aaaaf0bdc27]
> [vic20:04339] [ 3] /usr/mpi/gcc/openmpi-trunk/lib/openmpi/mca_btl_openib.so [0x2aaaaf0be07e]
> [vic20:04339] [ 4] /usr/mpi/gcc/openmpi-trunk/lib/openmpi/mca_btl_openib.so(mca_btl_openib_add_procs+0x857) [0x2aaaaf0bd97c]
> [vic20:04339] [ 5] /usr/mpi/gcc/openmpi-trunk/lib/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x37d) [0x2aaaaeeb399e]
> [vic20:04339] [ 6] /usr/mpi/gcc/openmpi-trunk/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0x15c) [0x2aaaaec9036b]
> [vic20:04339] [ 7] /usr/mpi/gcc/openmpi-trunk/lib64/libmpi.so.0(ompi_mpi_init+0xb2b) [0x2aaaaab03817]
> [vic20:04339] [ 8] /usr/mpi/gcc/openmpi-trunk/lib64/libmpi.so.0(MPI_Init+0x15d) [0x2aaaaab44dc9]
> [vic20:04339] [ 9] /usr/mpi/gcc/openmpi-trunk/tests/IMB-2.3/IMB-MPI1(main+0x29) [0x402df9]
> [vic20:04339] [10] /lib64/libc.so.6(__libc_start_main+0xf4) [0x35dac1d8a4]
> [vic20:04339] [11] /usr/mpi/gcc/openmpi-trunk/tests/IMB-2.3/IMB-MPI1 [0x402d39]
> [vic20:04339] *** End of error message ***
> [vic12:04539] [ 0] /lib64/libpthread.so.0 [0x3a7dc0dd40]
> [vic12:04539] [ 1] /usr/lib64/libibverbs.so.1(ibv_create_srq+0x3e) [0x3e82e083be]
> [vic12:04539] [ 2] /usr/mpi/gcc/openmpi-trunk/lib/openmpi/mca_btl_openib.so [0x2aaaaf0bdc27]
> [vic12:04539] [ 3] /usr/mpi/gcc/openmpi-trunk/lib/openmpi/mca_btl_openib.so [0x2aaaaf0be07e]
> [vic12:04539] [ 4] /usr/mpi/gcc/openmpi-trunk/lib/openmpi/mca_btl_openib.so(mca_btl_openib_add_procs+0x857) [0x2aaaaf0bd97c]
> [vic12:04539] [ 5] /usr/mpi/gcc/openmpi-trunk/lib/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x37d) [0x2aaaaeeb399e]
> [vic12:04539] [ 6] /usr/mpi/gcc/openmpi-trunk/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0x15c) [0x2aaaaec9036b]
> [vic12:04539] [ 7] /usr/mpi/gcc/openmpi-trunk/lib64/libmpi.so.0(ompi_mpi_init+0xb2b) [0x2aaaaab03817]
> [vic12:04539] [ 8] /usr/mpi/gcc/openmpi-trunk/lib64/libmpi.so.0(MPI_Init+0x15d) [0x2aaaaab44dc9]
> [vic12:04539] [ 9] /usr/mpi/gcc/openmpi-trunk/tests/IMB-2.3/IMB-MPI1(main+0x29) [0x402df9]
> [vic12:04539] [10] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3a7d01d8a4]
> [vic12:04539] [11] /usr/mpi/gcc/openmpi-trunk/tests/IMB-2.3/IMB-MPI1 [0x402d39]
> [vic12:04539] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 1 with PID 4339 on
> node vic20 calling "abort". This will have caused other processes
> in the application to be terminated by signals sent by mpirun
> (as reported here).
> --------------------------------------------------------------------------
>
> I am not having any problems running this test with the openib btl on
> the ompi-1.2 branch, and I can run this test successfully with the udapl
> and tcp btls on ompi-trunk. Is anyone else seeing this problem?
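[Note, not from the original thread: the faulting frame in both backtraces is ibv_create_srq() in libibverbs, reached from mca_btl_openib_add_procs() during MPI_Init. A minimal, hypothetical sketch of a defensive SRQ setup against the verbs API is below: it checks whether the device advertises shared-receive-queue support before creating one, which some early iWARP providers did not. The names ctx and pd and the queue sizes are placeholders; this is not the openib BTL's actual code, and whether a missing check of this sort is the real trunk bug is only a guess.]

/* Sketch: probe SRQ support before creating a shared receive queue.
 * "ctx" and "pd" are assumed to be an already-opened device context
 * and protection domain; the sizes below are arbitrary examples. */
#include <string.h>
#include <stdio.h>
#include <infiniband/verbs.h>

static struct ibv_srq *create_srq_if_supported(struct ibv_context *ctx,
                                               struct ibv_pd *pd)
{
    struct ibv_device_attr dev_attr;
    struct ibv_srq_init_attr init_attr;
    struct ibv_srq *srq;

    /* max_srq == 0 means the device/provider has no SRQ support. */
    if (ibv_query_device(ctx, &dev_attr) != 0 || dev_attr.max_srq == 0) {
        fprintf(stderr, "no SRQ support on this device; skipping SRQ setup\n");
        return NULL;
    }

    memset(&init_attr, 0, sizeof(init_attr));
    init_attr.attr.max_wr  = 512;   /* receive work requests in the SRQ */
    init_attr.attr.max_sge = 1;     /* scatter/gather entries per WR */

    srq = ibv_create_srq(pd, &init_attr);
    if (srq == NULL) {
        perror("ibv_create_srq");   /* returns NULL and sets errno on failure */
    }
    return srq;
}

The reply below sidesteps the question entirely by not selecting the iWARP device at all.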
To answer my own question (with help from Jeff): the problem is caused by OMPI trying to use the iWARP interfaces (which currently do not work with the openib btl). Previous versions only tried the IB interfaces (which do work). When I limited my test to the IB interface only, it ran successfully. For the lazy:

# mpirun --n 2 --host vic12,vic20 -mca btl openib,self --mca btl_openib_if_include mthca0 /usr/mpi/gcc/openmpi-trunk/tests/IMB-2.3/IMB-MPI1 pingpong

Much thanks to Jeff,
Jon

> Thanks,
> Jon
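[Follow-up note, not from the original thread: btl_openib_if_include takes the libibverbs device names, which can be listed with ibv_devices or ibv_devinfo, and the setting can be made persistent so it does not have to be repeated on every mpirun command line. A sketch, assuming the per-user MCA parameter file at ~/.openmpi/mca-params.conf:]

$ ibv_devices
$ echo "btl_openib_if_include = mthca0" >> ~/.openmpi/mca-params.conf
$ mpirun --n 2 --host vic12,vic20 -mca btl openib,self /usr/mpi/gcc/openmpi-trunk/tests/IMB-2.3/IMB-MPI1 pingpong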