I've been having trouble using Open MPI on a medium-sized cluster.
This cluster has three fabrics: Gigabit Ethernet, 10G Myrinet MX, and
InfiniBand. Myrinet works great, but both IB and GigE have issues.
Using the 'openib' BTL (kernel 2.6.16.1 for drivers, openib.org RC4
userspace libraries & tools), I get the errors shown below. This example
uses the IMB benchmark, but the problem is not limited to IMB.
*********************************************************************
[root@zartan ~]# mpirun -np 90 -mca btl openib -machinefile /etc/pdsh/machines /tmp/IMB-MPI1
#----------------------------------------------------------------
# Benchmarking Reduce
# #processes = 64
# ( 26 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.04 0.04 0.04
[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 12 for wr_id 47003518529948 opcode 0
[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003518767232 opcode 0
[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003518965544 opcode 0
[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003547253820 opcode 0
[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003547286872 opcode 0
[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003547319924 opcode 0
[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003547352976 opcode 0
[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003547386028 opcode 0
[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003547419080 opcode 0
[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003547452132 opcode 0
[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003549606016 opcode 0
[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003549639068 opcode 0
**********************************************************************
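For reference, here is a small Python sketch that tallies the status codes in a saved copy of the log above. It assumes the numbers printed by the openib BTL are raw `enum ibv_wc_status` values from libibverbs, which I have not confirmed against the Open MPI source. The function name `summarize_statuses` and the short `sample` string are just for illustration.

```python
import re
from collections import Counter

# Subset of libibverbs' enum ibv_wc_status (from <infiniband/verbs.h>).
# Assumption: the BTL prints these raw values in its error message.
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    5: "IBV_WC_WR_FLUSH_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}

# Matches the "error polling HP CQ with status N" part of each log entry.
STATUS_RE = re.compile(r"error polling \w+ CQ with status (\d+)")


def summarize_statuses(log_text):
    """Return (code, name, count) tuples in order of first appearance."""
    counts = Counter(int(m.group(1)) for m in STATUS_RE.finditer(log_text))
    return [(code, IBV_WC_STATUS.get(code, "unknown"), n)
            for code, n in counts.items()]


# Three entries copied from the log above, as a stand-in for the full file.
sample = """\
error polling HP CQ with status 12 for wr_id 47003518529948 opcode 0
error polling HP CQ with status 5 for wr_id 47003518767232 opcode 0
error polling HP CQ with status 5 for wr_id 47003518965544 opcode 0
"""

for code, name, n in summarize_statuses(sample):
    print(f"status {code:>2} ({name}): {n} occurrence(s)")
```

If that mapping holds, the single status 12 (retry count exceeded) would be the first real failure, and the status 5 entries would be the remaining work requests being flushed once the queue pair went into the error state.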
With the 'tcp' BTL, I get the following errors:
[root@zartan ~]# mpirun -np 90 -mca btl tcp -machinefile /etc/pdsh/machines /tmp/IMB-MPI1
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x6
[0] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/libopal.so.0 [0x2adefe5248ca]
[1] func:/lib64/libpthread.so.0 [0x2adefeb2e380]
[2] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_btl_tcp.so(mca_btl_tcp_proc_remove+0xbb) [0x2adf018139ab]
[3] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_btl_tcp.so [0x2adf01811bec]
[4] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_btl_tcp.so(mca_btl_tcp_add_procs+0x155) [0x2adf0180f445]
[5] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x26b) [0x2adf011912db]
[6] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xcc) [0x2adf00f75d5c]
[7] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/libmpi.so.0(ompi_mpi_init+0x590) [0x2adefe295c90]
[8] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/libmpi.so.0(MPI_Init+0x83) [0x2adefe2812d3]
[9] func:/tmp/IMB-MPI1(main+0x29) [0x402eb9]
[10] func:/lib64/libc.so.6(__libc_start_main+0xdc) [0x2adefec534cc]
[11] func:/tmp/IMB-MPI1 [0x402df9]
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x6
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x6
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x6
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x6
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x6
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x6
*** End of error message ***
4 additional processes aborted (not shown)
Any thoughts or ideas on how to fix either of these problems?
--
Troy Telford