I've been having trouble using Open MPI with a medium-sized cluster:

This cluster has three fabrics: Gigabit Ethernet, 10G Myrinet MX, and InfiniBand. Myrinet works great. IB and GigE have issues:


Using the 'openib' BTL (kernel 2.6.16.1 for drivers, openib.org RC4 userspace libraries & tools). This example uses the IMB benchmark, but the problem is not limited to IMB.

*********************************************************************
[root@zartan ~]# mpirun -np 90 -mca btl openib -machinefile /etc/pdsh/machines /tmp/IMB-MPI1

#----------------------------------------------------------------
# Benchmarking Reduce
# #processes = 64
# ( 26 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.04         0.04         0.04
[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 12 for wr_id 47003518529948 opcode 0

[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003518767232 opcode 0

[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003518965544 opcode 0

[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003547253820 opcode 0

[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003547286872 opcode 0

[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003547319924 opcode 0

[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003547352976 opcode 0

[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003547386028 opcode 0

[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003547419080 opcode 0

[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003547452132 opcode 0

[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003549606016 opcode 0

[0,1,63][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47003549639068 opcode 0

**********************************************************************
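For reference, here is a minimal sketch for decoding the numeric status values in the output above. It assumes they are libibverbs `enum ibv_wc_status` codes; the output itself does not confirm this. Under that assumption, status 12 is IBV_WC_RETRY_EXC_ERR (transport retry counter exceeded) and status 5 is IBV_WC_WR_FLUSH_ERR (work request flushed, which usually follows once a queue pair has gone into the error state). The function name and the regular expression are illustrative and not part of Open MPI.

```python
import re

# Subset of libibverbs' enum ibv_wc_status (from <infiniband/verbs.h>).
# Assumption: the "status N" in the openib BTL message is one of these values.
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    5: "IBV_WC_WR_FLUSH_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}

# Matches the second half of each (wrapped) error message in the log.
STATUS_RE = re.compile(r"error polling \w+ CQ with status (\d+) for wr_id (\d+)")


def decode_status_lines(text):
    """Return a list of (status, name, wr_id) for each error line found in text."""
    results = []
    for match in STATUS_RE.finditer(text):
        status = int(match.group(1))
        name = IBV_WC_STATUS.get(status, "UNKNOWN")
        results.append((status, name, int(match.group(2))))
    return results


if __name__ == "__main__":
    sample = (
        "error polling HP CQ with status 12 for wr_id 47003518529948 opcode 0\n"
        "error polling HP CQ with status 5 for wr_id 47003518767232 opcode 0\n"
    )
    for status, name, wr_id in decode_status_lines(sample):
        print(status, name, wr_id)
```

Read this way, the first error (status 12) would be the interesting one, and the run of status 5 errors after it would be fallout from the same failed connection.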

With the 'tcp' BTL, I get the following errors:

[root@zartan ~]# mpirun -np 90 -mca btl tcp -machinefile /etc/pdsh/machines /tmp/IMB-MPI1

Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x6
[0] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/libopal.so.0 [0x2adefe5248ca]
[1] func:/lib64/libpthread.so.0 [0x2adefeb2e380]
[2] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_btl_tcp.so(mca_btl_tcp_proc_remove+0xbb) [0x2adf018139ab]
[3] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_btl_tcp.so [0x2adf01811bec]
[4] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_btl_tcp.so(mca_btl_tcp_add_procs+0x155) [0x2adf0180f445]
[5] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x26b) [0x2adf011912db]
[6] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xcc) [0x2adf00f75d5c]
[7] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/libmpi.so.0(ompi_mpi_init+0x590) [0x2adefe295c90]
[8] func:/usr/x86_64-gcc-4.0.0/openmpi-1.0.2/lib64/libmpi.so.0(MPI_Init+0x83) [0x2adefe2812d3]
[9] func:/tmp/IMB-MPI1(main+0x29) [0x402eb9]
[10] func:/lib64/libc.so.6(__libc_start_main+0xdc) [0x2adefec534cc]
[11] func:/tmp/IMB-MPI1 [0x402df9]
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x6
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x6
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x6
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x6
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x6
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x6
*** End of error message ***
4 additional processes aborted (not shown)

Any thoughts or ideas on how to fix either of these?
--
 Troy Telford
