Dear OMPI list, I'm running into a problem with Open MPI 1.2 where an MPI program crashes with:
  local QP operation err (QPN 380404, WQE @ 00000583, CQN 040085, index 1147949)
    [ 0] 00380404 [ 4] 00000000 [ 8] 00000000 [ c] 00000000
    [10] 026f0000 [14] 00000000 [18] 00000583 [1c] ff000000
  [0,1,0][btl_openib_component.c:1195:btl_openib_component_progress] from n0001.yquem to: n0002.yquem
  error polling HP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id 42714736 opcode 0

Can someone interpret this for me, or suggest how to obtain more useful information? My guess is that the cause is running out of buffer space. If so, is this a bug or a limit in Open MPI? The machine is a dual 2.66 GHz Xeon cluster with InfiniBand.

Some background: the error occurs in a test case I run widely for a large electronic structure code, in the routine that gathers a large quantity of data from all of the processors onto the root node to write an output file. Each processor MPI_Send()s a number of blocks of data to root, which MPI_Recv()s them in nested loops over blocks and remote nodes. We have had problems in the past with the volume of data overwhelming other MPI implementations' buffer space during this step, so there is a synchronization step in which each remote node waits on a blocking recv for a "go ahead and send" message from root. This limits the number of data blocks (messages) each node has in flight at any one time. (A minimal sketch of this pattern is appended after my signature.)

With the default of 32 blocks at once, running on 16 nodes (so with potentially 15 x 32 = 480 outstanding messages at a time), the crash occurs. Restricting the number of blocks per node to 16 (i.e. 240 pending messages) gives a successful run with no crash.

Version 1.2 of Open MPI seems better than 1.1.5 in this respect: 1.1.5 always crashes on the 16-node run even with only 1 message sent at once from each processor. For some reason ompi 1.1.5 gives a better traceback too...

  local QP operation err (QPN 180408, WQE @ 00000703, CQN 140085, index 1309215)
    [ 0] 00180408 [ 4] 00000000 [ 8] 00000000 [ c] 00000000
    [10] 026f0000 [14] 00000000 [18] 00000703 [1c] ff000000
  [0,1,0][btl_openib_component.c:897:mca_btl_openib_component_progress] from n0001.yquem to: n0002.yquem
  error polling HP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id 40618448 opcode 0
  Signal:6 info.si_errno:0(Success) si_code:-6()
  [0] func:/data/software/x86_64/open-mpi/1.1.5-intel/lib/libopal.so.0 [0x2a95fc404c]
  [1] func:/lib64/tls/libpthread.so.0 [0x2a95a12430]
  [2] func:/lib64/tls/libc.so.6(gsignal+0x3d) [0x2a965d421d]
  [3] func:/lib64/tls/libc.so.6(abort+0xfe) [0x2a965d5a1e]
  [4] func:/data/software/x86_64/open-mpi/1.1.5-intel/lib/libmpi.so.0(mca_btl_openib_component_progress+0x751) [0x2a95be09d3]
  [5] func:/data/software/x86_64/open-mpi/1.1.5-intel/lib/libmpi.so.0(mca_bml_r2_progress+0x3a) [0x2a95bd48fc]
  [6] func:/data/software/x86_64/open-mpi/1.1.5-intel/lib/libopal.so.0(opal_progress+0x80) [0x2a95faaa06]
  [7] func:/data/software/x86_64/open-mpi/1.1.5-intel/lib/libmpi.so.0(mca_pml_ob1_recv+0x329) [0x2a95c2e679]
  [8] func:/data/software/x86_64/open-mpi/1.1.5-intel/lib/libmpi.so.0(PMPI_Recv+0x22e) [0x2a95bbdbd2]
  [9] func:/data/software/x86_64/open-mpi/1.1.5-intel/lib/libmpi.so.0(pmpi_recv_+0xd9) [0x2a95bcfbdd]
  [10] func:/home/krefson/bin/castep-4.1b(comms_mp_comms_recv_integer_+0x45) [0x10e5ae9]
  ...

I'd appreciate an opinion on whether the problem is in Open MPI or not, and on the best way to proceed.

Keith Refson
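
P.S. In case it helps to see the communication pattern concretely, below is a minimal sketch of the gather-with-throttling scheme described above. It is not the actual CASTEP code (which is Fortran); the block size, block count, tags and the BATCH limit are illustrative stand-ins.

  /* Sketch of the pattern: every non-root rank sends NBLOCKS blocks of data
   * to rank 0, but after every BATCH blocks it must wait for a "go ahead and
   * send" token from root before posting the next batch.  All sizes, counts
   * and tags here are made up for illustration. */
  #include <mpi.h>
  #include <stdlib.h>

  #define NBLOCKS   128     /* blocks sent by each remote rank (illustrative) */
  #define BLOCKLEN  4096    /* doubles per block (illustrative)               */
  #define BATCH     32      /* blocks a remote rank may have in flight        */
  #define TAG_DATA  1
  #define TAG_GO    2

  int main(int argc, char **argv)
  {
      int rank, nprocs;
      double *block = malloc(BLOCKLEN * sizeof(double));

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      if (rank == 0) {
          /* Root: nested loops over remote nodes and blocks. */
          for (int node = 1; node < nprocs; node++) {
              for (int b = 0; b < NBLOCKS; b++) {
                  MPI_Recv(block, BLOCKLEN, MPI_DOUBLE, node, TAG_DATA,
                           MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                  /* ... write the block to the output file ... */

                  /* End of a batch: tell this sender it may post the next one. */
                  if ((b + 1) % BATCH == 0 && b + 1 < NBLOCKS) {
                      int go = 1;
                      MPI_Send(&go, 1, MPI_INT, node, TAG_GO, MPI_COMM_WORLD);
                  }
              }
          }
      } else {
          /* Remote rank: first BATCH blocks go out immediately; after that,
           * block on a recv of the "go ahead" token before each new batch. */
          for (int b = 0; b < NBLOCKS; b++) {
              if (b > 0 && b % BATCH == 0) {
                  int go;
                  MPI_Recv(&go, 1, MPI_INT, 0, TAG_GO,
                           MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              }
              /* ... fill block with this rank's data ... */
              MPI_Send(block, BLOCKLEN, MPI_DOUBLE, 0, TAG_DATA, MPI_COMM_WORLD);
          }
      }

      free(block);
      MPI_Finalize();
      return 0;
  }

With 16 ranks and BATCH = 32 this corresponds to the failing case above: each of the 15 remote ranks may have up to 32 blocks outstanding at root before it has to wait for a go-ahead, i.e. up to 480 pending messages in total.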