Dear OMPI list,

I'm running into a problem with Open MPI 1.2 where an MPI program is crashing with

local QP operation err (QPN 380404, WQE @ 00000583, CQN 040085, index 1147949)
  [ 0] 00380404
  [ 4] 00000000
  [ 8] 00000000
  [ c] 00000000
  [10] 026f0000
  [14] 00000000
  [18] 00000583
  [1c] ff000000
[0,1,0][btl_openib_component.c:1195:btl_openib_component_progress] from n0001.yquem to: n0002.yquem error polling HP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id 42714736 opcode 0

Can someone interpret this for me, or suggest how to obtain more
useful information?  My guess is that the cause is running out of
buffer space.  If so, is this a bug or a limit in Open MPI?

The machine is a cluster of dual 2.66 GHz Xeon nodes with InfiniBand.

Some background: The error occurs in a test case I run widely for a
large electronic structure code, and is in the routine that gathers a
large quantity of data from all of the processors in the run onto the
root node to write an output file.  Each processor MPI_Send()s a
number of blocks of data to root, which MPI_Recv()s them in nested loops
over blocks and remote nodes.
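In outline the pattern looks something like the following much simplified
C sketch (our code is Fortran; the names, block counts and tags here are
just placeholders, not the real routine):

    #include <mpi.h>
    #include <stdlib.h>

    void gather_blocks(double *blocks, int nblocks, int blocklen,
                       MPI_Comm comm)
    {
        int rank, nprocs;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &nprocs);

        if (rank == 0) {
            double *buf = malloc((size_t)blocklen * sizeof *buf);
            /* Root receives every block from every remote node in nested
             * loops and writes it to the output file as it arrives. */
            for (int src = 1; src < nprocs; src++)
                for (int b = 0; b < nblocks; b++) {
                    MPI_Recv(buf, blocklen, MPI_DOUBLE, src, b, comm,
                             MPI_STATUS_IGNORE);
                    /* ... write buf to the output file ... */
                }
            free(buf);
        } else {
            /* Each remote node sends its blocks to root. */
            for (int b = 0; b < nblocks; b++)
                MPI_Send(&blocks[(size_t)b * blocklen], blocklen,
                         MPI_DOUBLE, 0, b, comm);
        }
    }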

We have had problems in the past with the volume of data overwhelming
other MPI implementations' buffer space during this step, and in
response there is a synchronization step which makes the remote nodes
wait on a blocking recv for a "go ahead and send" message from root.
Using this, the number of data blocks (messages) in flight at once can
be controlled.
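Very roughly, the throttled version does something like this (again a
simplified C sketch layered on the loop above; the GO token, the tags and
the max_at_once parameter are stand-ins for what the real code does, not a
literal transcription of it):

    #include <mpi.h>
    #include <stdlib.h>

    #define GO_TAG 999   /* hypothetical tag for the "go ahead" token */

    void gather_blocks_throttled(double *blocks, int nblocks, int blocklen,
                                 int max_at_once, MPI_Comm comm)
    {
        int rank, nprocs, go = 1;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &nprocs);

        if (rank == 0) {
            double *buf = malloc((size_t)blocklen * sizeof *buf);
            for (int b0 = 0; b0 < nblocks; b0 += max_at_once) {
                int batch = nblocks - b0 < max_at_once ? nblocks - b0
                                                       : max_at_once;
                /* Let every remote node send its next batch; at most
                 * (nprocs-1)*max_at_once messages are then in flight. */
                for (int src = 1; src < nprocs; src++)
                    MPI_Send(&go, 1, MPI_INT, src, GO_TAG, comm);
                for (int src = 1; src < nprocs; src++)
                    for (int b = 0; b < batch; b++)
                        MPI_Recv(buf, blocklen, MPI_DOUBLE, src, b0 + b,
                                 comm, MPI_STATUS_IGNORE);
            }
            free(buf);
        } else {
            for (int b0 = 0; b0 < nblocks; b0 += max_at_once) {
                int batch = nblocks - b0 < max_at_once ? nblocks - b0
                                                       : max_at_once;
                /* Block on a recv until root says "go ahead and send". */
                MPI_Recv(&go, 1, MPI_INT, 0, GO_TAG, comm,
                         MPI_STATUS_IGNORE);
                for (int b = 0; b < batch; b++)
                    MPI_Send(&blocks[(size_t)(b0 + b) * blocklen], blocklen,
                             MPI_DOUBLE, 0, b0 + b, comm);
            }
        }
    }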

With the default of 32 at once, running on 16 nodes (so with
potentially 15x32 = 480 outstanding messages at a time), the crash
occurs.  Restricting the number of blocks per node to 16 (i.e. 15x16 = 240
pending messages) gives a successful run with no crash.

Version 1.2 of Open MPI seems better than 1.1.5 in this respect:
1.1.5 always crashes on the 16-node run even with only 1 message sent
at once from each processor.  For some reason 1.1.5 also gives a better
traceback....

local QP operation err (QPN 180408, WQE @ 00000703, CQN 140085, index 1309215)
  [ 0] 00180408
  [ 4] 00000000
  [ 8] 00000000
  [ c] 00000000
  [10] 026f0000
  [14] 00000000
  [18] 00000703
  [1c] ff000000
[0,1,0][btl_openib_component.c:897:mca_btl_openib_component_progress] from n0001.yquem to: n0002.yquem error polling HP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id 40618448 opcode 0
Signal:6 info.si_errno:0(Success) si_code:-6()
[0] func:/data/software/x86_64/open-mpi/1.1.5-intel/lib/libopal.so.0 [0x2a95fc404c]
[1] func:/lib64/tls/libpthread.so.0 [0x2a95a12430]
[2] func:/lib64/tls/libc.so.6(gsignal+0x3d) [0x2a965d421d]
[3] func:/lib64/tls/libc.so.6(abort+0xfe) [0x2a965d5a1e]
[4] func:/data/software/x86_64/open-mpi/1.1.5-intel/lib/libmpi.so.0(mca_btl_openib_component_progress+0x751) [0x2a95be09d3]
[5] func:/data/software/x86_64/open-mpi/1.1.5-intel/lib/libmpi.so.0(mca_bml_r2_progress+0x3a) [0x2a95bd48fc]
[6] func:/data/software/x86_64/open-mpi/1.1.5-intel/lib/libopal.so.0(opal_progress+0x80) [0x2a95faaa06]
[7] func:/data/software/x86_64/open-mpi/1.1.5-intel/lib/libmpi.so.0(mca_pml_ob1_recv+0x329) [0x2a95c2e679]
[8] func:/data/software/x86_64/open-mpi/1.1.5-intel/lib/libmpi.so.0(PMPI_Recv+0x22e) [0x2a95bbdbd2]
[9] func:/data/software/x86_64/open-mpi/1.1.5-intel/lib/libmpi.so.0(pmpi_recv_+0xd9) [0x2a95bcfbdd]
[10] func:/home/krefson/bin/castep-4.1b(comms_mp_comms_recv_integer_+0x45) [0x10e5ae9]
...



I'd appreciate an opinion on whether the problem is in Open MPI or not,
and on the best way to proceed.

Keith Refson
