I have a program which uses MPI::Comm::Spawn to start processes on
compute nodes (c0-0, c0-1, etc).  The communication between the
compute nodes consists of ISend and IRecv pairs, while communication
between the compute nodes consists of gather and bcast operations.
After executing ~80 successful loops (gather/bcast pairs), I get this
error message from the head node process during a gather call:

[0,1,0][btl_openib_component.c:1332:btl_openib_component_progress]
from headnode.local to: c0-0 error polling HP CQ with status WORK
REQUEST FLUSHED ERROR status number 5 for wr_id 18504944 opcode 1

The relevant environment variables:
OMPI_MCA_btl_openib_rd_num=128
OMPI_MCA_btl_openib_verbose=1
OMPI_MCA_btl_base_verbose=1
OMPI_MCA_btl_openib_rd_low=75
OMPI_MCA_btl_base_debug=1
OMPI_MCA_btl_openib_warn_no_hca_params_found=0
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_btl=self,openib

If rd_low and rd_num are left at their default values, the program
simply hangs in the gather call after about 20 iterations (a gather
and a bcast).

Can anyone shed any light on what this error message means or what
might be done about it?

Thanks,
mch

Reply via email to