I have a program that uses MPI::Comm::Spawn to start processes on compute nodes (c0-0, c0-1, etc.). Communication among the compute nodes consists of Isend and Irecv pairs, while communication between the head node and the compute nodes consists of gather and bcast operations. After ~80 successful loops (gather/bcast pairs), I get this error message from the head node process during a gather call:
[0,1,0][btl_openib_component.c:1332:btl_openib_component_progress] from headnode.local to: c0-0 error polling HP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 18504944 opcode 1

The relevant environment variables:

OMPI_MCA_btl_openib_rd_num=128
OMPI_MCA_btl_openib_verbose=1
OMPI_MCA_btl_base_verbose=1
OMPI_MCA_btl_openib_rd_low=75
OMPI_MCA_btl_base_debug=1
OMPI_MCA_btl_openib_warn_no_hca_params_found=0
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_btl=self,openib

If rd_low and rd_num are left at their default values, the program simply hangs in the gather call after about 20 iterations (a gather and a bcast).

Can anyone shed any light on what this error message means or what might be done about it?

Thanks,
mch