We have a user whose code keeps failing at a similar point in the code. The errors (below) make me think it's a fabric problem, but ibcheckerrors is not returning any issues. He is using openmpi-1.2.0 with OFED on RHEL4.

Far field AIM propagators require(MB):    1.441955566406250
Arranging Communication structures for iterative solver...
Iteratively solving for incidence,frequency            1            1
[0,1,28][btl_openib_component.c:1199:btl_openib_component_progress] from nyx452.engin.umich.edu to: nyx440.engin.umich.edu error polling HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 57567152 opcode 0
[0,1,9][btl_openib_component.c:1199:btl_openib_component_progress] from nyx457.engin.umich.edu to: nyx439.engin.umich.edu error polling HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 59310768 opcode 0
[0,1,24][btl_openib_component.c:1199:btl_openib_component_progress] from nyx453.engin.umich.edu to: nyx439.engin.umich.edu error polling HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 243416944 opcode 0
[0,1,55][btl_openib_component.c:1199:btl_openib_component_progress] from nyx446.engin.umich.edu to: nyx439.engin.umich.edu error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 54465584 opcode 0
[0,1,60][btl_openib_component.c:1199:btl_openib_component_progress] from nyx444.engin.umich.edu

The other jobs die at the same point with similar messages but different hosts. The hosts involved do all share the same IB switch.
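
To see whether particular nodes keep turning up across the failed jobs, a quick tally of the source and destination hosts in the error messages might help. This is only a rough sketch in Python: it assumes the error lines look like the ones quoted above, and the names tally_hosts, PAIR_RE and SAMPLE are made up for illustration (SAMPLE is just the four complete messages from this job).

```python
import re
from collections import Counter

# Matches the "from <host> to: <host>" part of the openib BTL error lines.
# The pattern is an assumption based only on the messages quoted above.
PAIR_RE = re.compile(r"from (\S+) to: (\S+) error polling")

# The four complete messages from the job output above (the fifth is cut off).
SAMPLE = (
    "[0,1,28][btl_openib_component.c:1199:btl_openib_component_progress] "
    "from nyx452.engin.umich.edu to: nyx440.engin.umich.edu error polling HP CQ "
    "with status RETRY EXCEEDED ERROR status number 12 for wr_id 57567152 opcode 0\n"
    "[0,1,9][btl_openib_component.c:1199:btl_openib_component_progress] "
    "from nyx457.engin.umich.edu to: nyx439.engin.umich.edu error polling HP CQ "
    "with status RETRY EXCEEDED ERROR status number 12 for wr_id 59310768 opcode 0\n"
    "[0,1,24][btl_openib_component.c:1199:btl_openib_component_progress] "
    "from nyx453.engin.umich.edu to: nyx439.engin.umich.edu error polling HP CQ "
    "with status RETRY EXCEEDED ERROR status number 12 for wr_id 243416944 opcode 0\n"
    "[0,1,55][btl_openib_component.c:1199:btl_openib_component_progress] "
    "from nyx446.engin.umich.edu to: nyx439.engin.umich.edu error polling LP CQ "
    "with status RETRY EXCEEDED ERROR status number 12 for wr_id 54465584 opcode 0\n"
)


def tally_hosts(log_text):
    """Count how often each host appears as sender and as receiver."""
    senders, receivers = Counter(), Counter()
    for src, dst in PAIR_RE.findall(log_text):
        senders[src] += 1
        receivers[dst] += 1
    return senders, receivers


if __name__ == "__main__":
    senders, receivers = tally_hosts(SAMPLE)
    # Most frequently named hosts first, destinations then sources.
    for host, n in receivers.most_common():
        print(f"{n:4d}  to    {host}")
    for host, n in senders.most_common():
        print(f"{n:4d}  from  {host}")
```

In the excerpt above, nyx439 is the destination in three of the four complete messages. Running the same tally over the output of the other failed jobs would show whether a handful of nodes are common to all of them or whether the failures really are spread across the switch.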

Pointers?


Brock Palen
Center for Advanced Computing
bro...@umich.edu
(734)936-1985

