We have a user whose code keeps failing at a similar point in the
code. The errors (below) make me think it's a fabric problem,
but ibcheckerrors is not returning any issues. He is using
openmpi-1.2.0 with OFED on RHEL4.
Far field AIM propagators require(MB): 1.441955566406250
Arranging Communication structures for iterative solver...
Iteratively solving for incidence,frequency 1 1
[0,1,28][btl_openib_component.c:1199:btl_openib_component_progress]
from nyx452.engin.umich.edu to: nyx440.engin.umich.edu error polling
HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id
57567152 opcode 0
[0,1,9][btl_openib_component.c:1199:btl_openib_component_progress]
from nyx457.engin.umich.edu to: nyx439.engin.umich.edu error polling
HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id
59310768 opcode 0
[0,1,24][btl_openib_component.c:1199:btl_openib_component_progress]
from nyx453.engin.umich.edu to: nyx439.engin.umich.edu error polling
HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id
243416944 opcode 0
[0,1,55][btl_openib_component.c:1199:btl_openib_component_progress]
from nyx446.engin.umich.edu to: nyx439.engin.umich.edu error polling
LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id
54465584 opcode 0
[0,1,60][btl_openib_component.c:1199:btl_openib_component_progress]
from nyx444.engin.umich.edu
The other jobs die at the same point with similar messages but
different hosts. They do all share the same IB switch.
Any pointers?
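
For what it's worth, here is a rough sketch of how the from/to pairs in
messages like the ones above could be tallied to see whether one host
keeps turning up. It is plain Python and not part of the failing
application; the function name count_endpoints and the regular
expression are my own assumptions based only on the message layout in
this excerpt, not anything provided by Open MPI.

```python
import re
from collections import Counter

# Assumed layout: "... from <host> to: <host> error polling ..."
PAIR_RE = re.compile(r"from\s+(\S+)\s+to:\s+(\S+)")


def count_endpoints(log_text):
    """Return (senders, receivers) Counters for every complete from/to pair."""
    # The messages are wrapped across lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    senders, receivers = Counter(), Counter()
    for src, dst in PAIR_RE.findall(flat):
        senders[src] += 1
        receivers[dst] += 1
    return senders, receivers


# The four complete messages from the excerpt above (truncated one omitted).
SAMPLE = """
[0,1,28][btl_openib_component.c:1199:btl_openib_component_progress]
from nyx452.engin.umich.edu to: nyx440.engin.umich.edu error polling
HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id
57567152 opcode 0
[0,1,9][btl_openib_component.c:1199:btl_openib_component_progress]
from nyx457.engin.umich.edu to: nyx439.engin.umich.edu error polling
HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id
59310768 opcode 0
[0,1,24][btl_openib_component.c:1199:btl_openib_component_progress]
from nyx453.engin.umich.edu to: nyx439.engin.umich.edu error polling
HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id
243416944 opcode 0
[0,1,55][btl_openib_component.c:1199:btl_openib_component_progress]
from nyx446.engin.umich.edu to: nyx439.engin.umich.edu error polling
LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id
54465584 opcode 0
"""

if __name__ == "__main__":
    senders, receivers = count_endpoints(SAMPLE)
    # List destination hosts, most frequently named first.
    for host, n in receivers.most_common():
        print(host, n)
```

In the excerpt above, nyx439 is the destination in three of the four
complete messages, so running the same tally over the full output of
each failed job might show whether a single node or port is the common
factor.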
Brock Palen
Center for Advanced Computing
bro...@umich.edu
(734)936-1985