Dear all,

I would like to ask for help with understanding an error message I get
when communication using Open MPI 1.4.1 over InfiniBand fails. After
several hours of operation, communication with one particular node
(f24) fails with something like:

[[20265,1],79][btl_openib_component.c:2951:handle_wc] from f05 to: f24
error polling LP CQ with status INVALID REQUEST ERROR status number 9
for wr_id 309134592 opcode 1  vendor error 138 qp_idx 2
[[20265,1],39][btl_openib_component.c:2951:handle_wc] from f24 to: f05
error polling LP CQ with status WORK REQUEST FLUSHED ERROR status
number 5 for wr_id 313731584 opcode 1  vendor error 249 qp_idx 2

This is reproducible in the sense that it happens repeatedly, but so
far I have not been able to create a test case that triggers the
problem. It happens after hours of smooth operation. One of the nodes
involved is always f24. When I leave it out of the job, I get a stable
run with no trouble. Is this a hardware error or something else? Is
there something I can do to locate the problem more precisely? Where
can I find what the error codes mean?

Thanks,
Ondrej Marsalek
