RNR means "receiver not ready": on the receive side, MPI has no buffers
posted to accept the incoming data.
It may point to a broken configuration in MPI/ofud or to a credit leak in
the OFUD code.
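
To illustrate the idea, here is a toy model of credit-based flow control. It is only a sketch and not Open MPI code: the Receiver and Sender classes and the run() helper are made up for this example. It assumes that the accounting bug shows up as the sender believing it holds more credits than the receiver has buffers posted, which is one way a credit problem can surface as RNR. On real hardware the send would be retried a limited number of times before the completion fails with the "RNR retry exceeded" status.

```python
class Receiver:
    """Toy receiver with a fixed number of pre-posted receive buffers."""

    def __init__(self, num_buffers):
        self.free_buffers = num_buffers

    def deliver(self, msg):
        # With no buffer posted, real hardware would answer with an RNR NAK.
        if self.free_buffers == 0:
            return False
        self.free_buffers -= 1
        return True


class Sender:
    """Toy sender that tracks how many sends it believes are allowed."""

    def __init__(self, credits):
        self.credits = credits

    def send(self, receiver, msg):
        if self.credits == 0:
            return "stalled"  # correct behaviour: wait for credits to return
        self.credits -= 1
        return "ok" if receiver.deliver(msg) else "RNR"


def run(num_buffers, sender_credits, num_messages):
    """Send num_messages and record the outcome of each send."""
    rx = Receiver(num_buffers)
    tx = Sender(sender_credits)
    return [tx.send(rx, i) for i in range(num_messages)]


# Credits match the posted buffers: the fifth send waits instead of failing.
in_sync = run(num_buffers=4, sender_credits=4, num_messages=5)

# Credits exceed the posted buffers: the fifth send hits an empty receiver.
out_of_sync = run(num_buffers=4, sender_credits=6, num_messages=5)

print(in_sync)
print(out_of_sync)
```

With matching counts the sender stalls once its credits run out; with mismatched counts the extra send reaches a receiver that has nothing posted.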
Åke Sandgren wrote:
Hi!
I'm having a problem with getting the "error polling LP CQ with status
RNR..." error on an otherwise completely empty system.
There are no errors visible in the error counters in any of the HCAs or
switches or anywhere else.
I'm running OMPI 1.3.2 built with PathScale 3.2.
If I add -mca btl 'ofud,self,sm' the same code works ok.
It usually only shows up on runs with nodes=16:ppn=8 or higher, i.e. 8x8
works ok.
This might very well be a pathscale problem since when running with the
debug version of ompi 1.3.2 the problem goes away.
Complete error is:
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR
status number 13 for wr_id 465284992 opcode -1 vendor error 135 qp_idx 0
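
For reference, status number 13 is IBV_WC_RNR_RETRY_EXC_ERR in libibverbs' enum ibv_wc_status. The sketch below pulls the numeric fields out of the message above; parse_cq_error() is a hypothetical helper written for this example, and the table lists only a small subset of the enum.

```python
import re

# Small subset of libibverbs' enum ibv_wc_status (numeric values from verbs.h).
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}

FIELD_NAMES = ("status number", "wr_id", "opcode", "vendor error", "qp_idx")


def parse_cq_error(text):
    """Extract the numeric fields from an 'error polling ... CQ' message."""
    flat = " ".join(text.split())  # undo e-mail line wrapping
    fields = {}
    for name in FIELD_NAMES:
        match = re.search(re.escape(name) + r"\s+(-?\d+)", flat)
        if match:
            fields[name] = int(match.group(1))
    # Map the numeric status to its enum name where the subset knows it.
    fields["status name"] = IBV_WC_STATUS.get(fields.get("status number"), "unknown")
    return fields


message = """error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR
status number 13 for wr_id 465284992 opcode -1 vendor error 135 qp_idx
0"""

info = parse_cq_error(message)
print(info)
```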
Any ideas as to where in the ompi code I should start reducing optimization
levels to pinpoint this?
I'll try some more tests tomorrow with a hopefully fresh mind...