On Thu, 2009-05-14 at 09:24 -0400, Jeff Squyres wrote:
> On May 13, 2009, at 4:55 PM, Åke Sandgren wrote:
>
> > I'm having problem with getting the "error polling LP CQ with status
> > RNR..." on an otherwise completely empty system.
> > There are no errors visible in the error counters in any of the HCAs or
> > switches or anywhere else.
> >
> > I'm running OMPI 1.3.2 built with pathscale 3.2
> >
> > If i add -mca btl 'ofud,self,sm' the same code works ok.
> >
>
> Interesting. I have only done very limited testing with ofud; are you
> saying that you get these errors if you "--mca btl openib,sm,self"?
I think i have tested it but at the moment i'm not sure.
I will do more tests later. (Busy doing firmware upgrades...)

> > It usually only shows up on runs with nodes=16:ppn=8 or higher, i.e. 8x8
> > works ok.
> >
> > This might very well be a pathscale problem since when running with the
> > debug version of ompi 1.3.2 the problem goes away.
> >
> > Complete error is:
> > error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR
> > status number 13 for wr_id 465284992 opcode -1 vendor error 135 qp_idx 0
> >
> > Any ideas to where in the ompi code i should start reducing optimization
> > levels to pinpoint this?
> >
>
> Do you have a simple reproducer test case, perchance?

Unfortunately no. Have only seen this reproducibly on large jobs.

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se