Re: [OMPI users] openib RETRY EXCEEDED ERROR

2009-03-04 Thread Jeff Squyres
On Mar 1, 2009, at 7:24 PM, Brett Pemberton wrote: I'd appreciate some advice on if I'm using OFED correctly. I'm running OFED 1.4, however not the kernel modules, just userland. Is this a bad idea? I believe so. I'm not a kernel guy, but I've always used the userland bits matched with th

Re: [OMPI users] openib RETRY EXCEEDED ERROR

2009-03-01 Thread Brett Pemberton
Matt Hughes wrote: 2009/2/26 Brett Pemberton : [[1176,1],0][btl_openib_component.c:2905:handle_wc] from tango092.vpac.org to: tango090 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 38996224 opcode 0 qp_idx 0 What OS are you using? Centos 5 I've seen this

Re: [OMPI users] openib RETRY EXCEEDED ERROR

2009-02-27 Thread Jeff Squyres
On Feb 27, 2009, at 12:09 PM, Åke Sandgren wrote: We see these errors fairly frequently on our CentOS 5.2 system with Mellanox InfiniHost III cards. The OFED stack is whatever the CentOS5.2 uses. Has anyone tested that with the 1.4 OFED stack? FWIW, I have tested OMPI's openib BTL with sev

Re: [OMPI users] openib RETRY EXCEEDED ERROR

2009-02-27 Thread Pavel Shamis (Pasha)
Usually "retry exceeded error" points to some network issues, like bad cable or some bad connector. You may use ibdiagnet tool for the network debug - http://linux.die.net/man/1/ibdiagnet. This tool is part of OFED. Pasha Brett Pemberton wrote: Hey, I've had a couple of errors recently, of

Re: [OMPI users] openib RETRY EXCEEDED ERROR

2009-02-27 Thread Åke Sandgren
On Fri, 2009-02-27 at 09:54 -0700, Matt Hughes wrote: > 2009/2/26 Brett Pemberton : > > [[1176,1],0][btl_openib_component.c:2905:handle_wc] from tango092.vpac.org > > to: tango090 error polling LP CQ with status RETRY EXCEEDED ERROR status > > number 12 for wr_id 38996224 opcode 0 qp_idx 0 > > Wha

Re: [OMPI users] openib RETRY EXCEEDED ERROR

2009-02-27 Thread Matt Hughes
2009/2/26 Brett Pemberton : > [[1176,1],0][btl_openib_component.c:2905:handle_wc] from tango092.vpac.org > to: tango090 error polling LP CQ with status RETRY EXCEEDED ERROR status > number 12 for wr_id 38996224 opcode 0 qp_idx 0 What OS are you using? I've seen this error and many other Infiniban

Re: [OMPI users] openib RETRY EXCEEDED ERROR

2009-02-27 Thread Biagio Lucini
Bogdan Costescu wrote: Brett Pemberton wrote: [[1176,1],0][btl_openib_component.c:2905:handle_wc] from tango092.vpac.org to: tango090 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 38996224 opcode 0 qp_idx 0 I've seen this error with Mellanox ConnectX cards

Re: [OMPI users] openib RETRY EXCEEDED ERROR

2009-02-27 Thread Bogdan Costescu
Brett Pemberton wrote: [[1176,1],0][btl_openib_component.c:2905:handle_wc] from tango092.vpac.org to: tango090 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 38996224 opcode 0 qp_idx 0 I've seen this error with Mellanox ConnectX cards and OFED 1.2.x with al

[OMPI users] openib RETRY EXCEEDED ERROR

2009-02-26 Thread Brett Pemberton
Hey, I've had a couple of errors recently, of the form: [[1176,1],0][btl_openib_component.c:2905:handle_wc] from tango092.vpac.org to: tango090 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 38996224 opcode 0 qp_idx 0 --
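Since replies in this thread suggest the error often traces back to a specific cable or connector, it can help to tally which host pairs appear in these messages before running network diagnostics. The following is only a rough sketch, not part of Open MPI or OFED: the regular expression is modelled on the single log line quoted above, and the names `parse_openib_error` and `count_host_pairs` are hypothetical helpers invented for illustration. Status number 12 corresponds to `IBV_WC_RETRY_EXC_ERR` in the libibverbs `ibv_wc_status` enum.

```python
import re
from collections import Counter

# Pattern modelled on the one log line quoted in this thread (an assumption:
# other Open MPI versions may format the message differently).
OPENIB_ERROR = re.compile(
    r"from (?P<src>\S+)\s+to: (?P<dst>\S+) error polling \S+ CQ "
    r"with status (?P<status>.+?) status number (?P<number>\d+)"
)


def parse_openib_error(line):
    """Return source host, destination host and status, or None if no match."""
    match = OPENIB_ERROR.search(line)
    if match is None:
        return None
    return {
        "src": match.group("src"),
        "dst": match.group("dst"),
        "status": match.group("status"),
        "number": int(match.group("number")),
    }


def count_host_pairs(lines):
    """Count how often each (src, dst) pair reports a RETRY EXCEEDED error."""
    pairs = Counter()
    for line in lines:
        parsed = parse_openib_error(line)
        # 12 is IBV_WC_RETRY_EXC_ERR in libibverbs' ibv_wc_status enum.
        if parsed is not None and parsed["number"] == 12:
            pairs[(parsed["src"], parsed["dst"])] += 1
    return pairs


if __name__ == "__main__":
    sample = (
        "[[1176,1],0][btl_openib_component.c:2905:handle_wc] from "
        "tango092.vpac.org to: tango090 error polling LP CQ with status "
        "RETRY EXCEEDED ERROR status number 12 for wr_id 38996224 "
        "opcode 0 qp_idx 0"
    )
    for pair, count in count_host_pairs([sample]).items():
        print(pair, count)
```

A host pair that shows up repeatedly would be a reasonable first candidate for checking with ibdiagnet, as suggested in the replies.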