[OMPI users] RETRY EXCEEDED ERROR
I found several reports on the Open MPI users mailing list from users who need to bump up the default value for btl_openib_ib_timeout. We also have some applications on our cluster that have problems unless we raise this value from the default of 10 to 15:

  [24426,1],122][btl_openib_component.c:2905:handle_wc] from shc174 to: shc175
  error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for
  wr_id 250450816 opcode 11048 qp_idx 3

This is seen with Open MPI 1.3 and OpenFabrics 1.4.

Is this normal, or is it an indicator of other problems, perhaps related to hardware? Are there other parameters that need to be looked at too?

Thanks for any insight on this!

Regards,
Jan Lindheim
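For reference, btl_openib_ib_timeout is an Open MPI MCA parameter, so it can be raised per job on the mpirun command line or made permanent in an MCA parameter file. A minimal sketch (the application name, process count, and host file below are placeholders, not from this report):

  # per-job: raise the IB retry timeout (the InfiniBand local ACK timeout
  # is 4.096 us * 2^value) from the default 10 to 15
  mpirun --mca btl_openib_ib_timeout 15 -np 64 -hostfile hosts ./my_app

  # or cluster-wide, in $prefix/etc/openmpi-mca-params.conf:
  btl_openib_ib_timeout = 15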
Re: [OMPI users] RETRY EXCEEDED ERROR
On Wed, Mar 04, 2009 at 04:02:06PM -0500, Jeff Squyres wrote:
> This *usually* indicates a physical / layer 0 problem in your IB
> fabric. You should do a diagnostic on your HCAs, cables, and switches.
>
> Increasing the timeout value should only be necessary on very large IB
> fabrics and/or very congested networks.

Thanks Jeff! What is considered a very large IB fabric? I assume that with just over 180 compute nodes, our cluster does not fall into this category.

Jan
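For readers wanting to run the kind of check Jeff suggests, the infiniband-diags tools shipped with OFED are the usual starting point. A rough sketch (tool names can vary slightly between OFED releases, e.g. iblinkinfo vs. iblinkinfo.pl):

  # state, firmware level and negotiated link rate of the local HCA
  ibstat

  # walk the fabric and report the width/speed of every link, which
  # makes degraded links (e.g. 1X width or SDR speed) easy to spot
  iblinkinfo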
Re: [OMPI users] RETRY EXCEEDED ERROR
On Wed, Mar 04, 2009 at 04:34:49PM -0500, Jeff Squyres wrote:
> On Mar 4, 2009, at 4:16 PM, Jan Lindheim wrote:
> > What is considered to be very large IB fabrics?
> > I assume that with just over 180 compute nodes,
> > our cluster does not fall into this category.
>
> I was a little misleading in my note -- I should clarify. It's really
> congestion that matters, not the size of the fabric. Congestion is
> potentially more likely to happen in larger fabrics, since packets may
> have to flow through more switches, there's likely more apps running
> on the cluster, etc. But it's all very application/cluster-specific;
> only you can know if your fabric is heavily congested based on what
> you run on it, etc.
>
> --
> Jeff Squyres
> Cisco Systems

Thanks again Jeff! Time to dig up the diagnostic tools and look at port statistics.

Jan
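The per-port error counters behind those statistics can be read with perfquery from infiniband-diags. A minimal sketch, with the LID and port number as placeholders:

  # dump the performance and error counters for port 1 of the node or
  # switch that owns LID 42
  perfquery 42 1

  # same, but reset the counters after reading so that any new errors
  # are easy to attribute to the next job
  perfquery -R 42 1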
Re: [OMPI users] RETRY EXCEEDED ERROR
On Thu, Mar 05, 2009 at 10:27:27AM +0200, Pavel Shamis (Pasha) wrote:
> You may use the ibdiagnet tool for the network debug -
> http://linux.die.net/man/1/ibdiagnet. This tool is part of OFED.
>
> Pasha.

Thanks Pasha!

ibdiagnet reports the following:

  -I---
  -I- IPoIB Subnets Check
  -I---
  -I- Subnet: IPv4 PKey:0x7fff QKey:0x0b1b MTU:2048Byte rate:10Gbps SL:0x00
  -W- Port localhost/P1 lid=0x00e2 guid=0x001e0b4ced75 dev=25218
      can not join due to rate:2.5Gbps < group:10Gbps

I guess this may indicate a bad adapter. Now I just need to find which system it maps to.

I also ran ibcheckerrors, and it reports a lot of problems with buffer overruns. Here is the tail end of the output, with only some of the last ports reported:

  #warn: counter SymbolErrors = 36905 (threshold 10) lid 193 port 14
  #warn: counter LinkDowned = 23 (threshold 10) lid 193 port 14
  #warn: counter RcvErrors = 15641 (threshold 10) lid 193 port 14
  #warn: counter RcvSwRelayErrors = 225 (threshold 100) lid 193 port 14
  #warn: counter ExcBufOverrunErrors = 10 (threshold 10) lid 193 port 14
  Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 14: FAILED
  #warn: counter LinkRecovers = 181 (threshold 10) lid 193 port 1
  #warn: counter RcvSwRelayErrors = 2417 (threshold 100) lid 193 port 1
  Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 1: FAILED
  #warn: counter LinkRecovers = 103 (threshold 10) lid 193 port 3
  #warn: counter RcvErrors = 9035 (threshold 10) lid 193 port 3
  #warn: counter RcvSwRelayErrors = 64670 (threshold 100) lid 193 port 3
  Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 3: FAILED
  #warn: counter SymbolErrors = 13151 (threshold 10) lid 193 port 4
  #warn: counter RcvErrors = 109 (threshold 10) lid 193 port 4
  #warn: counter RcvSwRelayErrors = 507 (threshold 100) lid 193 port 4
  Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 4: FAILED

  ## Summary: 209 nodes checked, 0 bad nodes found
  ## 716 ports checked, 103 ports have errors beyond threshold

I wonder if this is something that needs to be tuned in the InfiniBand switch, or if there is something in Open MPI/OpenIB that can be tuned.

Thanks,
Jan Lindheim
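Assuming a stock OFED install, the same diagnostics suite can map the flagged port back to a host and help separate stale counters from live errors. A sketch (the LID and GUID below are simply reused from the ibdiagnet warning; none of these commands appear in the thread):

  # show the node description (usually the hostname) behind the warned
  # LID 0x00e2 (226 decimal)
  smpquery nodedesc 226

  # or search a full topology dump for the suspect port GUID
  ibnetdiscover | grep -i 001e0b4ced75

  # clear all error counters fabric-wide, re-run the workload, then run
  # ibcheckerrors again; counters that climb back up point at live
  # problems (bad cable or port) rather than history since the last reset
  ibclearerrors
  ibcheckerrors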