Re: [OMPI users] RETRY EXCEEDED ERROR status number 12

2009-08-21 Thread Pavel Shamis (Pasha)
You may try to use ibdiagnet tool: http://linux.die.net/man/1/ibdiagnet The tool is part of OFED (http://www.openfabrics.org/) Pasha. Prentice Bisbal wrote: Several jobs on my cluster just died with the error below. Are there any IB/Open MPI diagnostics I should use to diagnose, should I just

[OMPI users] RETRY EXCEEDED ERROR status number 12

2009-08-21 Thread Prentice Bisbal
Several jobs on my cluster just died with the error below. Are there any IB/Open MPI diagnostics I should use to diagnose, should I just reboot the nodes, or should I have the user who submitted these jobs just increase the retry count/timeout parameters? [0,1,6][../../../../../ompi/mca/btl/openi

Re: [OMPI users] RETRY EXCEEDED ERROR

2009-03-05 Thread Pavel Shamis (Pasha)
Thanks Pasha! ibdiagnet reports the following:
-I---
-I- IPoIB Subnets Check
-I---
-I- Subnet: IPv4 PKey:0x7fff QKey:0x0b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Port localhost/P1 lid=0x00e2 guid=

Re: [OMPI users] RETRY EXCEEDED ERROR

2009-03-05 Thread Jan Lindheim
On Thu, Mar 05, 2009 at 10:27:27AM +0200, Pavel Shamis (Pasha) wrote: > > >Time to dig up diagnostics tools and look at port statistics. > > > You may use ibdiagnet tool for the network debug - > http://linux.die.net/man/1/ibdiagnet. This tool is part of OFED. > > Pasha. > __

Re: [OMPI users] RETRY EXCEEDED ERROR

2009-03-05 Thread Pavel Shamis (Pasha)
Time to dig up diagnostics tools and look at port statistics. You may use ibdiagnet tool for the network debug - http://linux.die.net/man/1/ibdiagnet. This tool is part of OFED. Pasha.

Re: [OMPI users] RETRY EXCEEDED ERROR

2009-03-04 Thread Jan Lindheim
On Wed, Mar 04, 2009 at 04:34:49PM -0500, Jeff Squyres wrote: > On Mar 4, 2009, at 4:16 PM, Jan Lindheim wrote: > > >On Wed, Mar 04, 2009 at 04:02:06PM -0500, Jeff Squyres wrote: > >> This *usually* indicates a physical / layer 0 problem in your IB > >> fabric. You should do a diagnostic on your

Re: [OMPI users] RETRY EXCEEDED ERROR

2009-03-04 Thread Jeff Squyres
On Mar 4, 2009, at 4:16 PM, Jan Lindheim wrote: On Wed, Mar 04, 2009 at 04:02:06PM -0500, Jeff Squyres wrote: > This *usually* indicates a physical / layer 0 problem in your IB > fabric. You should do a diagnostic on your HCAs, cables, and switches. > > Increasing the timeout value should on

Re: [OMPI users] RETRY EXCEEDED ERROR

2009-03-04 Thread Jan Lindheim
On Wed, Mar 04, 2009 at 04:02:06PM -0500, Jeff Squyres wrote: > This *usually* indicates a physical / layer 0 problem in your IB > fabric. You should do a diagnostic on your HCAs, cables, and switches. > > Increasing the timeout value should only be necessary on very large IB > fabrics and/or

Re: [OMPI users] RETRY EXCEEDED ERROR

2009-03-04 Thread Jeff Squyres
This *usually* indicates a physical / layer 0 problem in your IB fabric. You should do a diagnostic on your HCAs, cables, and switches. Increasing the timeout value should only be necessary on very large IB fabrics and/or very congested networks. On Mar 4, 2009, at 3:28 PM, Jan Lindheim w

[OMPI users] RETRY EXCEEDED ERROR

2009-03-04 Thread Jan Lindheim
I found several reports on the Open MPI users mailing list from users who need to bump up the default value of btl_openib_ib_timeout. We also have some applications on our cluster that have problems unless we raise this value from the default of 10 to 15: [24426,1],122][btl_openib_component.c:2905:
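
For a sense of what the change from 10 to 15 means in practice: the InfiniBand local ACK timeout is an exponent, not a duration, and the wait per attempt is 4.096 microseconds * 2^btl_openib_ib_timeout. The sketch below only does that arithmetic. The function name ack_timeout_ms and the accepted range of 1..31 are choices made for this illustration, not part of Open MPI.

```python
# Minimal sketch: convert a btl_openib_ib_timeout exponent into the
# per-attempt ACK timeout, using 4.096 us * 2**value.
# ack_timeout_ms is a hypothetical helper name, not an Open MPI API.

BASE_US = 4.096  # base unit of the InfiniBand local ACK timeout, in microseconds


def ack_timeout_ms(ib_timeout: int) -> float:
    """Return the per-attempt ACK timeout in milliseconds."""
    # This sketch only handles exponents 1..31; other values are rejected.
    if not 1 <= ib_timeout <= 31:
        raise ValueError("ib_timeout must be in the range 1..31")
    return BASE_US * (2 ** ib_timeout) / 1000.0


if __name__ == "__main__":
    for value in (10, 15):
        print(f"btl_openib_ib_timeout={value}: {ack_timeout_ms(value):.3f} ms per attempt")
```

Each step of the exponent doubles the wait, so going from 10 to 15 lengthens the per-attempt timeout by a factor of 32 (roughly 4.2 ms to roughly 134 ms). The value is normally set at launch time as an MCA parameter, for example mpirun --mca btl_openib_ib_timeout 15 followed by the usual arguments.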