For what it's worth, Andrew, the RETRY_EXCEEDED_ERRORS can be caused by
flaky hardware as well.  The timeout value is probably best tuned
relative to the size of your IB fabric, but if reliability is your
main criterion, crank the timeout value up to 20.  That's the best
you can do.  If the errors continue, you more than likely have
a flaky HCA, IB link, switch-side software, or node.  We have far more
IB hardware than any sane person should, and in my experience
RETRY_EXCEEDED_ERRORS can be really tricky to track down.  One
of my favorite causes is the spontaneously rebooting node: we sometimes
see nodes under heavy MPI application load reboot at random, which
causes the RETRY_EXCEEDED_ERROR as well.  I second the recommendation to
watch the IB counters across the entire IB fabric from the subnet
manager.
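
To put the timeout values discussed in this thread (10, 14, 16, 20) in
perspective, here is a rough sketch of what each one means in wall-clock
terms.  It assumes the value is the InfiniBand Local ACK Timeout
exponent, where the per-attempt timeout is 4.096 microseconds times
2^value.  The function name is made up for illustration.  I believe the
matching Open MPI knob is the btl_openib_ib_timeout MCA parameter, but
treat that name as an assumption and confirm it with ompi_info on your
install.

```python
# Sketch only: converts an IB timeout exponent into seconds per attempt,
# assuming the Local ACK Timeout formula 4.096 us * 2**value.
BASE_US = 4.096  # base unit in microseconds


def ack_timeout_seconds(value: int) -> float:
    """Return the per-attempt ACK timeout in seconds for a given exponent."""
    # The field is 5 bits wide; 0 is treated specially by the hardware,
    # so this sketch only handles 1..31.
    if not 1 <= value <= 31:
        raise ValueError("timeout exponent must be in the range 1..31")
    return BASE_US * (2 ** value) / 1_000_000


if __name__ == "__main__":
    # The values mentioned in this thread.
    for v in (10, 14, 16, 20):
        print(f"timeout={v:2d} -> {ack_timeout_seconds(v):.6f} s per attempt")
```

Each step up doubles the wait, so going from 10 to 20 takes the
per-attempt timeout from about 4 milliseconds to about 4.3 seconds.  The
retry count multiplies that again before the error is actually raised.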

Good luck!

> -----Original Message-----
> From: users-boun...@open-mpi.org 
> [mailto:users-boun...@open-mpi.org] On Behalf Of Andrew Friedley
> Sent: Wednesday, November 28, 2007 9:36 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenIB problems
> 
> What value do you suggest then?  I know I've seen the problem persist
> at values of 14 and 16, and would rather be certain that this isn't
> going to kill the job that just sat in the queue for a week.
> 
> Andrew
> 
> Jeff Squyres wrote:
> > Roland thought that the default value of 10 might be a bit too low
> > and that tuning it to be higher, particularly in apps that pound on
> > a single port, would probably be acceptable.
> > 
> > Tuning up to 20 is probably a bit overkill.
> > 
> > 

