For what it's worth, Andrew, the RETRY_EXCEEDED_ERRORs can be caused by flaky hardware as well. The timeout value is probably best tuned relative to the size of your IB fabric. But if reliability is the biggest criterion, crank the timeout value up to 20. That's the best you can do. If it continues to happen, more than likely you have a flaky HCA, IB link, switch-side sw, or node.

We actually have way too much IB hardware for any sane person, and my experience is that RETRY_EXCEEDED_ERRORs can sometimes be really tricky to track down. One of my favorites is the spontaneously rebooting node: we sometimes see nodes under heavy MPI application load reboot at random, which causes the RETRY_EXCEEDED_ERROR as well. I would second the recommendation to watch the IB counters across the entire IB fabric from the subnet manager.
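To put the values from this thread (10, 14, 16, 20) in perspective, here is a rough sketch of what they mean in wall-clock terms. It assumes the usual InfiniBand encoding of the local ACK timeout, 4.096 microseconds * 2^value, which as far as I know is what Open MPI's btl_openib_ib_timeout parameter feeds to the HCA. The function name is made up for illustration, and real hardware may round the interval up, so treat the numbers as estimates.

```python
# Sketch: convert an InfiniBand local ACK timeout exponent to seconds.
# Assumption: per-attempt timeout = 4.096 us * 2**value (IB encoding).
# ack_timeout_seconds is a hypothetical helper, not an Open MPI API.

BASE_SECONDS = 4.096e-6  # 4.096 microseconds


def ack_timeout_seconds(value: int) -> float:
    """Return the per-attempt ACK timeout in seconds for an exponent 1..31."""
    # The field is 5 bits wide; 0 is excluded here because it has a
    # special meaning (no timeout) rather than 4.096 us.
    if not 1 <= value <= 31:
        raise ValueError("timeout exponent must be in the range 1..31")
    return BASE_SECONDS * (2 ** value)


if __name__ == "__main__":
    # The values discussed in this thread.
    for value in (10, 14, 16, 20):
        print(f"timeout={value:2d} -> {ack_timeout_seconds(value):.6f} s per attempt")
```

By this estimate, 10 is about 4.2 ms per attempt, 14 about 67 ms, 16 about 268 ms, and 20 about 4.3 s. The HCA also retries a number of times before raising the error, so the total wait before a RETRY_EXCEEDED_ERROR is several times the per-attempt figure.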
Good luck!

> -----Original Message-----
> From: users-boun...@open-mpi.org
> [mailto:users-boun...@open-mpi.org] On Behalf Of Andrew Friedley
> Sent: Wednesday, November 28, 2007 9:36 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenIB problems
>
> What value do you suggest then? I know I've seen the problem persist at
> values of 14 and 16, and would rather be certain that this isn't going
> to kill the job that just sat in the queue for a week.
>
> Andrew
>
> Jeff Squyres wrote:
> > Roland thought that the default value of 10 might be a bit too low and
> > that tuning it to be higher, particularly in apps that pound on a
> > single port, would probably be acceptable.
> >
> > Tuning up to 20 is probably a bit overkill.