What value do you suggest then? I know I've seen the problem persist at
values of 14 and 16, and would rather be certain that this isn't going
to kill the job that just sat in the queue for a week.
Andrew
Jeff Squyres wrote:
Roland thought that the default value of 10 might be a bit too low and
that tuning it to be higher, particularly in apps that pound on a
single port, would probably be acceptable.
Tuning up to 20 is probably a bit overkill.
On Nov 27, 2007, at 3:54 PM, Jeff Squyres wrote:
BTW, Andrew is correct about the unit for btl_openib_ib_timeout and
that the value is simply passed down to the verbs library when
making an IB connection. Open MPI does nothing else with that
value; it's an IBTA-defined value.
The help message was wrong on the 1.2 branch for a while; I think
it's been corrected in more recent versions of OMPI (i.e., >1.2 -- I
don't recall which version specifically).
On Nov 27, 2007, at 3:19 PM, Andrew Friedley wrote:
Brock Palen wrote:
What would be a place to look? Should this just be the default then for
OMPI? ompi_info shows the default as 10 seconds? Is 'seconds' right?
The other IB guys can probably answer better than I can -- I'm not an
expert in this part of IB (or really any part I guess :). Not sure why
a larger value isn't the default. No, it's not seconds -- check the
description of the MCA parameter:
4.096 microseconds * (2^btl_openib_ib_timeout)
You sure?
ompi_info --param btl openib
MCA btl: parameter "btl_openib_ib_timeout" (current value: "10")
InfiniBand transmit timeout, in seconds
(must be >= 1)
Yeah:
MCA btl: parameter "btl_openib_ib_timeout" (current value: "10")
InfiniBand transmit timeout, plugged into formula:
4.096 microseconds * (2^btl_openib_ib_timeout) (must be
>= 0 and <= 31)
Reading earlier in the thread, you said OMPI v1.2.0; I got this from a
trunk checkout that's around 3 weeks old. A quick check shows this
description was changed between 1.2.0 and 1.2.1. However the use of
this parameter hasn't changed -- it's simply passed along to IB verbs
when creating a queue pair (aka a connection).
Andrew
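For a sense of scale, the formula quoted above can be evaluated for the
values mentioned in this thread (10, 14, 16 and 20). This is only a rough
sketch of the arithmetic: the function name ib_timeout_seconds is
hypothetical and not part of Open MPI or the verbs library, and the 0..31
range check simply mirrors the help text.

```python
def ib_timeout_seconds(exponent: int) -> float:
    """Evaluate 4.096 microseconds * 2**exponent and return it in seconds.

    Hypothetical helper for illustration only; it reproduces the formula
    from the btl_openib_ib_timeout help text and nothing else.
    """
    if not 0 <= exponent <= 31:
        # Range taken from the help text quoted in the thread.
        raise ValueError("btl_openib_ib_timeout must be between 0 and 31")
    return 4.096e-6 * (2 ** exponent)


if __name__ == "__main__":
    # Values discussed in the thread: the default, two that were tried,
    # and the one described as probably overkill.
    for value in (10, 14, 16, 20):
        seconds = ib_timeout_seconds(value)
        print(f"btl_openib_ib_timeout={value:2d} -> {seconds:.6f} s")
```

By this arithmetic the default of 10 works out to roughly 4.2 ms, 14 to
about 67 ms, 16 to about 0.27 s, and 20 to about 4.3 s, since each
increment of the parameter doubles the timeout.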
--
Jeff Squyres
Cisco Systems