Brock Palen wrote:
On Nov 27, 2007, at 10:49 AM, Andrew Friedley wrote:
Brock Palen wrote:
On Nov 21, 2007, at 3:39 PM, Andrew Friedley wrote:

If this is what I think it is, try using this MCA parameter:

-mca btl_openib_ib_timeout 20
The user used this option and it allowed the run to complete.
You say it's an issue with the fabric, but ibshowerrors does not show any
problems.

It's all Topspin (Cisco) gear: NICs, switch, cables.
Should I follow up with Cisco more?
Sure, why not, if you think it'd be useful. FWIW, I see this on
Voltaire/Mellanox hardware with Open MPI; others here at LLNL tell me
they've seen it with MVAPICH as well.

Where would be a place to look? Should this just be the default for OMPI, then? ompi_info shows the default as 10 seconds. Is that right, 'seconds'?

The other IB guys can probably answer better than I can -- I'm not an expert in this part of IB (or really any part, I guess :). Not sure why a larger value isn't the default. No, it's not seconds -- check the description of the MCA parameter:

4.096 microseconds * (2^btl_openib_ib_timeout)

Andrew
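
For anyone wanting to sanity-check the numbers, the formula quoted above can be evaluated directly. This is only a rough sketch of the arithmetic: the helper name ib_timeout_seconds is made up for illustration and is not part of Open MPI, and the two inputs are the default reported by ompi_info (10) and the value suggested earlier in the thread (20).

```python
# Base unit from the MCA parameter description, in microseconds.
BASE_US = 4.096


def ib_timeout_seconds(exponent: int) -> float:
    """Return 4.096 us * 2**exponent, converted to seconds.

    Hypothetical helper for illustration only; not an Open MPI API.
    """
    return BASE_US * (2 ** exponent) / 1e6


# Compare the default value (10) with the value suggested in this thread (20).
for value in (10, 20):
    print(f"btl_openib_ib_timeout={value}: {ib_timeout_seconds(value):.6f} s")
```

So a setting of 10 works out to roughly 4.2 milliseconds, while 20 is roughly 4.3 seconds. Each increment of the parameter doubles the timeout.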
