Brock Palen wrote:
On Nov 27, 2007, at 10:49 AM, Andrew Friedley wrote:
Brock Palen wrote:
On Nov 21, 2007, at 3:39 PM, Andrew Friedley wrote:
If this is what I think it is, try using this MCA parameter:
-mca btl_openib_ib_timeout 20
The user used this option and it allowed the run to complete.
You say it's an issue with the fabric, but ibshowerrors does not show any
problems.
It's all Topspin (Cisco) gear: NICs, switch, cables.
Should I follow up with Cisco more?
Sure, why not, if you think it'd be useful. FWIW, I see this on
Voltaire/Mellanox hardware with Open MPI; others here at LLNL tell me
they've seen it with MVAPICH as well.
Where would be a place to look? Should this just be the default for
OMPI, then? ompi_info shows the default as 10 seconds? Is that right,
'seconds'?
The other IB guys can probably answer better than I can -- I'm not an
expert in this part of IB (or really any part, I guess :). Not sure why
a larger value isn't the default. No, it's not seconds -- check the
description of the MCA parameter:
4.096 microseconds * (2^btl_openib_ib_timeout)
Andrew
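
For anyone reading this later, here is a minimal sketch of what the formula quoted above works out to for the two values mentioned in this thread (the default of 10 and the suggested 20). The function name ib_timeout_seconds is made up for illustration and is not part of Open MPI; the only thing taken from the thread is the formula 4.096 microseconds * (2^btl_openib_ib_timeout).

```python
# Base unit from the MCA parameter description, in microseconds.
BASE_US = 4.096


def ib_timeout_seconds(exponent: int) -> float:
    """Return 4.096 us * 2^exponent, converted to seconds.

    Hypothetical helper for illustration only; 'exponent' is the value
    passed as btl_openib_ib_timeout.
    """
    return BASE_US * (2 ** exponent) / 1e6


if __name__ == "__main__":
    # Compare the default (10) with the value suggested in the thread (20).
    for exponent in (10, 20):
        seconds = ib_timeout_seconds(exponent)
        print(f"btl_openib_ib_timeout={exponent}: {seconds:.6f} s")
```

Under that formula, a setting of 10 is roughly 4.19 milliseconds and a setting of 20 is roughly 4.29 seconds, so the parameter is an exponent rather than a count of seconds.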