On May 4, 2009, at 9:50 AM, jan wrote:
> Thank you Jeff. I have passed the mail to the IB vendor, Dell (the
> blade was ordered from Dell Taiwan), but he told me that he didn't
> understand "layer 0 diagnostics". Could you help us get more
> information about "layer 0 diagnostics"? Thanks again.
Layer 0 = your physical network layer. Specifically: ensure that your
IB network is actually functioning properly at both the physical and
driver layers. Cisco was an IB vendor for several years; I can tell
you from experience that it is *not* enough to just plug everything in
and run a few trivial tests to ensure that network traffic seems to be
passed properly. You need to have your vendor run a full set of layer
0 diagnostics to ensure that all the cables are good, all the HCAs are
good, all the drivers are functioning properly, etc. This involves
running diagnostic network testing patterns, checking various error
counters on the HCAs and IB switches, etc.
This is something that Dell should know how to do.
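As a rough illustration of the "checking various error counters" step,
here is a minimal Python sketch that scans a text dump of port counters
and reports the error counters that are nonzero. It assumes the dump
uses a "Name:.....value" layout similar to what tools such as perfquery
print; the counter names in WATCHED and the sample text are assumptions
made for illustration, not output captured from real hardware, so
adjust both to whatever your diagnostic tools actually report.

```python
import re

# Counters that often point at cable/HCA/switch-port trouble when nonzero.
# ASSUMPTION: these names follow the InfiniBand PortCounters attribute;
# your tools may spell them differently.
WATCHED = (
    "SymbolErrorCounter",
    "LinkErrorRecoveryCounter",
    "LinkDownedCounter",
    "PortRcvErrors",
    "LocalLinkIntegrityErrors",
)

# ASSUMPTION: one counter per line, written as "Name:....<integer>".
LINE_RE = re.compile(r"^\s*(\w+):\.*\s*(\d+)\s*$")


def parse_counters(text):
    """Return {counter_name: value} for every line matching LINE_RE."""
    counters = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def suspicious(counters, watched=WATCHED):
    """Return only the watched counters whose value is greater than zero."""
    return {name: counters[name] for name in watched if counters.get(name, 0) > 0}


if __name__ == "__main__":
    # Hypothetical sample text, made up for this example.
    sample = """\
# Port counters (hypothetical sample)
SymbolErrorCounter:..............12
LinkDownedCounter:...............0
PortRcvErrors:...................3
PortXmitData:....................123456
"""
    for name, value in suspicious(parse_counters(sample)).items():
        print(f"{name} = {value}")
```

A single snapshot is only a starting point: counters that keep
increasing while a traffic test runs are the stronger sign of a bad
cable, HCA, or switch port, so in practice you would compare two dumps
taken before and after the test.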
I say all this because the problem that you are seeing *seems* to be a
network-related problem, not an OMPI-related problem. One can never
know for sure, but it is fairly clear that the very first step in your
case is to verify that the network is functioning 100% properly.
FWIW: this was standard operating procedure when Cisco was selling IB
hardware.
--
Jeff Squyres
Cisco Systems