On Dec 6, 2006, at 2:29 PM, Brock Palen wrote:
I wonder if we can narrow this down a bit to perhaps a PML protocol
issue.
Start by disabling RDMA by using:
-mca btl_gm_flags 1
On the other hand, with OB1, using btl_gm_flags 1 fixed the error
problem with OMPI, which is a great first step.
mpirun -np 4 --mca btl_gm_flags 1 ./xhpl
Allowed HPL to run with no errors. I verified that the performance was
better than when run without gm
(by adding --mca btl ^gm).
So there is still a problem with DR, which I don't need, but I'm willing
to help test it.
Scott,
Can we look into why leaving RDMA on is causing a problem?
Brock
Brock and Galen,
We are willing to assist. Our best guess is that OMPI is using the
GM code differently than MPICH-GM does. One of our other
developers who is more comfortable with the GM API is looking into it.
Testing with HPCC, in addition to the HPL failed residuals, I am also
seeing these messages:
[3]: ERROR: from right: expected 2 and 3 as first and last byte, but
got 2 and 5 instead
[3]: ERROR: from right: expected 3 and 4 as first and last byte, but
got 3 and 7 instead
[1]: ERROR: from right: expected 4 and 5 as first and last byte, but
got 4 and 3 instead
[1]: ERROR: from right: expected 7 and 8 as first and last byte, but
got 7 and 5 instead
which is from $HPCC/src/bench_lat_bw_1.5.2.c.
Scott
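
For anyone comparing runs, the four HPCC messages quoted above can be tabulated with a small script. This is only a sketch: the regular expression is written against the message text exactly as it appears in this thread, and the names parse_errors, LINE_RE and LOG are made up for illustration; none of this comes from HPCC itself. In all four quoted messages the first byte matches and only the last byte differs. That may point at the tail of the transfer, but it is a guess and not a diagnosis.

```python
import re

# Hypothetical pattern, written against the message text quoted in this thread.
LINE_RE = re.compile(
    r"\[(\d+)\]: ERROR: from (\w+): expected (\d+) and (\d+) "
    r"as first and last byte, but got (\d+) and (\d+) instead"
)

# The four messages exactly as quoted in the email (wrapped across lines).
LOG = """
[3]: ERROR: from right: expected 2 and 3 as first and last byte, but
got 2 and 5 instead
[3]: ERROR: from right: expected 3 and 4 as first and last byte, but
got 3 and 7 instead
[1]: ERROR: from right: expected 4 and 5 as first and last byte, but
got 4 and 3 instead
[1]: ERROR: from right: expected 7 and 8 as first and last byte, but
got 7 and 5 instead
"""


def parse_errors(text):
    """Return one dict per error message found in text."""
    # Mail clients wrap the messages, so collapse all whitespace first.
    flat = " ".join(text.split())
    results = []
    for m in LINE_RE.finditer(flat):
        exp_first, exp_last, got_first, got_last = (int(g) for g in m.groups()[2:])
        results.append({
            "rank": int(m.group(1)),
            "side": m.group(2),
            "expected": (exp_first, exp_last),
            "got": (got_first, got_last),
            "first_ok": exp_first == got_first,
            "last_ok": exp_last == got_last,
        })
    return results


if __name__ == "__main__":
    for e in parse_errors(LOG):
        print(e["rank"], e["side"], e["expected"], e["got"],
              "first_ok" if e["first_ok"] else "first_BAD",
              "last_ok" if e["last_ok"] else "last_BAD")
```

Saving the output of a run to a file and passing its contents to parse_errors in place of LOG would allow the same comparison between runs with and without btl_gm_flags 1.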