George,
Using DR was suggested to see if it could find an error. The original
problem was using OB1, and HPL gave failed residuals. The hope was
that DR would pinpoint any problems. It did not, and HPL did not
progress at all (the GM counters incremented, but no tests
completed, successfully or otherwise).
Using the btl_gm_min_rdma_size flag, OB1 now completes without failed
residuals in HPL.
This flag sets the threshold at which the BTL starts fragmenting RDMAs
(not the threshold at which it starts using RDMA), per
$OMPI/ompi/mca/btl/btl.h:

    size_t btl_min_rdma_size; /**< threshold below which the BTL should not fragment */
    size_t btl_max_rdma_size; /**< maximum rdma fragment size supported by the BTL */

We believe it is the fragmenting of RDMAs on OSX that is causing the
issue. It does not happen on x86 or x86_64.
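
To make the effect of the threshold concrete, here is a minimal sketch in C. It is a simplified model, not Open MPI's actual scheduling code: count_rdma_fragments is a hypothetical helper, and the rule it applies (one piece below btl_min_rdma_size, otherwise pieces of at most btl_max_rdma_size) is an assumption based only on the two comments quoted above.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of Open MPI.
 *
 * Assumed model: a message shorter than min_rdma_size is sent as a single
 * RDMA; anything else is cut into pieces of at most max_rdma_size bytes.
 * Returns the number of pieces under that model.
 */
size_t count_rdma_fragments(size_t len, size_t min_rdma_size,
                            size_t max_rdma_size)
{
    if (len == 0) {
        return 0;                      /* nothing to send */
    }
    if (len < min_rdma_size || max_rdma_size == 0) {
        return 1;                      /* below threshold: no fragmenting */
    }
    /* ceiling division: the last fragment may be shorter than the rest */
    return (len + max_rdma_size - 1) / max_rdma_size;
}
```

Under this model, a 4 MB message with a 1 MB maximum fragment size is sent as four fragments when the threshold is 256 KB, but as a single RDMA when the threshold is raised to 10 MB. That is the code path the workaround avoids.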
Scott
On Dec 7, 2006, at 2:20 PM, George Bosilca wrote:
Something is not clear to me in this discussion. Sometimes the
subject was the DR PML and sometimes the OB1 PML. In fact I'm
completely in the dark... Which PML fails the HPCC test on the Mac?
When I look at the command line it looks like it should be OB1, not
DR...
george.
On Dec 7, 2006, at 1:59 PM, Brock Palen wrote:
That is wonderful; that fixes the observed problem when running with
OB1. Has a bug been filed to get RDMA working on Macs?
The only working MPI lib is MPICH-GM, as this problem also happens with
LAM-7.1.3.
So on track for one bug.
Would the person working on the DR PML like me to try any more tests?
Brock Palen
Center for Advanced Computing
bro...@umich.edu
(734)936-1985
On Dec 7, 2006, at 9:50 AM, Scott Atchley wrote:
On Dec 6, 2006, at 3:09 PM, Scott Atchley wrote:
Brock and Galen,
We are willing to assist. Our best guess is that OMPI is using the
code in a way different than MPICH-GM does. One of our other
developers who is more comfortable with the GM API is looking into
it.
We tried running HPCC with:

$ mpirun -np 4 -machinefile hosts -mca btl ^tcp -mca btl_gm_min_rdma_size $((10*1024*1024)) ./hpcc.ompi.gm

and HPL passes. The problem seems to be in the RDMA fragmenting code
on OSX. The boundary values at the edges of the fragments are not
correct.
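
One way to check for this kind of corruption outside of HPL is to send a buffer whose contents depend on byte position and look for the first byte that arrives wrong. The sketch below is only an illustration: fill_pattern and first_mismatch are hypothetical helpers, not part of Open MPI, GM, or HPCC, and the pattern itself is arbitrary.

```c
#include <stddef.h>

/* Hypothetical helper: expected value of the byte at a given offset.
 * Any position-dependent pattern works; this one repeats every 256 bytes. */
static unsigned char pattern_byte(size_t offset)
{
    return (unsigned char)((offset * 31u + 7u) & 0xffu);
}

/* Fill the send buffer with the position-dependent pattern. */
void fill_pattern(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        buf[i] = pattern_byte(i);
    }
}

/* Check a received buffer. Returns the offset of the first byte that does
 * not match the pattern, or len if the whole buffer is intact. */
size_t first_mismatch(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != pattern_byte(i)) {
            return i;
        }
    }
    return len;
}
```

If the sender fills a large buffer with fill_pattern and the receiver runs first_mismatch on what arrives, a mismatch offset that sits at or next to a multiple of the fragment size would point at the fragment edges rather than at the payload in general.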
Scott
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users