George,

Using DR was suggested to see if it could find an error. The original problem was with OB1, where HPL reported failed residuals. The hope was that DR would pinpoint the problem. It did not, and HPL made no progress at all (the GM counters incremented, but no test ever completed, whether passing or failing).

Using the btl_gm_min_rdma_size flag, OB1 now completes without failed residuals in HPL.

This flag sets the threshold at which the BTL starts fragmenting RDMAs (not the threshold at which it starts using RDMA), per $OMPI/ompi/mca/btl/btl.h:

size_t btl_min_rdma_size; /**< threshold below which the BTL should not fragment */
size_t btl_max_rdma_size; /**< maximum rdma fragment size supported by the BTL */

We believe it is the fragmenting of RDMAs on OSX that is causing the issue. It does not happen on x86 or x86_64.
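To make the role of the two fields concrete, here is a minimal sketch in C of how a threshold and a maximum fragment size could govern splitting. It is an illustration only, not Open MPI's actual code: frag_t and plan_fragments are hypothetical names, and the rule "send whole below the threshold, otherwise cut into pieces of at most the maximum size" is an assumption drawn from the two header comments.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor for one RDMA fragment (not an Open MPI type). */
typedef struct {
    size_t offset;  /* byte offset of the fragment within the transfer */
    size_t length;  /* number of bytes in the fragment */
} frag_t;

/*
 * Plan the fragments for a transfer of `len` bytes.
 *
 * Assumed rule, based only on the btl.h comments:
 *   - transfers shorter than min_rdma_size are sent as one piece;
 *   - longer transfers are cut into pieces of at most max_rdma_size.
 *
 * Writes at most out_cap entries to `out` and returns how many were
 * written. The caller is expected to supply enough capacity.
 */
size_t plan_fragments(size_t len, size_t min_rdma_size, size_t max_rdma_size,
                      frag_t *out, size_t out_cap)
{
    assert(max_rdma_size > 0);

    if (len == 0 || out_cap == 0) {
        return 0;
    }

    /* Below the threshold: do not fragment. */
    if (len < min_rdma_size) {
        out[0].offset = 0;
        out[0].length = len;
        return 1;
    }

    /* At or above the threshold: emit contiguous, non-overlapping pieces. */
    size_t n = 0;
    size_t off = 0;
    while (off < len && n < out_cap) {
        size_t chunk = len - off;
        if (chunk > max_rdma_size) {
            chunk = max_rdma_size;
        }
        out[n].offset = off;
        out[n].length = chunk;
        off += chunk;
        n++;
    }
    return n;
}
```

Under this reading, raising the threshold above the largest message the application sends means no transfer is ever fragmented, which is consistent with the workaround making the failed residuals disappear.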

Scott

On Dec 7, 2006, at 2:20 PM, George Bosilca wrote:

Something is not clear to me in this discussion. Sometimes the
subject was the DR PML and sometimes the OB1 PML. In fact, I'm
completely in the dark ... Which PML fails the HPCC test on the Mac?
When I look at the command line, it looks like it should be OB1, not DR ...

   george.

On Dec 7, 2006, at 1:59 PM, Brock Palen wrote:

That is wonderful; that fixes the observed problem when running with
OB1. Has a bug been filed to get RDMA working on Macs?
The only working MPI library is MPICH-GM, as this problem also happens
with LAM-7.1.3.

So on track for one bug.

Would the person working on the DR PML like me to try any more tests?

Brock Palen
Center for Advanced Computing
bro...@umich.edu
(734)936-1985


On Dec 7, 2006, at 9:50 AM, Scott Atchley wrote:

On Dec 6, 2006, at 3:09 PM, Scott Atchley wrote:

Brock and Galen,

We are willing to assist. Our best guess is that OMPI is using the
code in a way different than MPICH-GM does. One of our other
developers who is more comfortable with the GM API is looking into
it.

We tried running HPCC with:

$ mpirun -np 4 -machinefile hosts -mca btl ^tcp \
    -mca btl_gm_min_rdma_size $((10*1024*1024)) ./hpcc.ompi.gm

and HPL passes. The problem seems to be in the RDMA fragmenting code
on OSX. The boundary values at the edges of the fragments are not
correct.
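One way to look for this kind of corruption outside of HPL is to move a buffer with a known, position-dependent pattern in pieces and then locate the first byte that does not match. The sketch below is a stand-alone C illustration under stated assumptions: memcpy stands in for the real RDMA transfer, and fill_pattern, copy_in_fragments, and first_mismatch are hypothetical helpers, not part of GM or Open MPI.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Position-dependent byte, so a misplaced or stale byte is detectable. */
static unsigned char pattern_byte(size_t i)
{
    return (unsigned char)((i * 31u + 7u) & 0xffu);
}

/* Fill buf[0..len) with the pattern. */
void fill_pattern(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        buf[i] = pattern_byte(i);
    }
}

/*
 * Copy src to dst in pieces of at most frag_size bytes.
 * memcpy is only a stand-in here for a fragmented RDMA transfer.
 */
void copy_in_fragments(unsigned char *dst, const unsigned char *src,
                       size_t len, size_t frag_size)
{
    assert(frag_size > 0);
    for (size_t off = 0; off < len; off += frag_size) {
        size_t chunk = len - off;
        if (chunk > frag_size) {
            chunk = frag_size;
        }
        memcpy(dst + off, src + off, chunk);
    }
}

/* Return the index of the first byte that differs from the pattern,
 * or len if the whole buffer matches. */
size_t first_mismatch(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != pattern_byte(i)) {
            return i;
        }
    }
    return len;
}
```

If the index returned by first_mismatch sits at or next to a multiple of the fragment size, that points at the fragment edges rather than at the payload in general.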

Scott
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


