On Dec 7, 2006, at 3:14 PM, George Bosilca wrote:
On Dec 7, 2006, at 2:45 PM, Brock Palen wrote:
$ mpirun -np 4 -machinefile hosts -mca btl ^tcp -mca
btl_gm_min_rdma_size $((10*1024*1024)) ./hpcc.ompi.gm
and HPL passes. The problem seems to be in the RDMA fragmenting code
on OS X: the boundary values at the edges of the fragments are not
correct.
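One way to check whether the corruption tracks the RDMA threshold
would be to bisect btl_gm_min_rdma_size. The sizes below are only
illustrative, not values we have actually tried:

  # Sketch: rerun HPL while stepping the GM RDMA threshold up, to find
  # the smallest threshold at which HPL passes (i.e. the message size
  # range whose RDMA transfers get corrupted).
  for size in $((256*1024)) $((1024*1024)) $((4*1024*1024)) $((10*1024*1024)); do
    echo "=== btl_gm_min_rdma_size = $size ==="
    mpirun -np 4 -machinefile hosts -mca btl ^tcp \
        -mca btl_gm_min_rdma_size $size ./hpcc.ompi.gm
  done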
Here it looks like the OB1 PML was used. In order to get HPL to
complete successfully we need to set btl_gm_min_rdma_size to 10MB.
What I suspect is that 10MB is larger than any message HPL exchanges,
so adding this MCA parameter effectively disables the RDMA protocol
for GM.
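A quick way to see what the default threshold is on your build
(assuming a standard install with ompi_info in the path):

  # List the GM BTL parameters; btl_gm_min_rdma_size is among them
  ompi_info --param btl gm | grep rdma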
This seems to point to a more complex problem which might not be
related to the PML. If both PMLs (OB1 and DR) have a similar problem
when running on top of the GM BTL, it might indicate the problem is
down in the GM BTL. Can you confirm that HPL fails on this particular
cluster when running with OB1 and GM?
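Something along these lines should force that combination explicitly.
This is a sketch, reusing the hosts file and binary from your earlier
command, with the self BTL included for loopback sends:

  mpirun -np 4 -machinefile hosts -mca pml ob1 \
      -mca btl gm,self ./hpcc.ompi.gm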
Without modifying btl_gm_min_rdma_size, the run fails with bad
results when using OB1. If btl_gm_min_rdma_size is modified (which,
as you pointed out, basically disables RDMA), it no longer fails.
Using DR over Ethernet (--mca btl ^gm) or over GM (with and without
btl_gm_min_rdma_size modified) does not even start up: nothing on
stdout or stderr, and it never exits.
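For reference, forcing the DR PML is just a matter of -mca pml dr.
The exact lines below are a sketch rather than a copy of what I ran:

  # DR over TCP only (exclude GM)
  mpirun -np 4 -machinefile hosts -mca pml dr -mca btl ^gm ./hpcc.ompi.gm
  # DR over GM (exclude TCP)
  mpirun -np 4 -machinefile hosts -mca pml dr -mca btl ^tcp ./hpcc.ompi.gm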
Yes, there is a problem at the BTL level. But because the problem is
different and persists across both GM and TCP, I believe we are
looking at two separate issues. But I am not the person to make that
call.
Brock
Thanks,
george.