On Dec 7, 2006, at 2:45 PM, Brock Palen wrote:


$ mpirun -np 4 -machinefile hosts -mca btl ^tcp -mca btl_gm_min_rdma_size $((10*1024*1024)) ./hpcc.ompi.gm

and HPL passes. The problem seems to be in the RDMA fragmenting code
on OSX. The boundary values at the edges of the fragments are not
correct.

Here it looks like the OB1 PML was used. To get HPL to complete successfully we need to set btl_gm_min_rdma_size to 10MB. I suspect that 10MB is larger than any message HPL exchanges, so adding this MCA parameter effectively disables the RDMA protocol for GM.
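
To make the reasoning concrete, here is a minimal sketch of what that threshold works out to. It assumes btl_gm_min_rdma_size acts as a lower bound on the message size for which the RDMA path is considered (whether the comparison is strict or not is an assumption here), and the message sizes in the loop are hypothetical, picked only for illustration, not measured from HPL:

```shell
# The shell expands the arithmetic before mpirun ever sees the value.
min_rdma_size=$((10*1024*1024))
echo "btl_gm_min_rdma_size = $min_rdma_size bytes"

# Hypothetical message sizes in bytes (illustration only, not taken from HPL).
for msg_size in 65536 1048576 8388608; do
    # Assumed rule: RDMA is only considered at or above the threshold.
    if [ "$msg_size" -ge "$min_rdma_size" ]; then
        echo "$msg_size: at or above threshold, RDMA path eligible"
    else
        echo "$msg_size: below threshold, RDMA path not used"
    fi
done
```

If every message stays under 10485760 bytes, as all three sizes above do, the RDMA path is never taken, which would be consistent with the run passing.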

This seems to point to a more complex problem which might not be related to the PML. If both PMLs (OB1 and DR) have a similar problem when running on top of the GM BTL, it might indicate that the problem is down in the GM BTL. Can you confirm that HPL fails on this particular cluster when running with OB1 and GM?

  Thanks,
    george.
