I thought about it again: there's probably no call to dat_ep_query()
*because* it returns wrong port numbers, so the port numbers saved by
the uDAPL BTL code itself are used instead.

I'll leave the debugging to those who know the code ... ;-)

Boris


Andrew Friedley wrote:
> OK, strange but good.  Yeah I wouldn't be surprised if something has 
> been changed, though I wouldn't know what, and I don't have time right 
> now to go digging :(  Maybe Don Kerr knows something?
> 
> Andrew
> 
> 
> Boris Bierbaum wrote:
>> I've run the whole IMB Benchmark Suite on 2, 3, and 4 nodes with 2
>> processes per node and --mca btl udapl,self. I didn't encounter any problems.
>>
>> The comment above line 197 says that dat_ep_query() returns wrong port
>> numbers (which it does indeed), but I can't find any call to
>> dat_ep_query() in the uDAPL BTL code. Maybe the comment is out of date?
>>
>> Boris
>>
>>
>> Andrew Friedley wrote:
>>> You say that fixes the problem; does it work even when running more than 
>>> one MPI process per node? (That is the case the hack fixes.) Simply 
>>> doing an mpirun with a -np parameter higher than the number of nodes you 
>>> have set up should trigger this case. Make sure to use '-mca btl 
>>> udapl,self' (i.e., not SM or anything else).
>>>
>>> Andrew
>>>
>>> Boris Bierbaum wrote:
>>>> It has been explained in a different thread on [ofa-general] that the
>>>> problem lies in a combination of the OpenIB-cma provider not setting the
>>>> local and remote port numbers on endpoints correctly and Open MPI
>>>> stepping over the IA to save the port number to circumvent this problem,
>>>> thereby confusing the provider.
>>>>
>>>> I commented out line 197 in ompi/mca/btl/udapl/btl_udapl.c (Open MPI
>>>> 1.2.1 release) and this fixes the problem. As the problem in the
>>>> provider is currently being fixed, the whole saving of the port number
>>>> in the uDAPL BTL code will be unnecessary in the future.
>>>>
>>>> Steve Wise wrote:
>>>>>>> Can the UDAPL OFED wizards shed any light on the error messages that  
>>>>>>> are listed below?  In particular, these seem to be worrisome:
>>>>>>>
>>>>>>>>  setup_listener Permission denied
>>>>>>>  setup_listener Address already in use
>>>>>> These failures are from rdma_cm_bind indicating the port is already 
>>>>>> bound to this IA address. How are you creating the service point?
>>>>>> dat_psp_create or dat_psp_create_any? If it is psp_create_any then you 
>>>>>> will see some failures until it gets to a free port. That is normal. 
>>>>>> Just make sure your create call returns DAT_SUCCESS.
>>>>>>
>>>>> Arlin, why doesn't dapl_psp_create_any() just pass a port of zero down
>>>>> and let the rdma-cma pick an available port number?
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> general mailing list
>>>>> gene...@lists.openfabrics.org
>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>>>>
>>>>> To unsubscribe, please visit 
>>>>> http://openib.org/mailman/listinfo/openib-general
>>>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
> 


-- 
|  _  RWTH | Boris Bierbaum
|_|_`_     | Lehrstuhl fuer Betriebssysteme
   | |_) _  | RWTH Aachen D-52056 Aachen
     |_)(_` | Tel: +49-241-80-27805
        ._) | Fax: +49-241-80-22339
