Yeah, there are some issues with the internal connection logic that need to be fixed. We haven't hit many cases where it's been a problem, but a couple like this have cropped up - enough that I need to set aside some time to fix it.

My apologies for the problem.


On Aug 18, 2014, at 10:31 AM, Maxime Boissonneault 
<maxime.boissonnea...@calculquebec.ca> wrote:

> Indeed, that makes sense now.
> 
> Why isn't OpenMPI attempting to connect over the local loopback when everything is on the same node? This used to work with 1.6.5.
> 
> Maxime
> 
> On 2014-08-18 13:11, Ralph Castain wrote:
>> Yep, that pinpointed the problem:
>> 
>> [helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING
>> [helios-login1:28558] [[63019,1],0]:tcp:complete_connect called for peer 
>> [[63019,0],0] on socket 11
>> [helios-login1:28558] [[63019,1],0]-[[63019,0],0] tcp_peer_complete_connect: 
>> connection failed: Connection refused (111)
>> [helios-login1:28558] [[63019,1],0] tcp_peer_close for [[63019,0],0] sd 11 
>> state CONNECTING
>> [helios-login1:28558] [[63019,1],0] tcp:lost connection called for peer 
>> [[63019,0],0]
>> 
>> 
>> The apps are trying to connect back to mpirun using the following addresses:
>> 
>> tcp://10.10.1.3,132.219.137.36,10.12.1.3:34237
>> 
>> The initial attempt is here
>> 
>> [helios-login1:28558] [[63019,1],0] orte_tcp_peer_try_connect: attempting to 
>> connect to proc [[63019,0],0] on 10.10.1.3:34237 - 0 retries
>> 
>> I know there is a failover bug in the 1.8 series, so if that first connection gets refused the proc aborts instead of trying the other addresses. Should we be using a different network? If so, telling us which one via the oob_tcp_if_include param would be the solution.
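>> 
>> For example, something along these lines should do it (the 10.10.1.0/24 subnet is just a guess based on the addresses above - substitute whichever network actually reaches mpirun):
>> 
>> mpirun -np 4 -mca oob_tcp_if_include 10.10.1.0/24 ring_c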
>> 
>> 
>> On Aug 18, 2014, at 10:04 AM, Maxime Boissonneault 
>> <maxime.boissonnea...@calculquebec.ca> wrote:
>> 
>>> Here it is.
>>> 
>>> Maxime
>>> 
>>> On 2014-08-18 12:59, Ralph Castain wrote:
>>>> Ah...now that showed the problem. To pinpoint it better, please add
>>>> 
>>>> -mca oob_base_verbose 10
>>>> 
>>>> and I think we'll have it
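>>>> 
>>>> i.e., the same command as before with that flag tacked on - something like this (the output filename is just a suggestion):
>>>> 
>>>> mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 -mca oob_base_verbose 10 ring_c |& tee output_ringc_verbose2.txt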
>>>> 
>>>> On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault 
>>>> <maxime.boissonnea...@calculquebec.ca> wrote:
>>>> 
>>>>> This is all on one node indeed.
>>>>> 
>>>>> Attached is the output of
>>>>> mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca 
>>>>> state_base_verbose 5 -mca errmgr_base_verbose 5  ring_c |& tee 
>>>>> output_ringc_verbose.txt
>>>>> 
>>>>> 
>>>>> Maxime
>>>>> 
>>>>> On 2014-08-18 12:48, Ralph Castain wrote:
>>>>>> This is all on one node, yes?
>>>>>> 
>>>>>> Try adding the following:
>>>>>> 
>>>>>> -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca 
>>>>>> errmgr_base_verbose 5
>>>>>> 
>>>>>> Lots of garbage, but it should tell us what is going on.
>>>>>> 
>>>>>> On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault 
>>>>>> <maxime.boissonnea...@calculquebec.ca> wrote:
>>>>>> 
>>>>>>> Here it is
>>>>>>> On 2014-08-18 12:30, Joshua Ladd wrote:
>>>>>>>> mpirun -np 4 --mca plm_base_verbose 10
>>>>>>> [mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 
>>>>>>> 10 ring_c
>>>>>>> [helios-login1:27853] mca: base: components_register: registering plm 
>>>>>>> components
>>>>>>> [helios-login1:27853] mca: base: components_register: found loaded 
>>>>>>> component isolated
>>>>>>> [helios-login1:27853] mca: base: components_register: component 
>>>>>>> isolated has no register or open function
>>>>>>> [helios-login1:27853] mca: base: components_register: found loaded 
>>>>>>> component rsh
>>>>>>> [helios-login1:27853] mca: base: components_register: component rsh 
>>>>>>> register function successful
>>>>>>> [helios-login1:27853] mca: base: components_register: found loaded 
>>>>>>> component tm
>>>>>>> [helios-login1:27853] mca: base: components_register: component tm 
>>>>>>> register function successful
>>>>>>> [helios-login1:27853] mca: base: components_open: opening plm components
>>>>>>> [helios-login1:27853] mca: base: components_open: found loaded 
>>>>>>> component isolated
>>>>>>> [helios-login1:27853] mca: base: components_open: component isolated 
>>>>>>> open function successful
>>>>>>> [helios-login1:27853] mca: base: components_open: found loaded 
>>>>>>> component rsh
>>>>>>> [helios-login1:27853] mca: base: components_open: component rsh open 
>>>>>>> function successful
>>>>>>> [helios-login1:27853] mca: base: components_open: found loaded 
>>>>>>> component tm
>>>>>>> [helios-login1:27853] mca: base: components_open: component tm open 
>>>>>>> function successful
>>>>>>> [helios-login1:27853] mca:base:select: Auto-selecting plm components
>>>>>>> [helios-login1:27853] mca:base:select:(  plm) Querying component 
>>>>>>> [isolated]
>>>>>>> [helios-login1:27853] mca:base:select:(  plm) Query of component 
>>>>>>> [isolated] set priority to 0
>>>>>>> [helios-login1:27853] mca:base:select:(  plm) Querying component [rsh]
>>>>>>> [helios-login1:27853] mca:base:select:(  plm) Query of component [rsh] 
>>>>>>> set priority to 10
>>>>>>> [helios-login1:27853] mca:base:select:(  plm) Querying component [tm]
>>>>>>> [helios-login1:27853] mca:base:select:(  plm) Skipping component [tm]. 
>>>>>>> Query failed to return a module
>>>>>>> [helios-login1:27853] mca:base:select:(  plm) Selected component [rsh]
>>>>>>> [helios-login1:27853] mca: base: close: component isolated closed
>>>>>>> [helios-login1:27853] mca: base: close: unloading component isolated
>>>>>>> [helios-login1:27853] mca: base: close: component tm closed
>>>>>>> [helios-login1:27853] mca: base: close: unloading component tm
>>>>>>> [helios-login1:27853] mca: base: close: component rsh closed
>>>>>>> [helios-login1:27853] mca: base: close: unloading component rsh
>>>>>>> [mboisson@helios-login1 examples]$ echo $?
>>>>>>> 65
>>>>>>> 
>>>>>>> 
>>>>>>> Maxime
>>>>> -- 
>>>>> ---------------------------------
>>>>> Maxime Boissonneault
>>>>> Computing analyst - Calcul Québec, Université Laval
>>>>> Ph.D. in physics
>>>>> 
>>>>> <output_ringc_verbose.txt.gz>
>>> 
>>> -- 
>>> ---------------------------------
>>> Maxime Boissonneault
>>> Computing analyst - Calcul Québec, Université Laval
>>> Ph.D. in physics
>>> 
>>> <output_ringc_verbose2.txt.gz>
> 
> 
> -- 
> ---------------------------------
> Maxime Boissonneault
> Computing analyst - Calcul Québec, Université Laval
> Ph.D. in physics
> 
