Indeed odd - I'm afraid this is just the kind of case that has been causing 
problems. I think I've figured out the problem, but I've been buried with my 
"day job" for the last few weeks and unable to pursue it.


On Aug 18, 2014, at 11:10 AM, Maxime Boissonneault 
<maxime.boissonnea...@calculquebec.ca> wrote:

> Ok, I confirm that with
> mpiexec -mca oob_tcp_if_include lo ring_c
> 
> it works.
> 
> It also works with
> mpiexec -mca oob_tcp_if_include ib0 ring_c
> 
> We have 4 interfaces on this node:
> - lo, the local loopback
> - ib0, InfiniBand
> - eth2, a management network
> - eth3, the public network
> 
> It seems that mpiexec attempts to use the two interfaces that do not work 
> (eth2, eth3) and does not use the two that do work (ib0 and lo). However, 
> according to the logs sent previously, it does see ib0 (though not lo), yet 
> does not attempt to use it.
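> 
> Presumably the exclude form would also do the trick if we would rather name 
> the bad interfaces than the good ones (untested here, and assuming 
> oob_tcp_if_exclude behaves like its include counterpart):
> 
> mpiexec -mca oob_tcp_if_exclude eth2,eth3 ring_c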
> 
> 
> On the compute nodes, we have eth0 (management), ib0 and lo, and it works. I 
> am unsure why it works on the compute nodes but not on the login nodes. The 
> only difference is the presence of a public interface on the login node.
> 
> Maxime
> 
> 
> On 2014-08-18 13:37, Ralph Castain wrote:
>> Yeah, there are some issues with the internal connection logic that need to 
>> get fixed. We haven't had many cases where it's been an issue, but a couple 
>> like this have cropped up - enough that I need to set aside some time to fix 
>> it.
>> 
>> My apologies for the problem.
>> 
>> 
>> On Aug 18, 2014, at 10:31 AM, Maxime Boissonneault 
>> <maxime.boissonnea...@calculquebec.ca> wrote:
>> 
>>> Indeed, that makes sense now.
>>> 
>>> Why isn't Open MPI attempting to connect over the local loopback for 
>>> processes on the same node? This used to work with 1.6.5.
>>> 
>>> Maxime
>>> 
>>> On 2014-08-18 13:11, Ralph Castain wrote:
>>>> Yep, that pinpointed the problem:
>>>> 
>>>> [helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING
>>>> [helios-login1:28558] [[63019,1],0]:tcp:complete_connect called for peer 
>>>> [[63019,0],0] on socket 11
>>>> [helios-login1:28558] [[63019,1],0]-[[63019,0],0] 
>>>> tcp_peer_complete_connect: connection failed: Connection refused (111)
>>>> [helios-login1:28558] [[63019,1],0] tcp_peer_close for [[63019,0],0] sd 11 
>>>> state CONNECTING
>>>> [helios-login1:28558] [[63019,1],0] tcp:lost connection called for peer 
>>>> [[63019,0],0]
>>>> 
>>>> 
>>>> The apps are trying to connect back to mpirun using the following 
>>>> addresses:
>>>> 
>>>> tcp://10.10.1.3,132.219.137.36,10.12.1.3:34237
>>>> 
>>>> The initial attempt is here:
>>>> 
>>>> [helios-login1:28558] [[63019,1],0] orte_tcp_peer_try_connect: attempting 
>>>> to connect to proc [[63019,0],0] on 10.10.1.3:34237 - 0 retries
>>>> 
>>>> I know there is a failover bug in the 1.8 series, so if that connection 
>>>> got rejected the proc would abort. Should we be using a different network? 
>>>> If so, telling us which one via the oob_tcp_if_include param would be the 
>>>> solution.
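>>>> 
>>>> For instance, something along these lines should pin the OOB to a network 
>>>> that does work, on both mpirun and the app procs (the interface names are 
>>>> just a guess for your setup, and if_include takes a comma-separated list):
>>>> 
>>>> mpirun -np 4 -mca oob_tcp_if_include ib0,lo ring_c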
>>>> 
>>>> 
>>>> On Aug 18, 2014, at 10:04 AM, Maxime Boissonneault 
>>>> <maxime.boissonnea...@calculquebec.ca> wrote:
>>>> 
>>>>> Here it is.
>>>>> 
>>>>> Maxime
>>>>> 
>>>>> On 2014-08-18 12:59, Ralph Castain wrote:
>>>>>> Ah...now that showed the problem. To pinpoint it better, please add
>>>>>> 
>>>>>> -mca oob_base_verbose 10
>>>>>> 
>>>>>> and I think we'll have it
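>>>>>> 
>>>>>> i.e., something along the lines of (keeping the other verbose flags from 
>>>>>> the last run is fine too):
>>>>>> 
>>>>>> mpirun -np 4 --mca plm_base_verbose 10 -mca oob_base_verbose 10 ring_c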
>>>>>> 
>>>>>> On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault 
>>>>>> <maxime.boissonnea...@calculquebec.ca> wrote:
>>>>>> 
>>>>>>> This is all on one node indeed.
>>>>>>> 
>>>>>>> Attached is the output of
>>>>>>> mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca 
>>>>>>> state_base_verbose 5 -mca errmgr_base_verbose 5  ring_c |& tee 
>>>>>>> output_ringc_verbose.txt
>>>>>>> 
>>>>>>> 
>>>>>>> Maxime
>>>>>>> 
>>>>>>> On 2014-08-18 12:48, Ralph Castain wrote:
>>>>>>>> This is all on one node, yes?
>>>>>>>> 
>>>>>>>> Try adding the following:
>>>>>>>> 
>>>>>>>> -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca 
>>>>>>>> errmgr_base_verbose 5
>>>>>>>> 
>>>>>>>> Lots of garbage, but it should tell us what is going on.
>>>>>>>> 
>>>>>>>> On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault 
>>>>>>>> <maxime.boissonnea...@calculquebec.ca> wrote:
>>>>>>>> 
>>>>>>>>> Here it is
>>>>>>>>> On 2014-08-18 12:30, Joshua Ladd wrote:
>>>>>>>>>> mpirun -np 4 --mca plm_base_verbose 10
>>>>>>>>> [mboisson@helios-login1 examples]$ mpirun -np 4 --mca 
>>>>>>>>> plm_base_verbose 10 ring_c
>>>>>>>>> [helios-login1:27853] mca: base: components_register: registering plm 
>>>>>>>>> components
>>>>>>>>> [helios-login1:27853] mca: base: components_register: found loaded 
>>>>>>>>> component isolated
>>>>>>>>> [helios-login1:27853] mca: base: components_register: component 
>>>>>>>>> isolated has no register or open function
>>>>>>>>> [helios-login1:27853] mca: base: components_register: found loaded 
>>>>>>>>> component rsh
>>>>>>>>> [helios-login1:27853] mca: base: components_register: component rsh 
>>>>>>>>> register function successful
>>>>>>>>> [helios-login1:27853] mca: base: components_register: found loaded 
>>>>>>>>> component tm
>>>>>>>>> [helios-login1:27853] mca: base: components_register: component tm 
>>>>>>>>> register function successful
>>>>>>>>> [helios-login1:27853] mca: base: components_open: opening plm 
>>>>>>>>> components
>>>>>>>>> [helios-login1:27853] mca: base: components_open: found loaded 
>>>>>>>>> component isolated
>>>>>>>>> [helios-login1:27853] mca: base: components_open: component isolated 
>>>>>>>>> open function successful
>>>>>>>>> [helios-login1:27853] mca: base: components_open: found loaded 
>>>>>>>>> component rsh
>>>>>>>>> [helios-login1:27853] mca: base: components_open: component rsh open 
>>>>>>>>> function successful
>>>>>>>>> [helios-login1:27853] mca: base: components_open: found loaded 
>>>>>>>>> component tm
>>>>>>>>> [helios-login1:27853] mca: base: components_open: component tm open 
>>>>>>>>> function successful
>>>>>>>>> [helios-login1:27853] mca:base:select: Auto-selecting plm components
>>>>>>>>> [helios-login1:27853] mca:base:select:(  plm) Querying component 
>>>>>>>>> [isolated]
>>>>>>>>> [helios-login1:27853] mca:base:select:(  plm) Query of component 
>>>>>>>>> [isolated] set priority to 0
>>>>>>>>> [helios-login1:27853] mca:base:select:(  plm) Querying component [rsh]
>>>>>>>>> [helios-login1:27853] mca:base:select:(  plm) Query of component 
>>>>>>>>> [rsh] set priority to 10
>>>>>>>>> [helios-login1:27853] mca:base:select:(  plm) Querying component [tm]
>>>>>>>>> [helios-login1:27853] mca:base:select:(  plm) Skipping component 
>>>>>>>>> [tm]. Query failed to return a module
>>>>>>>>> [helios-login1:27853] mca:base:select:(  plm) Selected component [rsh]
>>>>>>>>> [helios-login1:27853] mca: base: close: component isolated closed
>>>>>>>>> [helios-login1:27853] mca: base: close: unloading component isolated
>>>>>>>>> [helios-login1:27853] mca: base: close: component tm closed
>>>>>>>>> [helios-login1:27853] mca: base: close: unloading component tm
>>>>>>>>> [helios-login1:27853] mca: base: close: component rsh closed
>>>>>>>>> [helios-login1:27853] mca: base: close: unloading component rsh
>>>>>>>>> [mboisson@helios-login1 examples]$ echo $?
>>>>>>>>> 65
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Maxime
>>>>>>> 
>>>>>>> <output_ringc_verbose.txt.gz>
>>>>> 
>>>>> <output_ringc_verbose2.txt.gz>
>>> 
>>> 
> 
> 
> -- 
> ---------------------------------
> Maxime Boissonneault
> Computing Analyst - Calcul Québec, Université Laval
> Ph.D. in Physics
> 