Yep, that pinpointed the problem:

[helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING
[helios-login1:28558] [[63019,1],0]:tcp:complete_connect called for peer [[63019,0],0] on socket 11
[helios-login1:28558] [[63019,1],0]-[[63019,0],0] tcp_peer_complete_connect: connection failed: Connection refused (111)
[helios-login1:28558] [[63019,1],0] tcp_peer_close for [[63019,0],0] sd 11 state CONNECTING
[helios-login1:28558] [[63019,1],0] tcp:lost connection called for peer [[63019,0],0]
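
To pull just these connection events out of the full verbose output, a filter along these lines could help. This is only a sketch: the function name oob_failures is made up here, and the patterns come from the messages quoted in this mail, not from any complete list of OOB messages.

```sh
#!/bin/sh
# Hypothetical helper: keep only the OOB TCP connection attempts and
# failures from a verbose log read on stdin.
oob_failures() {
    grep -E 'connection failed|try_connect|lost connection'
}

# Usage, assuming the attachment has been saved locally:
#   gunzip -c output_ringc_verbose2.txt.gz | oob_failures
```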


The apps are trying to connect back to mpirun using the following addresses:

tcp://10.10.1.3,132.219.137.36,10.12.1.3:34237
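
Spelled out, that URI appears to be a comma-separated list of addresses followed by one shared port, which matches the first connection attempt in the log. A small sketch that expands it into one address:port pair per line, assuming that layout holds (the helper name list_endpoints is only illustrative):

```sh
#!/bin/sh
# URI exactly as reported in the verbose output.
uri='tcp://10.10.1.3,132.219.137.36,10.12.1.3:34237'

rest=${uri#tcp://}   # strip the scheme
port=${rest##*:}     # text after the last colon: the shared port
addrs=${rest%:*}     # text before the last colon: the address list

# Print one candidate endpoint per line.
list_endpoints() {
    old_ifs=$IFS
    IFS=,
    for a in $addrs; do
        printf '%s:%s\n' "$a" "$port"
    done
    IFS=$old_ifs
}

list_endpoints
```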

The initial attempt is here:

[helios-login1:28558] [[63019,1],0] orte_tcp_peer_try_connect: attempting to connect to proc [[63019,0],0] on 10.10.1.3:34237 - 0 retries

There is a known failover bug in the 1.8 series: if that first connection is 
refused, the proc aborts instead of trying the next address. Should we be using 
a different network? If so, telling us which one via the oob_tcp_if_include 
param would be the solution.
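
For example, a sketch of passing that param on the command line. The interface name eth0 is only a placeholder, since I can't tell from the log which interface is the right one on helios-login1, and oob_cmd is a made-up helper that prints the command so it can be checked before running:

```sh
#!/bin/sh
# Placeholder interface name: substitute the interface mpirun is actually
# reachable on (list candidates with `ip addr` or `ifconfig`).
OOB_IF=${OOB_IF:-eth0}

# Hypothetical helper: build the command line as text for review.
oob_cmd() {
    printf 'mpirun -np 4 -mca oob_tcp_if_include %s -mca oob_base_verbose 10 ring_c\n' "$1"
}

oob_cmd "$OOB_IF"               # print the command for inspection
# eval "$(oob_cmd "$OOB_IF")"   # uncomment to actually launch it
```

Keeping oob_base_verbose at 10 for the first run should show which address the procs try to connect to.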


On Aug 18, 2014, at 10:04 AM, Maxime Boissonneault 
<maxime.boissonnea...@calculquebec.ca> wrote:

> Here it is.
> 
> Maxime
> 
> On 2014-08-18 12:59, Ralph Castain wrote:
>> Ah...now that showed the problem. To pinpoint it better, please add
>> 
>> -mca oob_base_verbose 10
>> 
>> and I think we'll have it
>> 
>> On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault 
>> <maxime.boissonnea...@calculquebec.ca> wrote:
>> 
>>> This is all on one node indeed.
>>> 
>>> Attached is the output of
>>> mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 ring_c |& tee output_ringc_verbose.txt
>>> 
>>> 
>>> Maxime
>>> 
>>> On 2014-08-18 12:48, Ralph Castain wrote:
>>>> This is all on one node, yes?
>>>> 
>>>> Try adding the following:
>>>> 
>>>> -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5
>>>> 
>>>> That produces a lot of noise, but it should tell us what is going on.
>>>> 
>>>> On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault 
>>>> <maxime.boissonnea...@calculquebec.ca> wrote:
>>>> 
>>>>> Here it is
>>>>> On 2014-08-18 12:30, Joshua Ladd wrote:
>>>>>> mpirun -np 4 --mca plm_base_verbose 10
>>>>> [mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 ring_c
>>>>> [helios-login1:27853] mca: base: components_register: registering plm components
>>>>> [helios-login1:27853] mca: base: components_register: found loaded component isolated
>>>>> [helios-login1:27853] mca: base: components_register: component isolated has no register or open function
>>>>> [helios-login1:27853] mca: base: components_register: found loaded component rsh
>>>>> [helios-login1:27853] mca: base: components_register: component rsh register function successful
>>>>> [helios-login1:27853] mca: base: components_register: found loaded component tm
>>>>> [helios-login1:27853] mca: base: components_register: component tm register function successful
>>>>> [helios-login1:27853] mca: base: components_open: opening plm components
>>>>> [helios-login1:27853] mca: base: components_open: found loaded component isolated
>>>>> [helios-login1:27853] mca: base: components_open: component isolated open function successful
>>>>> [helios-login1:27853] mca: base: components_open: found loaded component rsh
>>>>> [helios-login1:27853] mca: base: components_open: component rsh open function successful
>>>>> [helios-login1:27853] mca: base: components_open: found loaded component tm
>>>>> [helios-login1:27853] mca: base: components_open: component tm open function successful
>>>>> [helios-login1:27853] mca:base:select: Auto-selecting plm components
>>>>> [helios-login1:27853] mca:base:select:(  plm) Querying component [isolated]
>>>>> [helios-login1:27853] mca:base:select:(  plm) Query of component [isolated] set priority to 0
>>>>> [helios-login1:27853] mca:base:select:(  plm) Querying component [rsh]
>>>>> [helios-login1:27853] mca:base:select:(  plm) Query of component [rsh] set priority to 10
>>>>> [helios-login1:27853] mca:base:select:(  plm) Querying component [tm]
>>>>> [helios-login1:27853] mca:base:select:(  plm) Skipping component [tm]. Query failed to return a module
>>>>> [helios-login1:27853] mca:base:select:(  plm) Selected component [rsh]
>>>>> [helios-login1:27853] mca: base: close: component isolated closed
>>>>> [helios-login1:27853] mca: base: close: unloading component isolated
>>>>> [helios-login1:27853] mca: base: close: component tm closed
>>>>> [helios-login1:27853] mca: base: close: unloading component tm
>>>>> [helios-login1:27853] mca: base: close: component rsh closed
>>>>> [helios-login1:27853] mca: base: close: unloading component rsh
>>>>> [mboisson@helios-login1 examples]$ echo $?
>>>>> 65
>>>>> 
>>>>> 
>>>>> Maxime
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> Link to this post: 
>>>>> http://www.open-mpi.org/community/lists/users/2014/08/25052.php
>>> 
>>> -- 
>>> ---------------------------------
>>> Maxime Boissonneault
>>> Computing analyst - Calcul Québec, Université Laval
>>> Ph.D. in physics
>>> 
>>> <output_ringc_verbose.txt.gz>
> 
> 
> -- 
> ---------------------------------
> Maxime Boissonneault
> Computing analyst - Calcul Québec, Université Laval
> Ph.D. in physics
> 
> <output_ringc_verbose2.txt.gz>
