Yep, that pinpointed the problem:

[helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING
[helios-login1:28558] [[63019,1],0]:tcp:complete_connect called for peer [[63019,0],0] on socket 11
[helios-login1:28558] [[63019,1],0]-[[63019,0],0] tcp_peer_complete_connect: connection failed: Connection refused (111)
[helios-login1:28558] [[63019,1],0] tcp_peer_close for [[63019,0],0] sd 11 state CONNECTING
[helios-login1:28558] [[63019,1],0] tcp:lost connection called for peer [[63019,0],0]

The apps are trying to connect back to mpirun using the following addresses:

tcp://10.10.1.3,132.219.137.36,10.12.1.3:34237

The initial attempt is here:

[helios-login1:28558] [[63019,1],0] orte_tcp_peer_try_connect: attempting to connect to proc [[63019,0],0] on 10.10.1.3:34237 - 0 retries

I know there is a failover bug in the 1.8 series, so if that connection got rejected the proc would abort.

Should we be using a different network? If so, telling us via the oob_tcp_if_include param would be the solution.

On Aug 18, 2014, at 10:04 AM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote:

> Here it is.
>
> Maxime
>
> On 2014-08-18 12:59, Ralph Castain wrote:
>> Ah...now that showed the problem. To pinpoint it better, please add
>>
>> -mca oob_base_verbose 10
>>
>> and I think we'll have it
>>
>> On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote:
>>
>>> This is all on one node indeed.
>>>
>>> Attached is the output of
>>> mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 ring_c |& tee output_ringc_verbose.txt
>>>
>>>
>>> Maxime
>>>
>>> On 2014-08-18 12:48, Ralph Castain wrote:
>>>> This is all on one node, yes?
>>>>
>>>> Try adding the following:
>>>>
>>>> -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5
>>>>
>>>> Lot of garbage, but should tell us what is going on.
>>>>
>>>> On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote:
>>>>
>>>>> Here it is
>>>>> On 2014-08-18 12:30, Joshua Ladd wrote:
>>>>>> mpirun -np 4 --mca plm_base_verbose 10
>>>>> [mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 ring_c
>>>>> [helios-login1:27853] mca: base: components_register: registering plm components
>>>>> [helios-login1:27853] mca: base: components_register: found loaded component isolated
>>>>> [helios-login1:27853] mca: base: components_register: component isolated has no register or open function
>>>>> [helios-login1:27853] mca: base: components_register: found loaded component rsh
>>>>> [helios-login1:27853] mca: base: components_register: component rsh register function successful
>>>>> [helios-login1:27853] mca: base: components_register: found loaded component tm
>>>>> [helios-login1:27853] mca: base: components_register: component tm register function successful
>>>>> [helios-login1:27853] mca: base: components_open: opening plm components
>>>>> [helios-login1:27853] mca: base: components_open: found loaded component isolated
>>>>> [helios-login1:27853] mca: base: components_open: component isolated open function successful
>>>>> [helios-login1:27853] mca: base: components_open: found loaded component rsh
>>>>> [helios-login1:27853] mca: base: components_open: component rsh open function successful
>>>>> [helios-login1:27853] mca: base: components_open: found loaded component tm
>>>>> [helios-login1:27853] mca: base: components_open: component tm open function successful
>>>>> [helios-login1:27853] mca:base:select: Auto-selecting plm components
>>>>> [helios-login1:27853] mca:base:select:( plm) Querying component [isolated]
>>>>> [helios-login1:27853] mca:base:select:( plm) Query of component [isolated] set priority to 0
>>>>> [helios-login1:27853] mca:base:select:( plm) Querying component [rsh]
>>>>> [helios-login1:27853] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>>> [helios-login1:27853] mca:base:select:( plm) Querying component [tm]
>>>>> [helios-login1:27853] mca:base:select:( plm) Skipping component [tm]. Query failed to return a module
>>>>> [helios-login1:27853] mca:base:select:( plm) Selected component [rsh]
>>>>> [helios-login1:27853] mca: base: close: component isolated closed
>>>>> [helios-login1:27853] mca: base: close: unloading component isolated
>>>>> [helios-login1:27853] mca: base: close: component tm closed
>>>>> [helios-login1:27853] mca: base: close: unloading component tm
>>>>> [helios-login1:27853] mca: base: close: component rsh closed
>>>>> [helios-login1:27853] mca: base: close: unloading component rsh
>>>>> [mboisson@helios-login1 examples]$ echo $?
>>>>> 65
>>>>>
>>>>>
>>>>> Maxime
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25052.php
>>>
>>> --
>>> ---------------------------------
>>> Maxime Boissonneault
>>> Computing analyst - Calcul Québec, Université Laval
>>> Ph.D. in physics
>>>
>>> <output_ringc_verbose.txt.gz>
>
>
> --
> ---------------------------------
> Maxime Boissonneault
> Computing analyst - Calcul Québec, Université Laval
> Ph.D. in physics
>
> <output_ringc_verbose2.txt.gz>
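
For reference, the oob_tcp_if_include suggestion at the top of this message could be tried roughly as follows. This is only a sketch: the interface name eth0 is a placeholder I made up, not something taken from the logs, so substitute whichever interface (or network) mpirun is actually reachable on. The script only assembles and prints the command so it can be reviewed before running it by hand.

```shell
#!/bin/sh
# Sketch: restrict Open MPI's out-of-band (OOB) TCP channel to one network.
# ASSUMPTION: "eth0" is a hypothetical interface name, not taken from the
# thread; replace it with the interface that is reachable on your node.
OOB_IF="eth0"

# oob_tcp_if_include is the MCA parameter named in the reply;
# oob_base_verbose 10 keeps the connection diagnostics used earlier.
MPIRUN_CMD="mpirun -np 4 -mca oob_tcp_if_include ${OOB_IF} -mca oob_base_verbose 10 ring_c"

# Print the command for review instead of executing it.
echo "${MPIRUN_CMD}"
```

If the refused connection on 10.10.1.3 goes away with the right interface selected, that would be consistent with the failover explanation above.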