Yeah, there are some issues with the internal connection logic that need to be fixed. It hasn't come up often, but a couple of cases like this one have cropped up - enough that I need to set aside some time to fix it.
My apologies for the problem.

On Aug 18, 2014, at 10:31 AM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote:

> Indeed, that makes sense now.
>
> Why isn't OpenMPI attempting to connect with the local loop for the same node? This used to work with 1.6.5.
>
> Maxime
>
> On 2014-08-18 13:11, Ralph Castain wrote:
>> Yep, that pinpointed the problem:
>>
>> [helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING
>> [helios-login1:28558] [[63019,1],0]:tcp:complete_connect called for peer [[63019,0],0] on socket 11
>> [helios-login1:28558] [[63019,1],0]-[[63019,0],0] tcp_peer_complete_connect: connection failed: Connection refused (111)
>> [helios-login1:28558] [[63019,1],0] tcp_peer_close for [[63019,0],0] sd 11 state CONNECTING
>> [helios-login1:28558] [[63019,1],0] tcp:lost connection called for peer [[63019,0],0]
>>
>> The apps are trying to connect back to mpirun using the following addresses:
>>
>> tcp://10.10.1.3,132.219.137.36,10.12.1.3:34237
>>
>> The initial attempt is here:
>>
>> [helios-login1:28558] [[63019,1],0] orte_tcp_peer_try_connect: attempting to connect to proc [[63019,0],0] on 10.10.1.3:34237 - 0 retries
>>
>> I know there is a failover bug in the 1.8 series, and so if that connection got rejected the proc would abort. Should we be using a different network? If so, telling us via the oob_tcp_if_include param would be the solution.
>>
>> On Aug 18, 2014, at 10:04 AM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote:
>>
>>> Here it is.
>>>
>>> Maxime
>>>
>>> On 2014-08-18 12:59, Ralph Castain wrote:
>>>> Ah... now that showed the problem. To pinpoint it better, please add
>>>>
>>>> -mca oob_base_verbose 10
>>>>
>>>> and I think we'll have it.
>>>>
>>>> On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote:
>>>>
>>>>> This is all on one node indeed.
>>>>>
>>>>> Attached is the output of
>>>>>
>>>>> mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 ring_c |& tee output_ringc_verbose.txt
>>>>>
>>>>> Maxime
>>>>>
>>>>> On 2014-08-18 12:48, Ralph Castain wrote:
>>>>>> This is all on one node, yes?
>>>>>>
>>>>>> Try adding the following:
>>>>>>
>>>>>> -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5
>>>>>>
>>>>>> Lot of garbage, but should tell us what is going on.
>>>>>>
>>>>>> On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote:
>>>>>>
>>>>>>> Here it is.
>>>>>>>
>>>>>>> On 2014-08-18 12:30, Joshua Ladd wrote:
>>>>>>>> mpirun -np 4 --mca plm_base_verbose 10
>>>>>>>
>>>>>>> [mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 ring_c
>>>>>>> [helios-login1:27853] mca: base: components_register: registering plm components
>>>>>>> [helios-login1:27853] mca: base: components_register: found loaded component isolated
>>>>>>> [helios-login1:27853] mca: base: components_register: component isolated has no register or open function
>>>>>>> [helios-login1:27853] mca: base: components_register: found loaded component rsh
>>>>>>> [helios-login1:27853] mca: base: components_register: component rsh register function successful
>>>>>>> [helios-login1:27853] mca: base: components_register: found loaded component tm
>>>>>>> [helios-login1:27853] mca: base: components_register: component tm register function successful
>>>>>>> [helios-login1:27853] mca: base: components_open: opening plm components
>>>>>>> [helios-login1:27853] mca: base: components_open: found loaded component isolated
>>>>>>> [helios-login1:27853] mca: base: components_open: component isolated open function successful
>>>>>>> [helios-login1:27853] mca: base: components_open: found loaded component rsh
>>>>>>> [helios-login1:27853] mca: base: components_open: component rsh open function successful
>>>>>>> [helios-login1:27853] mca: base: components_open: found loaded component tm
>>>>>>> [helios-login1:27853] mca: base: components_open: component tm open function successful
>>>>>>> [helios-login1:27853] mca:base:select: Auto-selecting plm components
>>>>>>> [helios-login1:27853] mca:base:select:( plm) Querying component [isolated]
>>>>>>> [helios-login1:27853] mca:base:select:( plm) Query of component [isolated] set priority to 0
>>>>>>> [helios-login1:27853] mca:base:select:( plm) Querying component [rsh]
>>>>>>> [helios-login1:27853] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>>>>> [helios-login1:27853] mca:base:select:( plm) Querying component [tm]
>>>>>>> [helios-login1:27853] mca:base:select:( plm) Skipping component [tm]. Query failed to return a module
>>>>>>> [helios-login1:27853] mca:base:select:( plm) Selected component [rsh]
>>>>>>> [helios-login1:27853] mca: base: close: component isolated closed
>>>>>>> [helios-login1:27853] mca: base: close: unloading component isolated
>>>>>>> [helios-login1:27853] mca: base: close: component tm closed
>>>>>>> [helios-login1:27853] mca: base: close: unloading component tm
>>>>>>> [helios-login1:27853] mca: base: close: component rsh closed
>>>>>>> [helios-login1:27853] mca: base: close: unloading component rsh
>>>>>>> [mboisson@helios-login1 examples]$ echo $?
>>>>>>> 65
>>>>>>>
>>>>>>> Maxime
>>>>>
>>>>> --
>>>>> ---------------------------------
>>>>> Maxime Boissonneault
>>>>> Computing Analyst - Calcul Québec, Université Laval
>>>>> Ph.D. in Physics
>>>>>
>>>>> <output_ringc_verbose.txt.gz>
>>>
>>> --
>>> ---------------------------------
>>> Maxime Boissonneault
>>> Computing Analyst - Calcul Québec, Université Laval
>>> Ph.D. in Physics
>>>
>>> <output_ringc_verbose2.txt.gz>
>
> --
> ---------------------------------
> Maxime Boissonneault
> Computing Analyst - Calcul Québec, Université Laval
> Ph.D. in Physics
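
For anyone who lands on this thread with the same symptom, the full diagnostic run assembled from the flags requested above would look roughly like the following. It is simply the command Maxime already ran plus the oob_base_verbose level Ralph asked for later; ring_c is the test program from the Open MPI examples directory, and the output file name is only an illustration:

    mpirun -np 4 --mca plm_base_verbose 10 \
        -mca odls_base_verbose 5 -mca state_base_verbose 5 \
        -mca errmgr_base_verbose 5 -mca oob_base_verbose 10 \
        ring_c |& tee output_ringc_verbose.txt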
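
A minimal sketch of the workaround Ralph alludes to above is to restrict the out-of-band TCP channel to a particular interface with the oob_tcp_if_include MCA parameter. Which interface is the right one for this cluster is exactly the open question in the thread, so the names below (lo, and the hypothetical ib0) are assumptions to be replaced with an interface the ranks can actually use to reach mpirun:

    # On the command line; "lo" assumes the node's loopback interface is usable
    mpirun --mca oob_tcp_if_include lo -np 4 ring_c

    # Or via the environment, for every subsequent mpirun in the session
    # (ib0 is a hypothetical interface name)
    export OMPI_MCA_oob_tcp_if_include=ib0

The companion parameter oob_tcp_if_exclude can be used instead to rule out the interfaces that are refusing connections.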