Indeed odd - I'm afraid this is just the kind of case that has been causing problems. I think I've figured out the cause, but I've been buried with my "day job" for the last few weeks and haven't been able to pursue it.
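In the meantime, the workaround you already confirmed is the practical one: restrict the OOB to an interface that actually works. A rough sketch of a few equivalent ways to set it, using the ib0 name from your message (lo also worked for the single-node case; adjust for your own nodes):

    # on the command line, as you tested
    mpiexec -mca oob_tcp_if_include ib0 ring_c

    # or through the environment
    export OMPI_MCA_oob_tcp_if_include=ib0
    mpiexec ring_c

    # or persistently, per user, in $HOME/.openmpi/mca-params.conf
    oob_tcp_if_include = ib0

The same line can go in the system-wide openmpi-mca-params.conf under the install's etc/ directory if you want it to apply to everyone on the login node. That should hold things over until the failover logic in the 1.8 series gets fixed.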
On Aug 18, 2014, at 11:10 AM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote:

> Ok, I confirm that with
> mpiexec -mca oob_tcp_if_include lo ring_c
>
> it works.
>
> It also works with
> mpiexec -mca oob_tcp_if_include ib0 ring_c
>
> We have 4 interfaces on this node:
> - lo, the local loop
> - ib0, infiniband
> - eth2, a management network
> - eth3, the public network
>
> It seems that mpiexec attempts to use the two addresses that do not work (eth2, eth3) and does not use the two that do work (ib0 and lo). However, according to the logs sent previously, it does see ib0 (despite not seeing lo), but does not attempt to use it.
>
> On the compute nodes, we have eth0 (management), ib0 and lo, and it works. I am unsure why it works on the compute nodes and not on the login nodes. The only difference is the presence of a public interface on the login node.
>
> Maxime
>
> On 2014-08-18 13:37, Ralph Castain wrote:
>> Yeah, there are some issues with the internal connection logic that need to get fixed. We haven't had many cases where it's been an issue, but a couple like this have cropped up - enough that I need to set aside some time to fix it.
>>
>> My apologies for the problem.
>>
>> On Aug 18, 2014, at 10:31 AM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote:
>>
>>> Indeed, that makes sense now.
>>>
>>> Why isn't OpenMPI attempting to connect via the local loop for the same node? This used to work with 1.6.5.
>>>
>>> Maxime
>>>
>>> On 2014-08-18 13:11, Ralph Castain wrote:
>>>> Yep, that pinpointed the problem:
>>>>
>>>> [helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING
>>>> [helios-login1:28558] [[63019,1],0]:tcp:complete_connect called for peer [[63019,0],0] on socket 11
>>>> [helios-login1:28558] [[63019,1],0]-[[63019,0],0] tcp_peer_complete_connect: connection failed: Connection refused (111)
>>>> [helios-login1:28558] [[63019,1],0] tcp_peer_close for [[63019,0],0] sd 11 state CONNECTING
>>>> [helios-login1:28558] [[63019,1],0] tcp:lost connection called for peer [[63019,0],0]
>>>>
>>>> The apps are trying to connect back to mpirun using the following addresses:
>>>>
>>>> tcp://10.10.1.3,132.219.137.36,10.12.1.3:34237
>>>>
>>>> The initial attempt is here:
>>>>
>>>> [helios-login1:28558] [[63019,1],0] orte_tcp_peer_try_connect: attempting to connect to proc [[63019,0],0] on 10.10.1.3:34237 - 0 retries
>>>>
>>>> I know there is a failover bug in the 1.8 series, and so if that connection got rejected the proc would abort. Should we be using a different network? If so, telling us via the oob_tcp_if_include param would be the solution.
>>>>
>>>> On Aug 18, 2014, at 10:04 AM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote:
>>>>
>>>>> Here it is.
>>>>>
>>>>> Maxime
>>>>>
>>>>> On 2014-08-18 12:59, Ralph Castain wrote:
>>>>>> Ah... now that showed the problem. To pinpoint it better, please add
>>>>>>
>>>>>> -mca oob_base_verbose 10
>>>>>>
>>>>>> and I think we'll have it.
>>>>>>
>>>>>> On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote:
>>>>>>
>>>>>>> This is all on one node indeed.
>>>>>>>
>>>>>>> Attached is the output of
>>>>>>> mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 ring_c |& tee output_ringc_verbose.txt
>>>>>>>
>>>>>>> Maxime
>>>>>>>
>>>>>>> On 2014-08-18 12:48, Ralph Castain wrote:
>>>>>>>> This is all on one node, yes?
>>>>>>>>
>>>>>>>> Try adding the following:
>>>>>>>>
>>>>>>>> -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5
>>>>>>>>
>>>>>>>> Lots of garbage, but it should tell us what is going on.
>>>>>>>>
>>>>>>>> On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote:
>>>>>>>>
>>>>>>>>> Here it is.
>>>>>>>>> On 2014-08-18 12:30, Joshua Ladd wrote:
>>>>>>>>>> mpirun -np 4 --mca plm_base_verbose 10
>>>>>>>>> [mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 ring_c
>>>>>>>>> [helios-login1:27853] mca: base: components_register: registering plm components
>>>>>>>>> [helios-login1:27853] mca: base: components_register: found loaded component isolated
>>>>>>>>> [helios-login1:27853] mca: base: components_register: component isolated has no register or open function
>>>>>>>>> [helios-login1:27853] mca: base: components_register: found loaded component rsh
>>>>>>>>> [helios-login1:27853] mca: base: components_register: component rsh register function successful
>>>>>>>>> [helios-login1:27853] mca: base: components_register: found loaded component tm
>>>>>>>>> [helios-login1:27853] mca: base: components_register: component tm register function successful
>>>>>>>>> [helios-login1:27853] mca: base: components_open: opening plm components
>>>>>>>>> [helios-login1:27853] mca: base: components_open: found loaded component isolated
>>>>>>>>> [helios-login1:27853] mca: base: components_open: component isolated open function successful
>>>>>>>>> [helios-login1:27853] mca: base: components_open: found loaded component rsh
>>>>>>>>> [helios-login1:27853] mca: base: components_open: component rsh open function successful
>>>>>>>>> [helios-login1:27853] mca: base: components_open: found loaded component tm
>>>>>>>>> [helios-login1:27853] mca: base: components_open: component tm open function successful
>>>>>>>>> [helios-login1:27853] mca:base:select: Auto-selecting plm components
>>>>>>>>> [helios-login1:27853] mca:base:select:( plm) Querying component [isolated]
>>>>>>>>> [helios-login1:27853] mca:base:select:( plm) Query of component [isolated] set priority to 0
>>>>>>>>> [helios-login1:27853] mca:base:select:( plm) Querying component [rsh]
>>>>>>>>> [helios-login1:27853] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>>>>>>> [helios-login1:27853] mca:base:select:( plm) Querying component [tm]
>>>>>>>>> [helios-login1:27853] mca:base:select:( plm) Skipping component [tm]. Query failed to return a module
>>>>>>>>> [helios-login1:27853] mca:base:select:( plm) Selected component [rsh]
>>>>>>>>> [helios-login1:27853] mca: base: close: component isolated closed
>>>>>>>>> [helios-login1:27853] mca: base: close: unloading component isolated
>>>>>>>>> [helios-login1:27853] mca: base: close: component tm closed
>>>>>>>>> [helios-login1:27853] mca: base: close: unloading component tm
>>>>>>>>> [helios-login1:27853] mca: base: close: component rsh closed
>>>>>>>>> [helios-login1:27853] mca: base: close: unloading component rsh
>>>>>>>>> [mboisson@helios-login1 examples]$ echo $?
>>>>>>>>> 65
>>>>>>>>>
>>>>>>>>> Maxime
>>>>>>>
>>>>>>> <output_ringc_verbose.txt.gz>
>>>>>
>>>>> <output_ringc_verbose2.txt.gz>
>
> --
> ---------------------------------
> Maxime Boissonneault
> Computing analyst - Calcul Québec, Université Laval
> Ph.D. in Physics
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25060.php