OK, I confirm that with
mpiexec -mca oob_tcp_if_include lo ring_c

it works.

It also works with
mpiexec -mca oob_tcp_if_include ib0 ring_c
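
For reference, the same setting can be made persistent instead of being passed on each command line. A sketch, assuming the standard per-user MCA parameter file location:

export OMPI_MCA_oob_tcp_if_include=ib0                            # per-shell environment variable
echo "oob_tcp_if_include = ib0" >> ~/.openmpi/mca-params.conf     # or per-user parameter file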

We have 4 interfaces on this node.
- lo, the loopback interface
- ib0, InfiniBand
- eth2, a management network
- eth3, the public network

It seems that mpiexec attempts to use the two interfaces that do not work (eth2 and eth3) and does not use the two that do (ib0 and lo). However, according to the logs sent previously, it does see ib0 (though not lo), yet it never attempts to use it.
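
Given that observation, excluding the bad interfaces instead of including the good ones should also work as a workaround (untested here; note that the include and exclude parameters are mutually exclusive, so set only one):

mpiexec -mca oob_tcp_if_exclude eth2,eth3 ring_c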


On the compute nodes, we have eth0 (management), ib0, and lo, and everything works there. I am unsure why it works on the compute nodes but not on the login nodes; the only difference is the presence of a public interface on the login nodes.

Maxime


On 2014-08-18 13:37, Ralph Castain wrote:
Yeah, there are some issues with the internal connection logic that need to get 
fixed. We haven't had many cases where it's been an issue, but a couple like 
this have cropped up - enough that I need to set aside some time to fix it.

My apologies for the problem.


On Aug 18, 2014, at 10:31 AM, Maxime Boissonneault 
<maxime.boissonnea...@calculquebec.ca> wrote:

Indeed, that makes sense now.

Why isn't Open MPI attempting to connect over the loopback interface for same-node communication? This used to work with 1.6.5.

Maxime

On 2014-08-18 13:11, Ralph Castain wrote:
Yep, that pinpointed the problem:

[helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING
[helios-login1:28558] [[63019,1],0]:tcp:complete_connect called for peer [[63019,0],0] on socket 11
[helios-login1:28558] [[63019,1],0]-[[63019,0],0] tcp_peer_complete_connect: connection failed: Connection refused (111)
[helios-login1:28558] [[63019,1],0] tcp_peer_close for [[63019,0],0] sd 11 state CONNECTING
[helios-login1:28558] [[63019,1],0] tcp:lost connection called for peer [[63019,0],0]


The apps are trying to connect back to mpirun using the following addresses:

tcp://10.10.1.3,132.219.137.36,10.12.1.3:34237
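
To map those addresses back to interface names on your node, something like the standard iproute2 one-line-per-address listing helps:

ip -4 -o addr show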

The initial attempt is here:

[helios-login1:28558] [[63019,1],0] orte_tcp_peer_try_connect: attempting to connect to proc [[63019,0],0] on 10.10.1.3:34237 - 0 retries

I know there is a failover bug in the 1.8 series, and so if that connection got 
rejected the proc would abort. Should we be using a different network? If so, 
telling us via the oob_tcp_if_include param would be the solution.
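
You can also double-check what the oob/tcp component is currently set to with ompi_info; something like the following should list its parameters (on 1.7 and later the --level flag may be needed to show them all):

ompi_info --param oob tcp --level 9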


On Aug 18, 2014, at 10:04 AM, Maxime Boissonneault 
<maxime.boissonnea...@calculquebec.ca> wrote:

Here it is.

Maxime

On 2014-08-18 12:59, Ralph Castain wrote:
Ah...now that showed the problem. To pinpoint it better, please add

-mca oob_base_verbose 10

and I think we'll have it.
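
For reference, the full invocation with all the verbosity flags gathered so far would look something like this (the output file name is just a suggestion):

mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 \
    -mca state_base_verbose 5 -mca errmgr_base_verbose 5 \
    -mca oob_base_verbose 10 ring_c |& tee output_ringc_verbose2.txt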

On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault 
<maxime.boissonnea...@calculquebec.ca> wrote:

This is all on one node, indeed.

Attached is the output of
mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 \
    -mca state_base_verbose 5 -mca errmgr_base_verbose 5 \
    ring_c |& tee output_ringc_verbose.txt


Maxime

On 2014-08-18 12:48, Ralph Castain wrote:
This is all on one node, yes?

Try adding the following:

-mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5

Lots of garbage, but it should tell us what is going on.

On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault 
<maxime.boissonnea...@calculquebec.ca> wrote:

Here it is
On 2014-08-18 12:30, Joshua Ladd wrote:
mpirun -np 4 --mca plm_base_verbose 10
[mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 ring_c
[helios-login1:27853] mca: base: components_register: registering plm components
[helios-login1:27853] mca: base: components_register: found loaded component isolated
[helios-login1:27853] mca: base: components_register: component isolated has no register or open function
[helios-login1:27853] mca: base: components_register: found loaded component rsh
[helios-login1:27853] mca: base: components_register: component rsh register function successful
[helios-login1:27853] mca: base: components_register: found loaded component tm
[helios-login1:27853] mca: base: components_register: component tm register function successful
[helios-login1:27853] mca: base: components_open: opening plm components
[helios-login1:27853] mca: base: components_open: found loaded component isolated
[helios-login1:27853] mca: base: components_open: component isolated open function successful
[helios-login1:27853] mca: base: components_open: found loaded component rsh
[helios-login1:27853] mca: base: components_open: component rsh open function successful
[helios-login1:27853] mca: base: components_open: found loaded component tm
[helios-login1:27853] mca: base: components_open: component tm open function successful
[helios-login1:27853] mca:base:select: Auto-selecting plm components
[helios-login1:27853] mca:base:select:(  plm) Querying component [isolated]
[helios-login1:27853] mca:base:select:(  plm) Query of component [isolated] set priority to 0
[helios-login1:27853] mca:base:select:(  plm) Querying component [rsh]
[helios-login1:27853] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[helios-login1:27853] mca:base:select:(  plm) Querying component [tm]
[helios-login1:27853] mca:base:select:(  plm) Skipping component [tm]. Query failed to return a module
[helios-login1:27853] mca:base:select:(  plm) Selected component [rsh]
[helios-login1:27853] mca: base: close: component isolated closed
[helios-login1:27853] mca: base: close: unloading component isolated
[helios-login1:27853] mca: base: close: component tm closed
[helios-login1:27853] mca: base: close: unloading component tm
[helios-login1:27853] mca: base: close: component rsh closed
[helios-login1:27853] mca: base: close: unloading component rsh
[mboisson@helios-login1 examples]$ echo $?
65


Maxime

<output_ringc_verbose.txt.gz>

<output_ringc_verbose2.txt.gz>




--
---------------------------------
Maxime Boissonneault
Computing Analyst - Calcul Québec, Université Laval
Ph.D. in Physics
