OK, I confirm that it works with
mpiexec -mca oob_tcp_if_include lo ring_c
It also works with
mpiexec -mca oob_tcp_if_include ib0 ring_c
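For what it's worth, the same parameter can also be made persistent instead of being passed on every command line. A minimal sketch, assuming the default per-user MCA parameter file location and that ib0 and lo are the interfaces we want:

# make the setting the default for this user (file is read by Open MPI at startup)
mkdir -p $HOME/.openmpi
echo "oob_tcp_if_include = ib0,lo" >> $HOME/.openmpi/mca-params.conf

# or, equivalently, just for the current shell session
export OMPI_MCA_oob_tcp_if_include=ib0,lo
mpiexec ring_c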
We have 4 interfaces on this node:
- lo, the loopback interface
- ib0, InfiniBand
- eth2, a management network
- eth3, the public network
It seems that mpiexec attempts to use the two interfaces that do not work
(eth2 and eth3) and does not use the two that do work (ib0 and lo).
However, according to the logs sent previously, it does see ib0 (although
not lo), yet it does not attempt to use it.
On the compute nodes, we have eth0 (management), ib0 and lo, and it works
there. I am unsure why it works on the compute nodes but not on the login
nodes; the only difference is the presence of a public interface on the
login nodes.
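As a workaround on our side, I suppose we could also exclude the two problematic interfaces instead of listing the good ones. A sketch, assuming oob_tcp_if_exclude behaves as the mirror of oob_tcp_if_include (the two should not be set at the same time):

# keep the runtime's TCP traffic off the management and public interfaces
mpiexec -mca oob_tcp_if_exclude eth2,eth3 ring_c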
Maxime
On 2014-08-18 13:37, Ralph Castain wrote:
Yeah, there are some issues with the internal connection logic that need to get
fixed. We haven't had many cases where it's been an issue, but a couple like
this have cropped up - enough that I need to set aside some time to fix it.
My apologies for the problem.
On Aug 18, 2014, at 10:31 AM, Maxime Boissonneault
<maxime.boissonnea...@calculquebec.ca> wrote:
Indeed, that makes sense now.
Why isn't Open MPI attempting to connect over the loopback interface when the processes are on the same node?
This used to work with 1.6.5.
Maxime
On 2014-08-18 13:11, Ralph Castain wrote:
Yep, that pinpointed the problem:
[helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING
[helios-login1:28558] [[63019,1],0]:tcp:complete_connect called for peer
[[63019,0],0] on socket 11
[helios-login1:28558] [[63019,1],0]-[[63019,0],0] tcp_peer_complete_connect:
connection failed: Connection refused (111)
[helios-login1:28558] [[63019,1],0] tcp_peer_close for [[63019,0],0] sd 11
state CONNECTING
[helios-login1:28558] [[63019,1],0] tcp:lost connection called for peer
[[63019,0],0]
The apps are trying to connect back to mpirun using the following addresses:
tcp://10.10.1.3,132.219.137.36,10.12.1.3:34237
The initial attempt is here:
[helios-login1:28558] [[63019,1],0] orte_tcp_peer_try_connect: attempting to
connect to proc [[63019,0],0] on 10.10.1.3:34237 - 0 retries
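To map those addresses back to interfaces on the login node, standard Linux tools are enough; for example:

# list every interface together with its assigned addresses
ip addr show
# or, on systems without iproute2
/sbin/ifconfig -a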
I know there is a failover bug in the 1.8 series, so if that connection got
rejected, the proc would abort. Should we be using a different network? If so,
telling us which one via the oob_tcp_if_include param would be the solution.
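For example, something along these lines (the interface names here are only the ones reported to work above; adjust as needed):

# restrict the OOB/TCP traffic of mpirun and the app procs to InfiniBand and loopback
mpirun -np 4 -mca oob_tcp_if_include ib0,lo ring_c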
On Aug 18, 2014, at 10:04 AM, Maxime Boissonneault
<maxime.boissonnea...@calculquebec.ca> wrote:
Here it is.
Maxime
On 2014-08-18 12:59, Ralph Castain wrote:
Ah... now that showed the problem. To pinpoint it better, please add
-mca oob_base_verbose 10
and I think we'll have it.
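That is, on top of the flags from the previous run, something like (the exact combination is only for illustration):

mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 \
    -mca state_base_verbose 5 -mca errmgr_base_verbose 5 \
    -mca oob_base_verbose 10 ring_c |& tee output_ringc_verbose2.txt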
On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault
<maxime.boissonnea...@calculquebec.ca> wrote:
This is all on one node indeed.
Attached is the output of
mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca
state_base_verbose 5 -mca errmgr_base_verbose 5 ring_c |& tee
output_ringc_verbose.txt
Maxime
On 2014-08-18 12:48, Ralph Castain wrote:
This is all on one node, yes?
Try adding the following:
-mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5
Lots of garbage, but it should tell us what is going on.
On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault
<maxime.boissonnea...@calculquebec.ca> wrote:
Here it is.
On 2014-08-18 12:30, Joshua Ladd wrote:
mpirun -np 4 --mca plm_base_verbose 10
[mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 ring_c
[helios-login1:27853] mca: base: components_register: registering plm components
[helios-login1:27853] mca: base: components_register: found loaded component
isolated
[helios-login1:27853] mca: base: components_register: component isolated has no
register or open function
[helios-login1:27853] mca: base: components_register: found loaded component rsh
[helios-login1:27853] mca: base: components_register: component rsh register
function successful
[helios-login1:27853] mca: base: components_register: found loaded component tm
[helios-login1:27853] mca: base: components_register: component tm register
function successful
[helios-login1:27853] mca: base: components_open: opening plm components
[helios-login1:27853] mca: base: components_open: found loaded component
isolated
[helios-login1:27853] mca: base: components_open: component isolated open
function successful
[helios-login1:27853] mca: base: components_open: found loaded component rsh
[helios-login1:27853] mca: base: components_open: component rsh open function
successful
[helios-login1:27853] mca: base: components_open: found loaded component tm
[helios-login1:27853] mca: base: components_open: component tm open function
successful
[helios-login1:27853] mca:base:select: Auto-selecting plm components
[helios-login1:27853] mca:base:select:( plm) Querying component [isolated]
[helios-login1:27853] mca:base:select:( plm) Query of component [isolated] set
priority to 0
[helios-login1:27853] mca:base:select:( plm) Querying component [rsh]
[helios-login1:27853] mca:base:select:( plm) Query of component [rsh] set
priority to 10
[helios-login1:27853] mca:base:select:( plm) Querying component [tm]
[helios-login1:27853] mca:base:select:( plm) Skipping component [tm]. Query
failed to return a module
[helios-login1:27853] mca:base:select:( plm) Selected component [rsh]
[helios-login1:27853] mca: base: close: component isolated closed
[helios-login1:27853] mca: base: close: unloading component isolated
[helios-login1:27853] mca: base: close: component tm closed
[helios-login1:27853] mca: base: close: unloading component tm
[helios-login1:27853] mca: base: close: component rsh closed
[helios-login1:27853] mca: base: close: unloading component rsh
[mboisson@helios-login1 examples]$ echo $?
65
Maxime
<output_ringc_verbose.txt.gz>
<output_ringc_verbose2.txt.gz>
--
---------------------------------
Maxime Boissonneault
Computing Analyst - Calcul Québec, Université Laval
Ph.D. in Physics