Indeed, that makes sense now.
Why isn't Open MPI attempting to connect over the loopback interface for
processes on the same node? This used to work with 1.6.5.
Maxime
On 2014-08-18 13:11, Ralph Castain wrote:
Yep, that pinpointed the problem:
[helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING
[helios-login1:28558] [[63019,1],0]:tcp:complete_connect called for peer
[[63019,0],0] on socket 11
[helios-login1:28558] [[63019,1],0]-[[63019,0],0] tcp_peer_complete_connect:
connection failed: Connection refused (111)
[helios-login1:28558] [[63019,1],0] tcp_peer_close for [[63019,0],0] sd 11
state CONNECTING
[helios-login1:28558] [[63019,1],0] tcp:lost connection called for peer
[[63019,0],0]
The apps are trying to connect back to mpirun using the following addresses:
tcp://10.10.1.3,132.219.137.36,10.12.1.3:34237
The initial attempt is here
[helios-login1:28558] [[63019,1],0] orte_tcp_peer_try_connect: attempting to
connect to proc [[63019,0],0] on 10.10.1.3:34237 - 0 retries
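To see which endpoints the application processes have available, the contact URI quoted above can be split into its candidate addresses. This is a minimal sketch in shell, assuming the URI always has the form tcp://addr1,addr2,...:port. The variable names are illustrative and not part of Open MPI.

```shell
# Contact URI copied from the log output above.
uri="tcp://10.10.1.3,132.219.137.36,10.12.1.3:34237"

# Strip the scheme, then split off the port after the last colon.
rest="${uri#tcp://}"
port="${rest##*:}"
addrs="${rest%:*}"

# Build one candidate endpoint per line, in the order listed in the URI.
endpoints="$(printf '%s\n' "$addrs" | tr ',' '\n' | sed "s/\$/:${port}/")"
printf '%s\n' "$endpoints"
```

The first entry, 10.10.1.3:34237, is the address used in the initial connection attempt shown above.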
I know there is a failover bug in the 1.8 series, and so if that connection got
rejected the proc would abort. Should we be using a different network? If so,
telling us via the oob_tcp_if_include param would be the solution.
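As an illustration, the parameter can be given on the mpirun command line. This is only a sketch. The interface name "lo" is an assumption for a single-node run, and whether loopback is the right choice depends on the site and the Open MPI version. The parameter takes a comma-separated list of interface names or subnets in CIDR form.

```shell
# Assumption: "lo" is the loopback interface name on this node. Substitute
# the interface (or CIDR subnet) that the out-of-band channel should use.
oob_if="lo"

# Same ring_c launch as in the thread, restricted to that interface.
cmd="mpirun -np 4 -mca oob_tcp_if_include ${oob_if} ring_c"
echo "$cmd"

# Uncomment to actually launch:
# $cmd
```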
On Aug 18, 2014, at 10:04 AM, Maxime Boissonneault
<maxime.boissonnea...@calculquebec.ca> wrote:
Here it is.
Maxime
On 2014-08-18 12:59, Ralph Castain wrote:
Ah...now that showed the problem. To pinpoint it better, please add
-mca oob_base_verbose 10
and I think we'll have it
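Put together with the flags from the earlier run (quoted further down in this thread), the full invocation would look roughly like the sketch below. The output file name is hypothetical.

```shell
# Verbosity flags from the earlier run plus the newly requested oob_base_verbose.
flags="--mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 -mca oob_base_verbose 10"
cmd="mpirun -np 4 ${flags} ring_c"
echo "$cmd"

# To run it and keep a copy of stdout and stderr (file name is hypothetical):
# $cmd 2>&1 | tee output_ringc_oob_verbose.txt
```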
On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault
<maxime.boissonnea...@calculquebec.ca> wrote:
This is all on one node indeed.
Attached is the output of
mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 ring_c |& tee output_ringc_verbose.txt
Maxime
On 2014-08-18 12:48, Ralph Castain wrote:
This is all on one node, yes?
Try adding the following:
-mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5
Lot of garbage, but should tell us what is going on.
On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault
<maxime.boissonnea...@calculquebec.ca> wrote:
Here it is
On 2014-08-18 12:30, Joshua Ladd wrote:
mpirun -np 4 --mca plm_base_verbose 10
[mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 ring_c
[helios-login1:27853] mca: base: components_register: registering plm components
[helios-login1:27853] mca: base: components_register: found loaded component
isolated
[helios-login1:27853] mca: base: components_register: component isolated has no
register or open function
[helios-login1:27853] mca: base: components_register: found loaded component rsh
[helios-login1:27853] mca: base: components_register: component rsh register
function successful
[helios-login1:27853] mca: base: components_register: found loaded component tm
[helios-login1:27853] mca: base: components_register: component tm register
function successful
[helios-login1:27853] mca: base: components_open: opening plm components
[helios-login1:27853] mca: base: components_open: found loaded component
isolated
[helios-login1:27853] mca: base: components_open: component isolated open
function successful
[helios-login1:27853] mca: base: components_open: found loaded component rsh
[helios-login1:27853] mca: base: components_open: component rsh open function
successful
[helios-login1:27853] mca: base: components_open: found loaded component tm
[helios-login1:27853] mca: base: components_open: component tm open function
successful
[helios-login1:27853] mca:base:select: Auto-selecting plm components
[helios-login1:27853] mca:base:select:( plm) Querying component [isolated]
[helios-login1:27853] mca:base:select:( plm) Query of component [isolated] set
priority to 0
[helios-login1:27853] mca:base:select:( plm) Querying component [rsh]
[helios-login1:27853] mca:base:select:( plm) Query of component [rsh] set
priority to 10
[helios-login1:27853] mca:base:select:( plm) Querying component [tm]
[helios-login1:27853] mca:base:select:( plm) Skipping component [tm]. Query
failed to return a module
[helios-login1:27853] mca:base:select:( plm) Selected component [rsh]
[helios-login1:27853] mca: base: close: component isolated closed
[helios-login1:27853] mca: base: close: unloading component isolated
[helios-login1:27853] mca: base: close: component tm closed
[helios-login1:27853] mca: base: close: unloading component tm
[helios-login1:27853] mca: base: close: component rsh closed
[helios-login1:27853] mca: base: close: unloading component rsh
[mboisson@helios-login1 examples]$ echo $?
65
Maxime
_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2014/08/25052.php
--
---------------------------------
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique
<output_ringc_verbose.txt.gz>
<output_ringc_verbose2.txt.gz>