OK, I get it running if I specify /usr/local/bin/mpiexec instead of just
mpiexec.
Now, the program hangs at the first MPI_Waitall on the remote node.
The program runs just fine if both nodes are on the same machine.
Any ideas on how to debug this?
Many Thanks
Richard
----- Original Message -----
From: "Jeff Squyres" <jsquy...@cisco.com>
To: "Open MPI Users" <us...@open-mpi.org>
Sent: Tuesday, February 14, 2012 11:13 AM
Subject: Re: [OMPI users] MPI orte_init fails on remote nodes
Make sure that your LD_LIBRARY_PATH is being set in your shell startup files
for *non-interactive logins*.
For example, ensure that LD_LIBRARY_PATH is set properly, even in this case:
-----
ssh some-other-node env | grep LD_LIBRARY_PATH
-----
(note that this is different than "ssh some-other-node echo $LD_LIBRARY_PATH", because the "$LD_LIBRARY_PATH" will be evaluated on
the local node, even before ssh is invoked)
I mention this because some shell startup files distinguish between interactive and non-interactive logins; they sometimes
terminate early for non-interactive logins. Look for "exit" statements, or conditional blocks that are only invoked during
interactive logins, for example.
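To illustrate the point about startup files, here is a minimal sketch of a startup-file layout. It assumes bash, an Open MPI install under /usr/local, and that ~/.bashrc (or whichever file your shell actually reads for non-interactive ssh commands) is the file being edited. The alias is only a placeholder for interactive-only settings. The idea is simply that the exports come before any interactive-only block or early exit.

```shell
# Sketch of a startup-file fragment (assumption: bash, Open MPI under /usr/local).

# 1. Settings that non-interactive ssh commands also need go FIRST.
export PATH="/usr/local/bin:${PATH}"
export LD_LIBRARY_PATH="/usr/local/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

# 2. Interactive-only settings stay inside a guarded block, so that a
#    non-interactive login skips them without skipping the exports above.
case $- in
    *i*)
        # Placeholder for prompt, aliases, and other interactive settings.
        alias ll='ls -l'
        ;;
esac
```

With this ordering, "ssh some-other-node env | grep LD_LIBRARY_PATH" should print a value that includes /usr/local/lib.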
On Feb 14, 2012, at 5:40 AM, Richard Bardwell wrote:
Jeff,
I wiped out all versions of openmpi on all the nodes including the distro
installed version.
I reinstalled version 1.4.4 on all nodes.
I now get the error that libopen-rte.so.0 cannot be found when running mpiexec
across different nodes, even though the LD_LIBRARY_PATH for all nodes points to
/usr/local/lib, where the file exists. Any ideas ?
Many Thanks
Richard
----- Original Message -----
From: "Jeff Squyres" <jsquy...@cisco.com>
To: "Open MPI Users" <us...@open-mpi.org>
Sent: Monday, February 13, 2012 6:28 PM
Subject: Re: [OMPI users] MPI orte_init fails on remote nodes
You might want to fully uninstall the distro-installed version of Open MPI on all the nodes (e.g., Red Hat may have installed a
different version of Open MPI, and that version is being found in your $PATH before your custom-installed version).
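One way to see whether more than one copy is on the search path is to list every match in lookup order. The sketch below is only an illustration: find_all_in_path is a hypothetical helper name, not an Open MPI tool, and it is meant to be run on each node in turn.

```shell
# find_all_in_path NAME PATHLIST
# Print every executable called NAME found along the colon-separated
# PATHLIST, in lookup order. The first line printed is the copy the shell
# would actually run. The body runs in a subshell so IFS and the noglob
# setting do not leak into the caller.
find_all_in_path() (
    name=$1
    IFS=:
    set -f                      # no filename globbing while splitting $2
    for dir in $2; do
        candidate="${dir:-.}/$name"   # an empty PATH entry means "."
        if [ -f "$candidate" ] && [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
        fi
    done
)

# Show every mpiexec visible on this node.
find_all_in_path mpiexec "$PATH"
```

If this prints more than one line, or a different first line on different nodes, an older install is probably still being picked up somewhere.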
On Feb 13, 2012, at 12:12 PM, Richard Bardwell wrote:
OK, 1.4.4 is happily installed on both machines. But, I now get a really
weird error when running on the 2 nodes. I get
Error: unknown option "--daemonize"
even though I am just running with -np 2 -hostfile test.hst
The program runs fine on 2 cores if running locally on each node.
Any ideas ??
Thanks
Richard
----- Original Message -----
From: "Gustavo Correa" <g...@ldeo.columbia.edu>
To: "Open MPI Users" <us...@open-mpi.org>
Sent: Monday, February 13, 2012 4:22 PM
Subject: Re: [OMPI users] MPI orte_init fails on remote nodes
On Feb 13, 2012, at 11:02 AM, Richard Bardwell wrote:
Ralph
I had done a make clean in the 1.2.8 directory if that is what you meant ?
Or do I need to do something else ?
I appreciate your help on this by the way ;-)
Hi Richard
You can install in a different directory, totally separate from 1.2.8.
Create a new work directory [which is not the final installation directory,
just work, say /tmp/openmpi/1.4.4/work].
Launch the OpenMPI 1.4.4 configure script from this new work directory with the --prefix pointing to your desired installation
directory [e.g. /home/richard/openmpi/1.4.4/].
I am assuming this is NFS mounted on the nodes [if you have a cluster].
[Check all options with 'configure --help'.]
Then do make, make install.
Finally set your PATH and LD_LIBRARY_PATH to point to the new installation
directory, to prevent mixing with the old 1.2.8.
I have a number of OpenMPI versions here, compiled with various compilers,
and they coexist well this way.
I hope this helps,
Gus Correa
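The steps above could be sketched roughly as follows. The build commands are shown as comments because they need the unpacked source tree, and /path/to/openmpi-1.4.4 is a placeholder for wherever that tree lives. use_openmpi is a hypothetical helper name for the final PATH and LD_LIBRARY_PATH step; the two directories are the examples from the message.

```shell
# Build in a scratch directory, install into a separate prefix
# (placeholder source location, example directories from the message):
#   mkdir -p /tmp/openmpi/1.4.4/work && cd /tmp/openmpi/1.4.4/work
#   /path/to/openmpi-1.4.4/configure --prefix=/home/richard/openmpi/1.4.4
#   make
#   make install

# use_openmpi PREFIX
# Put one installation first in PATH and LD_LIBRARY_PATH so that its
# mpiexec and libraries are found before any older copy.
use_openmpi() {
    prefix=${1%/}               # drop a trailing slash, if any
    PATH="$prefix/bin:$PATH"
    LD_LIBRARY_PATH="$prefix/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export PATH LD_LIBRARY_PATH
}

use_openmpi /home/richard/openmpi/1.4.4
```

The same settings have to be in effect on every node, including for non-interactive ssh logins, or the remote side may still pick up the old version.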
----- Original Message -----
From: Ralph Castain
To: Open MPI Users
Sent: Monday, February 13, 2012 3:41 PM
Subject: Re: [OMPI users] MPI orte_init fails on remote nodes
You need to clean out the old attempt - that is a stale file
On Feb 13, 2012, at 7:36 AM, "Richard Bardwell" <rich...@sharc.co.uk> wrote:
OK, I installed 1.4.4, rebuilt the exec and guess what ...... I now get some
weird errors as below:
mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_ras_dash_host
along with a few other files
even though the .so / .la files are all there !
----- Original Message -----
From: Ralph Castain
To: Open MPI Users
Sent: Monday, February 13, 2012 2:59 PM
Subject: Re: [OMPI users] MPI orte_init fails on remote nodes
Good heavens - where did you find something that old? Can you use a more recent
version?
Gentlemen
I am struggling to get MPI working when the hostfile contains different nodes.
I get the error below. Any ideas ?? I can ssh without password between the two
nodes. I am running 1.2.8 MPI on both machines.
Any help most appreciated !!!!!
MPITEST/v8_mpi_test> mpiexec -n 2 --debug-daemons -hostfile test.hst /home/sharc/MPITEST/v8_mpi_test/mpitest
Daemon [0,0,1] checking in as pid 10490 on host 192.0.2.67
[linux-z0je:08804] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at line 182
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_rml_base_select failed
--> Returned value -13 instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[linux-z0je:08804] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_system_init.c at line 42
[linux-z0je:08804] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 52
Open RTE was unable to initialize properly. The error occured while
attempting to orte_init(). Returned value -13 instead of ORTE_SUCCESS.
[linux-tmpw:10490] [0,0,1] orted_recv_pls: received message from [0,0,0]
[linux-tmpw:10490] [0,0,1] orted_recv_pls: received kill_local_procs
[linux-tmpw:10489] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[linux-tmpw:10489] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1158
[linux-tmpw:10489] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[linux-tmpw:10489] ERROR: A daemon on node 192.0.2.68 failed to start as expected.
[linux-tmpw:10489] ERROR: There may be more information available from
[linux-tmpw:10489] ERROR: the remote shell (see above).
[linux-tmpw:10489] ERROR: The daemon exited unexpectedly with status 243.
[linux-tmpw:10490] [0,0,1] orted_recv_pls: received message from [0,0,0]
[linux-tmpw:10490] [0,0,1] orted_recv_pls: received exit
[linux-tmpw:10489] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[linux-tmpw:10489] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1190
--------------------------------------------------------------------------
mpiexec was unable to cleanly terminate the daemons for this job. Returned
value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/