Yes - you don't want to use orte_launch_agent at all for that purpose.
What you need to set is an info key in your comm_spawn command,
"ompi_prefix", with the value set to the install path. The ssh launcher
will assemble the launch cmd using that info.
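A minimal sketch of that suggestion, under stated assumptions: the
"ompi_prefix" and "host" keys come from this thread, while the node name
"node01", the install path "/opt/openmpi", and the "./worker" executable
are placeholders invented for illustration. The prefix is expected to
play the same role as mpirun's --prefix option, so that the remote orted
is found under the install's bin directory and its libraries under lib.
Treat this as a starting point rather than a verified recipe.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Return error codes instead of aborting, so the spawn result can be checked. */
    MPI_Comm_set_errhandler(MPI_COMM_SELF, MPI_ERRORS_RETURN);

    MPI_Info info;
    MPI_Info_create(&info);

    /* Target host for the child process ("node01" is a placeholder). */
    MPI_Info_set(info, "host", "node01");

    /* Open MPI install path on the remote host ("/opt/openmpi" is an assumption). */
    MPI_Info_set(info, "ompi_prefix", "/opt/openmpi");

    /* Spawn one copy of a hypothetical "./worker" program. */
    MPI_Comm children;
    int rc = MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, info, 0,
                            MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);
    MPI_Info_free(&info);

    if (rc != MPI_SUCCESS) {
        fprintf(stderr, "MPI_Comm_spawn failed with code %d\n", rc);
    } else {
        MPI_Comm_disconnect(&children);
    }

    MPI_Finalize();
    return rc == MPI_SUCCESS ? 0 : 1;
}
```

Built with mpicc and started directly (without mpirun), this matches the
singleton comm_spawn scenario discussed in this thread.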
Ralph
On Sep 24, 2008, at 1:28 PM, Will Portnoy wrote:
Yes, your first sentence is correct. I intend to use the unmodified
orted, but I need to set up the unix environment after the ssh has
completed but before orted is executed.
In particular, one of the more important tasks for me to do after ssh
connects is to set LD_LIBRARY_PATH and PATH to include the paths of
the openmpi's install lib and bin directories, respectively.
Otherwise, orted will not be on the PATH, and its dependent libraries
will not be in LD_LIBRARY_PATH.
Is there a recommended method to set LD_LIBRARY_PATH and PATH when ssh
is used to connect to other hosts when running an mpi job?
thank you,
Will
On Wed, Sep 24, 2008 at 2:36 PM, Ralph Castain <r...@lanl.gov> wrote:
So this is a singleton comm_spawn scenario that requires you to specify
a launch_agent to execute? Just trying to ensure I understand.
First, let me ensure we have a common understanding of what
orte_launch_agent does. Basically, that param stipulates the command to
be used in place of "orted" - it doesn't substitute for "ssh". So if you
set -mca orte_launch_agent foo, what will happen is: "ssh nodename foo"
instead of "ssh nodename orted".
The intent was to provide a way to do things like run valgrind on the
orted itself. So you could do -mca orte_launch_agent "valgrind orted",
and we would dutifully run "ssh nodename valgrind orted".
Or if you wanted to write your own orted (e.g., bar-orted), you could
substitute it for our "orted".
Or if you wanted to set mca params solely to be seen on the backend
nodes/procs, you could set -mca orte_launch_agent "orted -mca foo bar",
and we would launch "ssh nodename orted -mca foo bar". This allows us to
set mca params without having mpirun see them - helps us to look at
debug output, for example, from only the backend procs.
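The substitution described above can be modeled with a short sketch.
This is an illustration only, not Open MPI's actual launcher code:
build_launch_cmd is a hypothetical helper, and passing NULL for the
agent stands for "orte_launch_agent not set".

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical model of the ssh launch command: the agent string
 * replaces "orted", never "ssh". Writes the command into out and
 * returns 0, or returns -1 if it does not fit in outlen bytes.
 */
int build_launch_cmd(char *out, size_t outlen,
                     const char *node, const char *agent)
{
    assert(out != NULL && node != NULL);

    /* Fall back to plain "orted" when no launch agent is given. */
    int n = snprintf(out, outlen, "ssh %s %s", node,
                     agent != NULL ? agent : "orted");

    /* snprintf reports the length it wanted; treat truncation as failure. */
    return (n >= 0 && (size_t)n < outlen) ? 0 : -1;
}
```

With node "nodename", an agent of "valgrind orted" yields
"ssh nodename valgrind orted", and an agent of "orted -mca foo bar"
yields "ssh nodename orted -mca foo bar".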
If what you need to do is set something in the environment for the
orted, there are certain cmd line options that will do that for you -
orte_launch_agent may or may not be a good method.
Perhaps it would help if you could tell me exactly what you wanted to
have orte_launch_agent actually do?
Thanks
Ralph
On Sep 24, 2008, at 12:22 PM, Will Portnoy wrote:
Sorry for the miscommunication: The processes are started by my
program with MPI_Comm_spawn, so there was no mpirun involved.
If you can suggest a test program I can use with mpirun to validate my
openmpi environment and install, that would probably produce the
output you would like to see.
But I'm not sure that will make it clear how the file pointed to by
"orte_launch_agent" in "mca-params.conf" should be written to set up an
environment and start orted.
Will
On Wed, Sep 24, 2008 at 2:17 PM, Ralph Castain <r...@lanl.gov> wrote:
Afraid I am confused. This was the entire output from the job?? If so,
then that means mpirun itself wasn't able to find a launch environment
it could use, so you never got to the point of actually launching an
orted.
Do you have ssh in your path? My best immediate guess is that you don't,
and that mpirun therefore doesn't see anything it can use to launch a
job. We have discussed internally that we need to improve that error
message - could be this is another case emphasizing that point.
1.3 is fine to use - still patching some bugs, but nothing that should
impact this issue.
Ralph
On Sep 24, 2008, at 12:11 PM, Will Portnoy wrote:
That was the output with plm_base_verbose set to 99 - it's the same
output with 1.
Yes, I'd like to use ssh.
orted wasn't starting properly with orte_launch_agent (which was
needed because my environment on the target machine wasn't set up), so
that's why I thought I would try it directly on the command line on
localhost. I thought this was a simpler case: to verify that orted
could find all of its necessary components without the complexity of
everything else I'm doing.
If I needed to use orte_launch_agent, how should I pass the necessary
parameters to start orted after I set up my environment?
Am I better off using trunk over 1.3?
thank you,
Will
On Wed, Sep 24, 2008 at 2:01 PM, Ralph Castain <r...@lanl.gov> wrote:
Could you rerun that with -mca plm_base_verbose 1? What environment are
you in - I assume rsh/ssh?
I would like to see the cmd line being used to launch the orted. What
this indicates is that we are not getting the cmd line correct. Could
just be that some patch in the trunk didn't get completely applied to
the 1.3 branch.
BTW: you probably can't run orted directly off of the cmd line. It
likely needs some cmd line params to get critical info.
Ralph
On Sep 24, 2008, at 9:47 AM, Will Portnoy wrote:
I'm trying to use MPI_Comm_spawn with MPI_Info's host key to spawn
processes from a process not started with mpirun. This works with the
host key set to the localhost's hostname, but it does not work when I
use other hosts.
I'm using version 1.3a1r19602. I need to use orte_launch_agent to set
up my environment a bit before orted is started, but it fails with
errors listed below.
When I try to run orted directly on the command line with some of the
verbosity flags turned to "11", I receive the same messages.
Does anybody have any suggestions?
thank you,
Will
[fqdn:24761] mca: base: components_open: Looking for ess components
[fqdn:24761] mca: base: components_open: opening ess components
[fqdn:24761] mca: base: components_open: found loaded component env
[fqdn:24761] mca: base: components_open: component env has no register function
[fqdn:24761] mca: base: components_open: component env open function successful
[fqdn:24761] mca: base: components_open: found loaded component hnp
[fqdn:24761] mca: base: components_open: component hnp has no register function
[fqdn:24761] mca: base: components_open: component hnp open function successful
[fqdn:24761] mca: base: components_open: found loaded component singleton
[fqdn:24761] mca: base: components_open: component singleton has no register function
[fqdn:24761] mca: base: components_open: component singleton open function successful
[fqdn:24761] mca: base: components_open: found loaded component slurm
[fqdn:24761] mca: base: components_open: component slurm has no register function
[fqdn:24761] mca: base: components_open: component slurm open function successful
[fqdn:24761] mca: base: components_open: found loaded component tool
[fqdn:24761] mca: base: components_open: component tool has no register function
[fqdn:24761] mca: base: components_open: component tool open function successful
[fqdn:24761] mca:base:select: Auto-selecting ess components
[fqdn:24761] mca:base:select:( ess) Querying component [env]
[fqdn:24761] mca:base:select:( ess) Skipping component [env]. Query failed to return a module
[fqdn:24761] mca:base:select:( ess) Querying component [hnp]
[fqdn:24761] mca:base:select:( ess) Skipping component [hnp]. Query failed to return a module
[fqdn:24761] mca:base:select:( ess) Querying component [singleton]
[fqdn:24761] mca:base:select:( ess) Skipping component [singleton]. Query failed to return a module
[fqdn:24761] mca:base:select:( ess) Querying component [slurm]
[fqdn:24761] mca:base:select:( ess) Skipping component [slurm]. Query failed to return a module
[fqdn:24761] mca:base:select:( ess) Querying component [tool]
[fqdn:24761] mca:base:select:( ess) Skipping component [tool]. Query failed to return a module
[fqdn:24761] mca:base:select:( ess) No component selected!
[fqdn:24761] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 125
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_ess_base_select failed
--> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[fqdn:24761] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file orted/orted_main.c at line 315
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users