Re: [OMPI users] Multiple mpiexec's within a job (schedule within a scheduled machinefile/job allocation)

Ralph Castain Wed, 29 Jul 2009 18:19:41 -0400

Oh my - that does take me back a long way! :-)

Do you need these processes to be mapped byslot (i.e., do you care if the process ranks are sharing nodes)? If not, why not add "-bynode" to your cmd line?


Alternatively, given the mapping you want, just do

mpirun -npernode 1 application.exe

This would launch one copy on each of your N nodes. So if you fork M times, you'll wind up with the exact pattern you wanted. And, as each one exits, you could immediately launch a replacement without worrying about oversubscription.
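
A minimal sketch of the driver side of this suggestion, in Python, assuming the serial program can be written as a worker pool instead of raw fork calls. The name run_tiled and its parameters are hypothetical; only the mpirun command line comes from the message above. Each worker blocks on one command, and the pool starts the next queued command as soon as a worker becomes free.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_tiled(commands, max_concurrent):
    """Run every command, keeping at most max_concurrent alive at once.

    commands is a list of argv lists. Returns the exit codes in the
    same order as commands. (Hypothetical helper, not part of Open MPI.)
    """
    def run_one(argv):
        # Blocks until this one command exits, like a forked child
        # that calls mpirun and waits for completion.
        return subprocess.run(argv).returncode

    # The pool hands a waiting command to a worker as soon as that
    # worker's previous command has returned.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(run_one, commands))
```

For example, run_tiled([["mpirun", "-npernode", "1", "application.exe"]] * 10, max_concurrent=M) would keep M copies in flight until all ten have run; M and the count of ten are placeholders.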


Does that help?
Ralph

PS. We dropped that "persistent" operation - caused way too many problems with cleanup and other things. :-)


On Jul 29, 2009, at 3:46 PM, Adams, Brian M wrote:

Hi Ralph (all),

I'm resurrecting this 2006 thread for a status check. The new 1.3.x machinefile behavior is great (thanks!) -- I can use machinefiles to manage multiple simultaneous mpiruns within a single torque allocation (where the hosts are a subset of $PBS_NODEFILE). However, this requires some careful management of machinefiles.
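
A rough sketch of the machinefile bookkeeping described above, in Python. It assumes $PBS_NODEFILE holds one line per allocated processor slot and that consecutive groups of N lines are an acceptable partition. The function names and the "machinefile." prefix are made up for illustration.

```python
import os


def split_hosts(hosts, n):
    """Chunk a list of slot entries into consecutive groups of n.

    Leftover entries that do not fill a whole group are dropped.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    usable = len(hosts) - len(hosts) % n
    return [hosts[i:i + n] for i in range(0, usable, n)]


def write_machinefiles(groups, directory, prefix="machinefile."):
    """Write one machinefile per group and return the file paths."""
    paths = []
    for k, group in enumerate(groups):
        path = os.path.join(directory, f"{prefix}{k}")
        with open(path, "w") as fh:
            fh.write("\n".join(group) + "\n")
        paths.append(path)
    return paths
```

Each of the M runs would then be started with its own file, along the lines of "mpirun -machinefile machinefile.0 -np N application.exe".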

I'm curious if OpenMPI now directly supports the behavior I need, described in general in the quote below. Specifically, given a single PBS/Torque allocation of M*N processors, I will run a serial program that will fork M times. Each of the M forked processes calls 'mpirun -np N application.exe' and blocks until completion. This seems akin to the case you described of "mpiruns executed in separate windows/prompts."

What I'd like to see is the M processes "tiled" across the available slots, so all M*N processors are used. What I see instead appears at face value to be the first N resources being oversubscribed M times.

Also, when one of the forked processes returns, I'd like to be able to spawn another and have its mpirun schedule on the resources freed by the previous one that exited. Is any of this possible?

I tried starting an orted (1.3.3, roughly as you suggested below), but got this error:

orted --daemonize

[gy8:25871] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 125
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

 orte_ess_base_select failed
 --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------

[gy8:25871] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file orted/orted_main.c at line 323

I spared the debugging info as I'm not even sure this is a correct invocation...


Thanks for any suggestions you can offer!
Brian
----------
Brian M. Adams, PhD ([email protected])
Optimization and Uncertainty Quantification
Sandia National Laboratories, Albuquerque, NM
http://www.sandia.gov/~briadam

From: Ralph Castain (rhc_at_[hidden])
Date: 2006-12-12 00:46:59

Hi Chris

Some of this is doable with today's code....and one of these
behaviors is not. :-(

Open MPI/OpenRTE can be run in "persistent" mode - this
allows multiple jobs to share the same allocation. This works
much as you describe (syntax is slightly different, of
course!) - the first mpirun will map using whatever mode was
requested, then the next mpirun will map starting from where
the first one left off.

I *believe* you can run each mpirun in the background.
However, I don't know if this has really been tested enough
to support such a claim. All testing that I know about
to-date has executed mpirun in the foreground - thus, your
example would execute sequentially instead of in parallel.

I know people have tested multiple mpirun's operating in
parallel within a single allocation (i.e., persistent mode)
where the mpiruns are executed in separate windows/prompts.
So I suspect you could do something like you describe - just
haven't personally verified it.

Where we definitely differ is that Open MPI/RTE will *not*
block until resources are freed up from the prior mpiruns.
Instead, we will attempt to execute each mpirun immediately -
and will error out the one(s) that try to execute without
sufficient resources. I imagine we could provide the kind of
"flow control" you describe, but I'm not sure when that might happen.
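
As an illustration only, the kind of "flow control" discussed here could be approximated on the caller's side by counting slots before each launch. The Python sketch below is not an Open MPI feature; the SlotGate class and its methods are invented for this example, and it assumes the launching program knows the allocation's total slot count.

```python
import threading


class SlotGate:
    """Block callers until enough slots are free (hypothetical helper)."""

    def __init__(self, total_slots):
        self._total = total_slots
        self._free = total_slots
        self._cond = threading.Condition()

    def acquire(self, n, timeout=None):
        """Wait until n slots are free, then claim them.

        Returns True on success, False if the timeout expired first.
        """
        if n > self._total:
            raise ValueError("request exceeds the whole allocation")
        with self._cond:
            ok = self._cond.wait_for(lambda: self._free >= n, timeout)
            if ok:
                self._free -= n
            return ok

    def release(self, n):
        """Return n slots and wake any waiting callers."""
        with self._cond:
            self._free += n
            self._cond.notify_all()
```

A launcher thread would call acquire(N) before starting an mpirun of N processes and release(N) after it exits, so a job that does not fit waits for slots and is never started against an allocation that is already full.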

I am (in my copious free time...haha) working on an
"orteboot" program that will startup a virtual machine to
make the persistent mode of operation a little easier. For
now, though, you can do it by:

1. starting up the "server" using the following command:
orted --seed --persistent --scope public [--universe foo]

2. do your mpirun commands. They will automagically find the
"server" and connect to it. If you specified a universe name
when starting the server, then you must specify the same
universe name on your mpirun commands.

When you are done, you will have to (unfortunately) manually
"kill" the server and remove its session directory. I have a
program called "ortehalt" in the trunk that will do this
cleanly for you, but it isn't yet in the release
distributions. You are welcome to use it, though, if you are
working with the trunk - I can't promise it is bulletproof
yet, but it seems to be working.

Ralph

On 12/11/06 8:07 PM, "Maestas, Christopher Daniel" <cdmaest_at_[hidden]> wrote:

Hello,

Sometimes we have users that like to do the following from within a
single job (think schedule within a job scheduler allocation):
"mpiexec -n X myprog"
"mpiexec -n Y myprog2"
Does mpiexec within Open MPI keep track of the node list it is using
if it binds to a particular scheduler?
For example with 4 nodes (2ppn SMP):
"mpiexec -n 2 myprog"
"mpiexec -n 2 myprog2"
"mpiexec -n 1 myprog3"
And assuming this is a by-slot allocation, we would have the following
allocation:
node1 - processor1 - myprog
      - processor2 - myprog
node2 - processor1 - myprog2
      - processor2 - myprog2
And for a by-node allocation:
node1 - processor1 - myprog
      - processor2 - myprog2
node2 - processor1 - myprog
      - processor2 - myprog2
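
The two layouts above can be reproduced with a small Python model. This is only an illustration of the by-slot and by-node orderings, restricted to the two nodes shown in the tables; it is not how Open MPI's mapper is implemented. The function names are hypothetical, and the model assumes each job continues from where the previous one stopped.

```python
def slot_order(nodes, ppn, policy):
    """List (node, processor) slots in the order a policy fills them."""
    procs = range(1, ppn + 1)
    if policy == "byslot":
        # Fill every processor on a node before moving to the next node.
        return [(n, p) for n in nodes for p in procs]
    if policy == "bynode":
        # Take one processor on each node in turn, then wrap around.
        return [(n, p) for p in procs for n in nodes]
    raise ValueError(f"unknown policy: {policy}")


def place(jobs, nodes, ppn, policy):
    """Assign jobs to slots, each job continuing after the previous one.

    jobs is a list of (program_name, process_count) pairs.
    Returns a dict mapping (node, processor) to the program name.
    """
    order = slot_order(nodes, ppn, policy)
    placement = {}
    cursor = 0
    for name, nprocs in jobs:
        if cursor + nprocs > len(order):
            raise RuntimeError(f"not enough free slots for {name}")
        for slot in order[cursor:cursor + nprocs]:
            placement[slot] = name
        cursor += nprocs
    return placement
```

With jobs [("myprog", 2), ("myprog2", 2)], nodes ["node1", "node2"] and ppn 2, the "byslot" policy puts both myprog processes on node1, while "bynode" puts one process of each program on each node.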

I think this is possible using ssh because it shouldn't really matter
how many times it spawns, but with something like torque it would get
restricted to a max process launch of 4. We would want the third
mpiexec to block processes and eventually be run on the first
available node allocation that frees up from myprog or myprog2 ....

For example for torque, we had to add the following to osc mpiexec:
---
Finally, since only one mpiexec can be the master at a time, if your
code setup requires that mpiexec exit to get a result, you can start a
"dummy" mpiexec first in your batch job:

mpiexec -server

It runs no tasks itself but handles the connections of other transient
mpiexec clients.
It will shut down cleanly when the batch job exits or you may kill the
server explicitly.
If the server is killed with SIGTERM (or HUP or INT), it will exit
with a status of zero if there were no clients connected at the time.
If there were still clients using the server, the server will kill all
their tasks, disconnect from the clients, and exit with status 1.
---

So a user ran:
mpiexec -server
mpiexec -n 2 myprog
mpiexec -n 2 myprog2
And the server kept track of the allocation ... I would think that the
orted could do this?

Sorry if this sounds confusing ... But I'm sure it will clear up with
any further responses I make. :-) -cdm


_______________________________________________
users mailing list
users_at_[hidden]
http://www.open-mpi.org/mailman/listinfo.cgi/users

