Re: [OMPI users] Multiple mpiexec's within a job (schedule within a scheduled machinefile/job allocation)

Ralph Castain Thu, 30 Jul 2009 16:07:48 -0400

Let me know how it goes, if you don't mind. It would be nice to knowif we actually met your needs, or if a tweak might help make it easier.


Thanks
Ralph


On Jul 30, 2009, at 1:36 PM, Adams, Brian M wrote:

Thanks Ralph, I wasn't aware of the relative indexing or sequentialmapper capabilities. I will check those out and report back if Istill have a feature request. -- Brian

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]On Behalf Of Ralph Castain

Sent: Thursday, July 30, 2009 12:26 PM
To: Open MPI Users

Subject: Re: [OMPI users] Multiple mpiexec's within a job (schedulewithin a scheduled machinefile/job allocation)



On Jul 30, 2009, at 11:49 AM, Adams, Brian M wrote:

Apologies if I'm being confusing; I'm probably trying to get atatypical use cases. M and N need not correspond to the number ofnodes/ppn nor ppn/nodes available. By node vs. slot doesn't muchmatter, as long as in the end I don't oversubscribe any node. Byslot might be good for efficiency in some apps, but I can't make ageneral case for it.
I think what you proposed offers some help in the case where N isan integer multiple of the number of available nodes, but perhapsnot in other cases. I must be missing something here, so insteadof being fully general, perhaps consider a specific case. Supposewe have 4 nodes, 8 ppn (32 slots is I think the ompi language). Imight want to schedule, for example
1. M=2 simultaneous N=16 processor jobs: Here I believe what yousuggested will work since N is a multiple of the available numberof nodes. I could use either npernode 4 or just bynode and I thinkget the same result: an even distribution of tasks. (similarapplies to, e.g., 8x4, 4x8)


Yes, agreed

2. M=16 simultaneous N=2 processor jobs: it seems if I use bynodeor npernode, I would end up with 16 processes on each of the firsttwo nodes (similar applies to, e.g., 32x1 or 10x3). Schedulingmany small jobs is a common problem for us.

3. M=3 simultaneous, N=10 processor jobs: I think we'd end up withthis distribution (where A-D are nodes and 0-2 jobs)
A 0 0 0 1 1 1 2 2 2
B 0 0 0 1 1 1 2 2 2
C 0 0   1 1   2 2
D 0 0   1 1   2 2
where A and B are over-subscribed and there are more than the twounused slots I'd expect in the whole allocation.
Again, I can manage all these via a script that partitions themachine files, just wondering which scenarios OpenMPI can manage.

Have you looked at the relative indexing in 1.3.3? You could specifyany of these in relative index terms, and have one "hostfile" thatwould support 16x2 operations. This would work then for anyallocation.


Your launch script could even just do it, something like this:

mpirun -n 2 -host +n0:1,+n1:1 app
mpirun -n 2 -host +n0:2,+n1:2 app

etc. Obviously, you could compute the relative indexing and juststick it in as required.

Likewise, you could use the new "seq" (sequential) mapper to achieveany desired layout, again utilizing relative indexing to avoidhaving to create a special hostfile for each run.

Note that in all cases, you can specify a -n N that will tell OMPIto only execute N processes, regardless of what is in the sequentialmapper file or -host.

If none of those work well, please let me know. I'm happy to createthe required capability as I'm sure LANL will use it too (know ofseveral similar cases here, but the current options seem okay forthem).


Thanks!
Brian

-----Original Message-----
From: users-boun...@open-mpi.org
[mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Wednesday, July 29, 2009 4:19 PM
To: Open MPI Users
Subject: Re: [OMPI users] Multiple mpiexec's within a job
(schedule within a scheduled machinefile/job allocation)

Oh my - that does take me back a long way! :-)

Do you need these processes to be mapped byslot (i.e., do you
care if the process ranks are sharing nodes)? If not, why not
add "-bynode" to your cmd line?

Alternatively, given the mapping you want, just do

mpirun -npernode 1 application.exe

This would launch one copy on each of your N nodes. So if you
fork M times, you'll wind up with the exact pattern you
wanted. And, as each one exits, you could immediately launch
a replacement without worrying about oversubscription.

Does that help?
Ralph

PS. we dropped that "persistent" operation - caused way too
many problems with cleanup and other things. :-)

On Jul 29, 2009, at 3:46 PM, Adams, Brian M wrote:

Hi Ralph (all),

I'm resurrecting this 2006 thread for a status check.  The

new 1.3.x

machinefile behavior is great (thanks!) -- I can use

machinefiles to

manage multiple simultaneous mpiruns within a single torque
allocation (where the hosts are a subset of $PBS_NODEFILE).
However, this requires some careful management of machinefiles.

I'm curious if OpenMPI now directly supports the behavior I need,
described in general in the quote below.  Specifically,

given a single

PBS/Torque allocation of M*N processors, I will run a

serial program

that will fork M times.  Each of the M forked processes
calls 'mpirun -np N application.exe' and blocks until completion.
This seems akin to the case you described of "mpiruns executed in
separate windows/prompts."

What I'd like to see is the M processes "tiled" across the

available

slots, so all M*N processors are used.  What I see instead

appears at

face value to be the first N resources being oversubscribed Mtimes.
Also, when one of the forked processes returns, I'd like to

be able to

spawn another and have its mpirun schedule on the resources

freed by

the previous one that exited.  Is any of this possible?

I tried starting an orted (1.3.3, roughly as you suggested

below), but

got this error:

orted --daemonize

[gy8:25871] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
runtime/orte_init.c at line 125

----------------------------------------------------------------------

---- It looks like orte_init failed for some reason; your parallel
process is likely to abort.  There are many reasons that a parallel
process can fail during orte_init; some of which are due to
configuration or environment problems.  This failure

appears to be an

internal failure; here's some additional information (which

may only

be relevant to an Open MPI developer):

orte_ess_base_select failed
--> Returned value Not found (-13) instead of ORTE_SUCCESS

----------------------------------------------------------------------

---- [gy8:25871] [[INVALID],INVALID] ORTE_ERROR_LOG: Not

found in file

orted/orted_main.c at line 323

I spared the debugging info as I'm not even sure this is a correct
invocation...

Thanks for any suggestions you can offer!
Brian
----------
Brian M. Adams, PhD (bria...@sandia.gov) Optimization and

Uncertainty

Quantification Sandia National Laboratories, Albuquerque, NM
http://www.sandia.gov/~briadam

From: Ralph Castain (rhc_at_[hidden])
Date: 2006-12-12 00:46:59

Hi Chris

Some of this is doable with today's code....and one of these
behaviors is not. :-(

Open MPI/OpenRTE can be run in "persistent" mode - this allows
multiple jobs to share the same allocation. This works much as you
describe (syntax is slightly different, of
course!) - the first mpirun will map using whatever mode was
requested, then the next mpirun will map starting from where the
first one left off.

I *believe* you can run each mpirun in the background.
However, I don't know if this has really been tested enough to
support such a claim. All testing that I know about to-date has
executed mpirun in the foreground - thus, your example

would execute

sequentially instead of in parallel.


I know people have tested multiple mpirun's operating in parallel
within a single allocation (i.e., persistent mode) where

the mpiruns

are executed in separate windows/prompts.
So I suspect you could do something like you describe -

just haven't

personally verified it.


Where we definitely differ is that Open MPI/RTE will *not* block
until resources are freed up from the prior mpiruns.
Instead, we will attempt to execute each mpirun immediately - and
will error out the one(s) that try to execute without sufficient
resources. I imagine we could provide the kind of "flow

control" you

describe, but I'm not sure when that might happen.


I am (in my copious free time...haha) working on an "orteboot"
program that will startup a virtual machine to make the persistent
mode of operation a little easier. For now, though, you

can do it by:



1. starting up the "server" using the following command:
orted --seed --persistent --scope public [--universe foo]


2. do your mpirun commands. They will automagically find

the "server"

and connect to it. If you specified a universe name when

starting the

server, then you must specify the same universe name on

your mpirun

commands.


When you are done, you will have to (unfortunately)

manually "kill"

the server and remove its session directory. I have a

program called

"ortehalt"
in the trunk that will do this cleanly for you, but it

isn't yet in

the release distributions. You are welcome to use it,

though, if you

are working with the trunk - I can't promise it is

bulletproof yet,

but it seems to be working.


Ralph


On 12/11/06 8:07 PM, "Maestas, Christopher Daniel"
<cdmaest_at_[hidden]>
wrote:

Hello,

Sometimes we have users that like to do from within a single job
(think schedule within an job scheduler allocation):
"mpiexec -n X myprog"
"mpiexec -n Y myprog2"
Does mpiexec within Open MPI keep track of the node list it

is using

if it binds to a particular scheduler?
For example with 4 nodes (2ppn SMP):
"mpiexec -n 2 myprog"
"mpiexec -n 2 myprog2"
"mpiexec -n 1 myprog3"
And assume this is by-slot allocation we would have the following
allocation:
node1 - processor1 - myprog
- processor2 - myprog
node2 - processor1 - myprog2
- processor2 - myprog2
And for a by-node allocation:
node1 - processor1 - myprog
- processor2 - myprog2
node2 - processor1 - myprog
- processor2 - myprog2

I think this is possible using ssh cause it shouldn't

really matter

how many times it spawns, but with something like torque it

would get

restricted to a max process launch of 4. We would want the third
mpiexec to block processes and eventually be run on the first
available node allocation that frees up from myprog or

myprog2 ....


For example for torque, we had to add the following to

osc mpiexec:

---
Finally, since only one mpiexec can be the master at a

time, if your

code setup requires that mpiexec exit to get a result, you

can start a

"dummy"
mpiexec first in your batch
job:

mpiexec -server

It runs no tasks itself but handles the connections of

other transient

mpiexec clients.
It will shut down cleanly when the batch job exits or you

may kill the

server explicitly.
If the server is killed with SIGTERM (or HUP or INT), it

will exit

with a status of zero if there were no clients connected at

the time.

If there were still clients using the server, the server

will kill all

their tasks, disconnect from the clients, and exit with status 1.
---

So a user ran:
mpiexec -server
mpiexec -n 2 myprog
mpiexec -n 2 myprog2
And the server kept track of the allocation ... I would

think that the

orted could do this?

Sorry if this sounds confusing ... But I'm sure it will

clear up with

any further responses I make. :-) -cdm


_______________________________________________
users mailing list
users_at_[hidden]
http://www.open-mpi.org/mailman/listinfo.cgi/users


_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] Multiple mpiexec's within a job (schedule within a scheduled machinefile/job allocation)

Reply via email to