Apologies if I'm being confusing; I'm probably trying to get at atypical use cases. M and N need not correspond to the number of nodes or the ppn available. Whether the mapping is by node or by slot doesn't much matter to me, as long as in the end no node is oversubscribed. By-slot might be better for efficiency in some applications, but I can't make a general case for it.
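To make the by-node/by-slot distinction concrete (this is just my understanding of the two mapping policies -- please correct me if I'm off), a single "mpirun -np 4 a.out" on 2 nodes with 2 ppn would place ranks as:

  by slot:  node1: ranks 0,1    node2: ranks 2,3
  by node:  node1: ranks 0,2    node2: ranks 1,3

Either way both nodes end up exactly full, which is all I'm after.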
I think what you proposed offers some help in the case where N is an integer multiple of the number of available nodes, but perhaps not in other cases. I must be missing something here, so instead of being fully general, consider a specific case. Suppose we have 4 nodes with 8 ppn (32 slots, I think, in OMPI language). I might want to schedule, for example:

1. M=2 simultaneous N=16-processor jobs: here I believe what you suggested will work, since N is a multiple of the number of available nodes. I could use either -npernode 4 or just -bynode and, I think, get the same result: an even distribution of tasks. (Similar applies to, e.g., 8x4 or 4x8.)

2. M=16 simultaneous N=2-processor jobs: it seems that with -bynode or -npernode I would end up with 16 processes on each of the first two nodes. (Similar applies to, e.g., 32x1 or 10x3.) Scheduling many small jobs is a common problem for us.

3. M=3 simultaneous N=10-processor jobs: I think we'd end up with this distribution (where A-D are nodes and 0-2 are jobs):

     A  0 0 0 1 1 1 2 2 2
     B  0 0 0 1 1 1 2 2 2
     C  0 0 1 1 2 2
     D  0 0 1 1 2 2

   Here A and B are oversubscribed (9 processes on 8 slots each), and C and D leave 2 slots idle apiece -- 4 unused slots rather than the 32 - 30 = 2 I'd expect across the whole allocation.

Again, I can manage all these via a script that partitions the machine files (see the P.S. below); I'm just wondering which scenarios Open MPI can manage natively.

Thanks!
Brian
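P.S. For concreteness, the partitioning script I have in mind is roughly the following -- a minimal sketch only: "application.exe" is a stand-in, M and N are hard-coded for the 3x10 case above, and it assumes $PBS_NODEFILE lists one line per slot, as Torque provides:

  #!/bin/bash
  # Partition $PBS_NODEFILE into M disjoint N-slot machine files and
  # launch one mpirun per chunk; no node is oversubscribed because
  # the chunks don't overlap.
  M=3
  N=10
  for ((i = 1; i <= M; i++)); do
      # lines (i-1)*N+1 .. i*N of the node file form chunk i
      head -n $((i * N)) "$PBS_NODEFILE" | tail -n $N > nodes.$i
      mpirun -np $N -machinefile nodes.$i application.exe &
  done
  wait   # block until all M mpiruns have finished

As each mpirun exits I can refill its slots by reusing its machine file, but all of that bookkeeping lives outside Open MPI, which is what I was hoping to avoid.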
> -----Original Message-----
> From: users-boun...@open-mpi.org
> [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Wednesday, July 29, 2009 4:19 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Multiple mpiexec's within a job
> (schedule within a scheduled machinefile/job allocation)
>
> Oh my - that does take me back a long way! :-)
>
> Do you need these processes to be mapped byslot (i.e., do you
> care if the process ranks are sharing nodes)? If not, why not
> add "-bynode" to your cmd line?
>
> Alternatively, given the mapping you want, just do
>
>   mpirun -npernode 1 application.exe
>
> This would launch one copy on each of your N nodes. So if you
> fork M times, you'll wind up with the exact pattern you wanted.
> And, as each one exits, you could immediately launch a
> replacement without worrying about oversubscription.
>
> Does that help?
> Ralph
>
> PS. We dropped that "persistent" operation - it caused way too
> many problems with cleanup and other things. :-)
>
> On Jul 29, 2009, at 3:46 PM, Adams, Brian M wrote:
>
> > Hi Ralph (all),
> >
> > I'm resurrecting this 2006 thread for a status check. The new 1.3.x
> > machinefile behavior is great (thanks!) -- I can use machinefiles
> > to manage multiple simultaneous mpiruns within a single Torque
> > allocation (where the hosts are a subset of $PBS_NODEFILE).
> > However, this requires some careful management of machinefiles.
> >
> > I'm curious whether Open MPI now directly supports the behavior I
> > need, described in general in the quote below. Specifically, given
> > a single PBS/Torque allocation of M*N processors, I will run a
> > serial program that forks M times. Each of the M forked processes
> > calls 'mpirun -np N application.exe' and blocks until completion.
> > This seems akin to the case you described of "mpiruns executed in
> > separate windows/prompts."
> >
> > What I'd like to see is the M processes "tiled" across the
> > available slots, so all M*N processors are used. What I see
> > instead appears at face value to be the first N resources being
> > oversubscribed M times.
> >
> > Also, when one of the forked processes returns, I'd like to be
> > able to spawn another and have its mpirun schedule on the
> > resources freed by the one that exited. Is any of this possible?
> >
> > I tried starting an orted (1.3.3, roughly as you suggested below),
> > but got this error:
> >
> >> orted --daemonize
> > [gy8:25871] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
> > runtime/orte_init.c at line 125
> > --------------------------------------------------------------------------
> > It looks like orte_init failed for some reason; your parallel
> > process is likely to abort. There are many reasons that a parallel
> > process can fail during orte_init; some of which are due to
> > configuration or environment problems. This failure appears to be
> > an internal failure; here's some additional information (which may
> > only be relevant to an Open MPI developer):
> >
> >   orte_ess_base_select failed
> >   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> > --------------------------------------------------------------------------
> > [gy8:25871] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
> > orted/orted_main.c at line 323
> >
> > I spared the debugging info as I'm not even sure this is a correct
> > invocation...
> >
> > Thanks for any suggestions you can offer!
> > Brian
> > ----------
> > Brian M. Adams, PhD (bria...@sandia.gov)
> > Optimization and Uncertainty Quantification
> > Sandia National Laboratories, Albuquerque, NM
> > http://www.sandia.gov/~briadam
> >
> >> From: Ralph Castain (rhc_at_[hidden])
> >> Date: 2006-12-12 00:46:59
> >>
> >> Hi Chris
> >>
> >> Some of this is doable with today's code... and one of these
> >> behaviors is not. :-(
> >>
> >> Open MPI/OpenRTE can be run in "persistent" mode - this allows
> >> multiple jobs to share the same allocation. This works much as
> >> you describe (syntax is slightly different, of course!) - the
> >> first mpirun will map using whatever mode was requested, then the
> >> next mpirun will map starting from where the first one left off.
> >>
> >> I *believe* you can run each mpirun in the background. However, I
> >> don't know if this has been tested enough to support such a
> >> claim. All the testing I know about to date has executed mpirun
> >> in the foreground - thus, your example would execute sequentially
> >> instead of in parallel.
> >>
> >> I know people have tested multiple mpiruns operating in parallel
> >> within a single allocation (i.e., persistent mode) where the
> >> mpiruns are executed in separate windows/prompts. So I suspect
> >> you could do something like you describe - I just haven't
> >> personally verified it.
> >>
> >> Where we definitely differ is that Open MPI/RTE will *not* block
> >> until resources are freed up from the prior mpiruns. Instead, we
> >> will attempt to execute each mpirun immediately - and will error
> >> out the one(s) that try to execute without sufficient resources.
> >> I imagine we could provide the kind of "flow control" you
> >> describe, but I'm not sure when that might happen.
> >>
> >> I am (in my copious free time... haha) working on an "orteboot"
> >> program that will start up a virtual machine to make the
> >> persistent mode of operation a little easier. For now, though,
> >> you can do it by:
> >>
> >> 1. starting up the "server" using the following command:
> >>    orted --seed --persistent --scope public [--universe foo]
> >> 2. doing your mpirun commands. They will automagically find the
> >>    "server" and connect to it. If you specified a universe name
> >>    when starting the server, then you must specify the same
> >>    universe name on your mpirun commands.
> >>
> >> When you are done, you will (unfortunately) have to manually
> >> "kill" the server and remove its session directory. I have a
> >> program called "ortehalt" in the trunk that will do this cleanly
> >> for you, but it isn't yet in the release distributions. You are
> >> welcome to use it, though, if you are working with the trunk - I
> >> can't promise it is bulletproof yet, but it seems to be working.
> >>
> >> Ralph
> >>
> >> On 12/11/06 8:07 PM, "Maestas, Christopher Daniel"
> >> <cdmaest_at_[hidden]> wrote:
> >>
> >>> Hello,
> >>>
> >>> Sometimes we have users that like to do the following from
> >>> within a single job (think scheduling within a job scheduler
> >>> allocation):
> >>>   "mpiexec -n X myprog"
> >>>   "mpiexec -n Y myprog2"
> >>> Does mpiexec within Open MPI keep track of the node list it is
> >>> using if it binds to a particular scheduler?
> >>> For example, with 4 nodes (2 ppn, SMP):
> >>>   "mpiexec -n 2 myprog"
> >>>   "mpiexec -n 2 myprog2"
> >>>   "mpiexec -n 1 myprog3"
> >>> Assuming by-slot allocation, we would have:
> >>>   node1 - processor1 - myprog
> >>>         - processor2 - myprog
> >>>   node2 - processor1 - myprog2
> >>>         - processor2 - myprog2
> >>> And for a by-node allocation:
> >>>   node1 - processor1 - myprog
> >>>         - processor2 - myprog2
> >>>   node2 - processor1 - myprog
> >>>         - processor2 - myprog2
> >>>
> >>> I think this is possible using ssh, because it shouldn't really
> >>> matter how many times it spawns, but with something like Torque
> >>> it would get restricted to a max process launch of 4. We would
> >>> want the third mpiexec to block and eventually run on the first
> >>> node allocation that frees up from myprog or myprog2...
> >>>
> >>> For example, for Torque we had to add the following to OSC
> >>> mpiexec:
> >>> ---
> >>> Finally, since only one mpiexec can be the master at a time, if
> >>> your code setup requires that mpiexec exit to get a result, you
> >>> can start a "dummy" mpiexec first in your batch job:
> >>>
> >>>   mpiexec -server
> >>>
> >>> It runs no tasks itself but handles the connections of other
> >>> transient mpiexec clients. It will shut down cleanly when the
> >>> batch job exits, or you may kill the server explicitly. If the
> >>> server is killed with SIGTERM (or HUP or INT), it will exit with
> >>> a status of zero if there were no clients connected at the time.
> >>> If there were still clients using the server, the server will
> >>> kill all their tasks, disconnect from the clients, and exit with
> >>> status 1.
> >>> ---
> >>>
> >>> So a user ran:
> >>>   mpiexec -server
> >>>   mpiexec -n 2 myprog
> >>>   mpiexec -n 2 myprog2
> >>> And the server kept track of the allocation... I would think
> >>> that the orted could do this?
> >>>
> >>> Sorry if this sounds confusing... but I'm sure it will clear up
> >>> with any further responses I make. :-)
> >>> -cdm