Apologies if I'm being confusing; I'm probably trying to get at atypical use cases. M and N need not correspond to the number of nodes or the ppn available. Whether the mapping is by node or by slot doesn't much matter to me, as long as in the end no node is oversubscribed. By-slot might be better for efficiency in some applications, but I can't make a general case for it.
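To make the by-node/by-slot distinction concrete (this is just my understanding of the two mapping policies -- please correct me if I'm off), a single "mpirun -np 4 a.out" on 2 nodes with 2 ppn would place ranks as:

  by slot:  node1: ranks 0,1    node2: ranks 2,3
  by node:  node1: ranks 0,2    node2: ranks 1,3

Either way both nodes end up exactly full, which is all I'm after.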
I think what you proposed offers some help in the case where N is an integer multiple of the number of available nodes, but perhaps not in other cases. I must be missing something here, so instead of being fully general, consider a specific case. Suppose we have 4 nodes with 8 ppn (32 slots, I think, in OMPI language). I might want to schedule, for example:

1. M=2 simultaneous N=16-processor jobs: here I believe what you suggested will work, since N is a multiple of the number of available nodes. I could use either -npernode 4 or just -bynode and, I think, get the same result: an even distribution of tasks. (Similar applies to, e.g., 8x4 or 4x8.)

2. M=16 simultaneous N=2-processor jobs: it seems that with -bynode or -npernode I would end up with 16 processes on each of the first two nodes. (Similar applies to, e.g., 32x1 or 10x3.) Scheduling many small jobs is a common problem for us.

3. M=3 simultaneous N=10-processor jobs: I think we'd end up with this distribution (where A-D are nodes and 0-2 are jobs):

     A  0 0 0 1 1 1 2 2 2
     B  0 0 0 1 1 1 2 2 2
     C  0 0 1 1 2 2
     D  0 0 1 1 2 2

   Here A and B are oversubscribed (9 processes on 8 slots each), and C and D leave 2 slots idle apiece -- 4 unused slots rather than the 32 - 30 = 2 I'd expect across the whole allocation.

Again, I can manage all these via a script that partitions the machine files (see the P.S. below); I'm just wondering which scenarios Open MPI can manage natively.

Thanks!
Brian
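P.S. For concreteness, the partitioning script I have in mind is roughly the following -- a minimal sketch only: "application.exe" is a stand-in, M and N are hard-coded for the 3x10 case above, and it assumes $PBS_NODEFILE lists one line per slot, as Torque provides:

  #!/bin/bash
  # Partition $PBS_NODEFILE into M disjoint N-slot machine files and
  # launch one mpirun per chunk; no node is oversubscribed because
  # the chunks don't overlap.
  M=3
  N=10
  for ((i = 1; i <= M; i++)); do
      # lines (i-1)*N+1 .. i*N of the node file form chunk i
      head -n $((i * N)) "$PBS_NODEFILE" | tail -n $N > nodes.$i
      mpirun -np $N -machinefile nodes.$i application.exe &
  done
  wait   # block until all M mpiruns have finished

As each mpirun exits I can refill its slots by reusing its machine file, but all of that bookkeeping lives outside Open MPI, which is what I was hoping to avoid.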
> -----Original Message-----
> From: users-boun...@open-mpi.org
> [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Wednesday, July 29, 2009 4:19 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Multiple mpiexec's within a job
> (schedule within a scheduled machinefile/job allocation)
>
> Oh my - that does take me back a long way! :-)
>
> Do you need these processes to be mapped byslot (i.e., do you
> care if the process ranks are sharing nodes)? If not, why not
> add "-bynode" to your cmd line?
>
> Alternatively, given the mapping you want, just do
>
>   mpirun -npernode 1 application.exe
>
> This would launch one copy on each of your N nodes. So if you
> fork M times, you'll wind up with the exact pattern you wanted.
> And, as each one exits, you could immediately launch a
> replacement without worrying about oversubscription.
>
> Does that help?
> Ralph
>
> PS. We dropped that "persistent" operation - it caused way too
> many problems with cleanup and other things. :-)
>
> On Jul 29, 2009, at 3:46 PM, Adams, Brian M wrote:
>
> > Hi Ralph (all),
> >
> > I'm resurrecting this 2006 thread for a status check. The new 1.3.x
> > machinefile behavior is great (thanks!) -- I can use machinefiles
> > to manage multiple simultaneous mpiruns within a single Torque
> > allocation (where the hosts are a subset of $PBS_NODEFILE).
> > However, this requires some careful management of machinefiles.
> >
> > I'm curious whether Open MPI now directly supports the behavior I
> > need, described in general in the quote below. Specifically, given
> > a single PBS/Torque allocation of M*N processors, I will run a
> > serial program that forks M times. Each of the M forked processes
> > calls 'mpirun -np N application.exe' and blocks until completion.
> > This seems akin to the case you described of "mpiruns executed in
> > separate windows/prompts."
> >
> > What I'd like to see is the M processes "tiled" across the
> > available slots, so all M*N processors are used. What I see
> > instead appears at face value to be the first N resources being
> > oversubscribed M times.
> >
> > Also, when one of the forked processes returns, I'd like to be
> > able to spawn another and have its mpirun schedule on the
> > resources freed by the one that exited. Is any of this possible?
> >
> > I tried starting an orted (1.3.3, roughly as you suggested below),
> > but got this error:
> >
> >> orted --daemonize
> > [gy8:25871] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
> > runtime/orte_init.c at line 125
> > --------------------------------------------------------------------------
> > It looks like orte_init failed for some reason; your parallel
> > process is likely to abort. There are many reasons that a parallel
> > process can fail during orte_init; some of which are due to
> > configuration or environment problems. This failure appears to be
> > an internal failure; here's some additional information (which may
> > only be relevant to an Open MPI developer):
> >
> >   orte_ess_base_select failed
> >   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> > --------------------------------------------------------------------------
> > [gy8:25871] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
> > orted/orted_main.c at line 323
> >
> > I spared the debugging info as I'm not even sure this is a correct
> > invocation...
> >
> > Thanks for any suggestions you can offer!
> > Brian
> > ----------
> > Brian M. Adams, PhD (bria...@sandia.gov)
> > Optimization and Uncertainty Quantification
> > Sandia National Laboratories, Albuquerque, NM
> > http://www.sandia.gov/~briadam
> >
> >> From: Ralph Castain (rhc_at_[hidden])
> >> Date: 2006-12-12 00:46:59
> >>
> >> Hi Chris
> >>
> >> Some of this is doable with today's code... and one of these
> >> behaviors is not. :-(
> >>
> >> Open MPI/OpenRTE can be run in "persistent" mode - this allows
> >> multiple jobs to share the same allocation. This works much as
> >> you describe (syntax is slightly different, of course!) - the
> >> first mpirun will map using whatever mode was requested, then the
> >> next mpirun will map starting from where the first one left off.
> >>
> >> I *believe* you can run each mpirun in the background. However, I
> >> don't know if this has been tested enough to support such a
> >> claim. All the testing I know about to date has executed mpirun
> >> in the foreground - thus, your example would execute sequentially
> >> instead of in parallel.
> >>
> >> I know people have tested multiple mpiruns operating in parallel
> >> within a single allocation (i.e., persistent mode) where the
> >> mpiruns are executed in separate windows/prompts. So I suspect
> >> you could do something like you describe - I just haven't
> >> personally verified it.
> >>
> >> Where we definitely differ is that Open MPI/RTE will *not* block
> >> until resources are freed up from the prior mpiruns. Instead, we
> >> will attempt to execute each mpirun immediately - and will error
> >> out the one(s) that try to execute without sufficient resources.
> >> I imagine we could provide the kind of "flow control" you
> >> describe, but I'm not sure when that might happen.
> >>
> >> I am (in my copious free time... haha) working on an "orteboot"
> >> program that will start up a virtual machine to make the
> >> persistent mode of operation a little easier. For now, though,
> >> you can do it by:
> >>
> >> 1. starting up the "server" using the following command:
> >>    orted --seed --persistent --scope public [--universe foo]
> >> 2. doing your mpirun commands. They will automagically find the
> >>    "server" and connect to it. If you specified a universe name
> >>    when starting the server, then you must specify the same
> >>    universe name on your mpirun commands.
> >>
> >> When you are done, you will (unfortunately) have to manually
> >> "kill" the server and remove its session directory. I have a
> >> program called "ortehalt" in the trunk that will do this cleanly
> >> for you, but it isn't yet in the release distributions. You are
> >> welcome to use it, though, if you are working with the trunk - I
> >> can't promise it is bulletproof yet, but it seems to be working.
> >>
> >> Ralph
> >>
> >> On 12/11/06 8:07 PM, "Maestas, Christopher Daniel"
> >> <cdmaest_at_[hidden]> wrote:
> >>
> >>> Hello,
> >>>
> >>> Sometimes we have users that like to do the following from
> >>> within a single job (think scheduling within a job scheduler
> >>> allocation):
> >>>   "mpiexec -n X myprog"
> >>>   "mpiexec -n Y myprog2"
> >>> Does mpiexec within Open MPI keep track of the node list it is
> >>> using if it binds to a particular scheduler?
> >>> For example, with 4 nodes (2 ppn, SMP):
> >>>   "mpiexec -n 2 myprog"
> >>>   "mpiexec -n 2 myprog2"
> >>>   "mpiexec -n 1 myprog3"
> >>> Assuming by-slot allocation, we would have:
> >>>   node1 - processor1 - myprog
> >>>         - processor2 - myprog
> >>>   node2 - processor1 - myprog2
> >>>         - processor2 - myprog2
> >>> And for a by-node allocation:
> >>>   node1 - processor1 - myprog
> >>>         - processor2 - myprog2
> >>>   node2 - processor1 - myprog
> >>>         - processor2 - myprog2
> >>>
> >>> I think this is possible using ssh, because it shouldn't really
> >>> matter how many times it spawns, but with something like Torque
> >>> it would get restricted to a max process launch of 4. We would
> >>> want the third mpiexec to block and eventually run on the first
> >>> node allocation that frees up from myprog or myprog2...
> >>>
> >>> For example, for Torque we had to add the following to OSC
> >>> mpiexec:
> >>> ---
> >>> Finally, since only one mpiexec can be the master at a time, if
> >>> your code setup requires that mpiexec exit to get a result, you
> >>> can start a "dummy" mpiexec first in your batch job:
> >>>
> >>>   mpiexec -server
> >>>
> >>> It runs no tasks itself but handles the connections of other
> >>> transient mpiexec clients. It will shut down cleanly when the
> >>> batch job exits, or you may kill the server explicitly. If the
> >>> server is killed with SIGTERM (or HUP or INT), it will exit with
> >>> a status of zero if there were no clients connected at the time.
> >>> If there were still clients using the server, the server will
> >>> kill all their tasks, disconnect from the clients, and exit with
> >>> status 1.
> >>> ---
> >>>
> >>> So a user ran:
> >>>   mpiexec -server
> >>>   mpiexec -n 2 myprog
> >>>   mpiexec -n 2 myprog2
> >>> And the server kept track of the allocation... I would think
> >>> that the orted could do this?
> >>>
> >>> Sorry if this sounds confusing... but I'm sure it will clear up
> >>> with any further responses I make. :-)
> >>> -cdm