Re: [OMPI users] Pernode request
Hi Chris

I have also implemented --npernode N now as well - it is in the trunk as of r12826.

The testing you show below using mpiexec really doesn't tell us the story - we need to know the rank of the various processes (and unfortunately, hostname just tells us the host). There is no way to tell the rank from just the order in which the host names are printed to the screen. I have a test program in our distribution (see orte/test/mpi/hello_hostname.c) that will output both the rank and hostname - it would give us the required info.

Regardless, I think it would make sense to provide the flexibility you describe. What if we selected this "by-X-slot" style by using the --npernode option and allowing the user to combine it with the existing "--byslot" option? This would still launch N procs/node, but with the ranking done by slot. If the user doesn't specify "--byslot", then we default to assigning ranks by node.

Make sense? If so, I can probably have that going before the holiday.

Ralph

On 12/11/06 7:51 PM, "Maestas, Christopher Daniel" wrote:

> Hello Ralph,
>
> This is great news! Thanks for doing this. I will try and get around to it soon before the holiday break.
>
> The allocation scheme always seems to get to me. From what you describe, that is how I would have seen it. As I've gotten to know osc mpiexec through the years I think they like to do a first-fit approach, but now that I test it I think the feature needs more testing or I'm not testing appropriately :-)
> ---
> $ /apps/x86_64/system/mpiexec-0.82/bin/mpiexec -comm=none -npernode 2 grep HOSTNAME /etc/sysconfig/network
> HOSTNAME="an56"
> HOSTNAME="an56"
> HOSTNAME="an55"
> HOSTNAME="an53"
> HOSTNAME="an54"
> HOSTNAME="an55"
> HOSTNAME="an53"
> HOSTNAME="an54"
> ---
>
> I guess I would wonder if it would be possible to switch from the method you suggest and also allow a "by-X-slot" style of launch, where for npernode = X and N nodes you would see:
> proc1 - node1
> proc2 - node1
> ...
> proc(X*1) - node1
> ...
> proc(X+1) - node2
> proc(X+2) - node2
> ...
> proc(X*2) - node2
> ...
> proc(N*X-(X-0)) - nodeN
> proc(N*X-(X-1)) - nodeN
> ...
> proc(X*N-1) - nodeN
> proc(X*N) - nodeN
>
> I think that's how to best describe it. Basically you load until there are X processes on each node before moving to the next. This may prove to be more challenging, and I can understand if it would not be deemed "worthy." :-)
>
> -cdm
>
>> -----Original Message-----
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
>> Sent: Monday, December 11, 2006 5:41 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] Pernode request
>>
>> Hi Chris
>>
>> Okay, we have modified the pernode behavior as you requested (on the trunk as of r12821) - give it a shot and see if that does it. I have not yet added the npernode option, but hope to get that soon.
>>
>> I have a question for you about the npernode option. I am assuming that you want n procs/node, but that you want it mapped by NODE. For example, proc 0 goes on the first node, proc 1 goes on the second node, etc. until I get one on each node; then I wrap back to the beginning and do this again until I get the specified number of procs on each node.
>>
>> Correct?
>> Ralph
>>
>>> Ralph,
>>>
>>> I agree with what you stated in points 1-4. That is what we are looking for.
>>> I understand your point now about the non-MPI users too. :-)
>>>
>>> Thanks,
>>> -cdm
>>
>> -----Original Message-----
>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Ralph Castain
>> Sent: Wednesday, November 29, 2006 8:01 AM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] Pernode request
>>
>> Hi Chris
>>
>> Thanks for the patience and the clarification - much appreciated. In fact, I have someone that needs to learn more about the code base, so I think I will assign this to him. At the least, he will have to learn a lot more about the mapper!
>>
>> I have no problem with modifying the pernode behavior to deal with the case of someone specifying -np as you describe. It would be relatively easy to check. As I understand it, you want the behavior to be:
>>
>> 1. if no -np is specified, launch one proc/node across the entire allocation
>> 2. if -np n is specified AND n is less than the number of allocated nodes, then launch one proc/node up to the specified number. Of course, this is identical to just doing -np n -bynode, but that's immaterial. ;-)
>> 3. if -np n is specified AND n is greater than the number of allocated nodes, error message and exit
>> 4. add a -npernode n option that launches n procs/node, subject to the same tests above.
>>
>> Can you confirm?
>>
>> Finally, I think you misunderstood my comment about th
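A sketch of how the two mappings Ralph proposes would differ on a 4-node allocation with 2 procs per node. The first command assumes a trunk build at or after r12826; the --byslot combination in the second command is only proposed above and may not exist yet; hello_hostname is the test program mentioned in Ralph's note:

---
# default: ranks assigned by node
#   node1 gets ranks 0 and 4, node2 gets 1 and 5, node3 gets 2 and 6, node4 gets 3 and 7
$ mpirun --npernode 2 ./hello_hostname

# proposed "by-X-slot" ranking via the existing --byslot option
#   node1 gets ranks 0 and 1, node2 gets 2 and 3, node3 gets 4 and 5, node4 gets 6 and 7
$ mpirun --npernode 2 --byslot ./hello_hostname
---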
Re: [OMPI users] Multiple mpiexec's within a job (schedule within a scheduled machinefile/job allocation)
Hi Chris

Some of this is doable with today's code, and one of these behaviors is not. :-(

Open MPI/OpenRTE can be run in "persistent" mode - this allows multiple jobs to share the same allocation. This works much as you describe (syntax is slightly different, of course!) - the first mpirun will map using whatever mode was requested, then the next mpirun will map starting from where the first one left off.

I *believe* you can run each mpirun in the background. However, I don't know if this has really been tested enough to support such a claim. All testing that I know about to date has executed mpirun in the foreground - thus, your example would execute sequentially instead of in parallel. I know people have tested multiple mpiruns operating in parallel within a single allocation (i.e., persistent mode) where the mpiruns are executed in separate windows/prompts. So I suspect you could do something like you describe - I just haven't personally verified it.

Where we definitely differ is that Open MPI/RTE will *not* block until resources are freed up from the prior mpiruns. Instead, we will attempt to execute each mpirun immediately - and will error out the one(s) that try to execute without sufficient resources. I imagine we could provide the kind of "flow control" you describe, but I'm not sure when that might happen.

I am (in my copious free time...haha) working on an "orteboot" program that will start up a virtual machine to make the persistent mode of operation a little easier. For now, though, you can do it by:

1. starting up the "server" using the following command:
   orted --seed --persistent --scope public [--universe foo]

2. doing your mpirun commands. They will automagically find the "server" and connect to it. If you specified a universe name when starting the server, then you must specify the same universe name on your mpirun commands.

When you are done, you will have to (unfortunately) manually "kill" the server and remove its session directory. I have a program called "ortehalt" in the trunk that will do this cleanly for you, but it isn't yet in the release distributions. You are welcome to use it, though, if you are working with the trunk - I can't promise it is bulletproof yet, but it seems to be working.

Ralph

On 12/11/06 8:07 PM, "Maestas, Christopher Daniel" wrote:

> Hello,
>
> Sometimes we have users that like to do the following from within a single job (think schedule within a job scheduler allocation):
> "mpiexec -n X myprog"
> "mpiexec -n Y myprog2"
> Does mpiexec within Open MPI keep track of the node list it is using if it binds to a particular scheduler?
> For example with 4 nodes (2ppn SMP):
> "mpiexec -n 2 myprog"
> "mpiexec -n 2 myprog2"
> "mpiexec -n 1 myprog3"
> Assuming this is by-slot allocation, we would have the following allocation:
> node1 - processor1 - myprog
>       - processor2 - myprog
> node2 - processor1 - myprog2
>       - processor2 - myprog2
> And for a by-node allocation:
> node1 - processor1 - myprog
>       - processor2 - myprog2
> node2 - processor1 - myprog
>       - processor2 - myprog2
>
> I think this is possible using ssh because it shouldn't really matter how many times it spawns, but with something like torque it would get restricted to a max process launch of 4. We would want the third mpiexec to block its processes and eventually be run on the first available node allocation that frees up from myprog or myprog2.
>
> For example, for torque we had to add the following to osc mpiexec:
> ---
>    Finally, since only one mpiexec can be the master at a time, if your code setup requires
>    that mpiexec exit to get a result, you can start a "dummy" mpiexec first in your batch
>    job:
>
>        mpiexec -server
>
>    It runs no tasks itself but handles the connections of other transient mpiexec clients.
>    It will shut down cleanly when the batch job exits or you may kill the server explicitly.
>    If the server is killed with SIGTERM (or HUP or INT), it will exit with a status of zero
>    if there were no clients connected at the time. If there were still clients using the
>    server, the server will kill all their tasks, disconnect from the clients, and exit with
>    status 1.
> ---
>
> So a user ran:
> mpiexec -server
> mpiexec -n 2 myprog
> mpiexec -n 2 myprog2
> And the server kept track of the allocation ... I would think that the orted could do this?
>
> Sorry if this sounds confusing ... But I'm sure it will clear up with any further responses I make. :-)
> -cdm
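A sketch of the persistent-mode workflow described above, assuming a trunk build. The universe name "foo" and the program names are placeholders, the --universe option on mpirun is an assumption about how the universe name is passed, and running the mpiruns in the background is exactly the part Ralph notes has not been well tested:

---
# start the persistent daemon ("server") once for the allocation
$ orted --seed --persistent --scope public --universe foo

# launch jobs against it; each mpirun must name the same universe
$ mpirun --universe foo -np 2 ./myprog &
$ mpirun --universe foo -np 2 ./myprog2 &
$ wait

# when done, shut the server down (ortehalt is trunk-only) and
# remove its session directory by hand
$ kill <orted-pid>
---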
Re: [OMPI users] Pernode request
Ralph,

I figured I should have run an MPI program ... here's what it does (seems to be by-X-slot style):
---
$ /apps/x86_64/system/mpiexec-0.82/bin/mpiexec -npernode 2 mpi_hello
Hello, I am node an41 with rank 0
Hello, I am node an41 with rank 1
Hello, I am node an39 with rank 4
Hello, I am node an40 with rank 2
Hello, I am node an38 with rank 6
Hello, I am node an39 with rank 5
Hello, I am node an38 with rank 7
Hello, I am node an40 with rank 3
---
What you describe makes sense to me.
Thanks!

> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Monday, December 11, 2006 10:27 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Pernode request
>
> Hi Chris
>
> I have also implemented --npernode N now as well - it is in the trunk as of r12826.
>
> The testing you show below using mpiexec really doesn't tell us the story - we need to know the rank of the various processes (and unfortunately, hostname just tells us the host). There is no way to tell the rank from just the order in which the host names are printed to the screen. I have a test program in our distribution (see orte/test/mpi/hello_hostname.c) that will output both the rank and hostname - it would give us the required info.
>
> Regardless, I think it would make sense to provide the flexibility you describe. What if we selected this "by-X-slot" style by using the --npernode option and allowing the user to combine it with the existing "--byslot" option? This would still launch N procs/node, but with the ranking done by slot. If the user doesn't specify "--byslot", then we default to assigning ranks by node.
>
> Make sense? If so, I can probably have that going before the holiday.
>
> Ralph
>
> On 12/11/06 7:51 PM, "Maestas, Christopher Daniel" wrote:
>
> > Hello Ralph,
> >
> > This is great news! Thanks for doing this. I will try and get around to it soon before the holiday break.
> >
> > The allocation scheme always seems to get to me. From what you describe, that is how I would have seen it. As I've gotten to know osc mpiexec through the years I think they like to do a first-fit approach, but now that I test it I think the feature needs more testing or I'm not testing appropriately :-)
> > ---
> > $ /apps/x86_64/system/mpiexec-0.82/bin/mpiexec -comm=none -npernode 2 grep HOSTNAME /etc/sysconfig/network
> > HOSTNAME="an56"
> > HOSTNAME="an56"
> > HOSTNAME="an55"
> > HOSTNAME="an53"
> > HOSTNAME="an54"
> > HOSTNAME="an55"
> > HOSTNAME="an53"
> > HOSTNAME="an54"
> > ---
> >
> > I guess I would wonder if it would be possible to switch from the method you suggest and also allow a "by-X-slot" style of launch, where for npernode = X and N nodes you would see:
> > proc1 - node1
> > proc2 - node1
> > ...
> > proc(X*1) - node1
> > ...
> > proc(X+1) - node2
> > proc(X+2) - node2
> > ...
> > proc(X*2) - node2
> > ...
> > proc(N*X-(X-0)) - nodeN
> > proc(N*X-(X-1)) - nodeN
> > ...
> > proc(X*N-1) - nodeN
> > proc(X*N) - nodeN
> >
> > I think that's how to best describe it. Basically you load until there are X processes on each node before moving to the next. This may prove to be more challenging, and I can understand if it would not be deemed "worthy." :-)
> >
> > -cdm
> >
> >> -----Original Message-----
> >> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> >> Sent: Monday, December 11, 2006 5:41 PM
> >> To: Open MPI Users
> >> Subject: Re: [OMPI users] Pernode request
> >>
> >> Hi Chris
> >>
> >> Okay, we have modified the pernode behavior as you requested (on the trunk as of r12821) - give it a shot and see if that does it. I have not yet added the npernode option, but hope to get that soon.
> >>
> >> I have a question for you about the npernode option. I am assuming that you want n procs/node, but that you want it mapped by NODE. For example, proc 0 goes on the first node, proc 1 goes on the second node, etc. until I get one on each node; then I wrap back to the beginning and do this again until I get the specified number of procs on each node.
> >>
> >> Correct?
> >> Ralph
> >>
> >>> Ralph,
> >>>
> >>> I agree with what you stated in points 1-4. That is what we are looking for.
> >>> I understand your point now about the non-MPI users too. :-)
> >>>
> >>> Thanks,
> >>> -cdm
> >>
> >> -----Original Message-----
> >> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Ralph Castain
> >> Sent: Wednesday, November 29, 2006 8:01 AM
> >> To: Open MPI Users
> >> Subject: Re: [OMPI users] Pernode request
> >>
> >> Hi Chris
> >>
> >> Thanks for the patience and the clarification - much appreciated. In fact, I h
Re: [OMPI users] mpool_gm_module error
On Dec 11, 2006, at 4:04 PM, Reese Faucette wrote:

>> GM: gm_register_memory will be able to lock XXX pages (YYY MBytes)
>> Is there a way to tell GM to pull more memory from the system?
>
> GM reserves all IOMMU space that the OS is willing to give it, so what is needed is a way to tell the OS and/or machine to allow a bigger chunk of IOMMU space to be grabbed by GM. Note that IOMMU space is not the same as the amount of memory a process can use; it is the amount of DMA-able memory that a driver can have "registered" at any one time. MPI then has to manage this space much like a VMM has to manage physical memory vs. page space.
> -reese

Well, I have had no luck finding a way to up the amount the system will allow GM to use. What is a recommended solution? Is this even a problem in most cases? Or am I encountering a corner case?
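One way to see how much registerable memory GM was actually granted is to look for the driver message quoted above in the kernel log after the gm module loads (a sketch; the exact wording and location of the message may vary by GM version):

---
$ dmesg | grep -i gm_register_memory
GM: gm_register_memory will be able to lock XXX pages (YYY MBytes)
---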
Re: [OMPI users] mpool_gm_module error
> Well, I have had no luck finding a way to up the amount the system will allow GM to use. What is a recommended solution? Is this even a problem in most cases? Or am I encountering a corner case?

Upping the limit was not what I'm suggesting as a fix - just pointing out that it is kind of low. Even with a low IOMMU limit, a fully working OMPI or MPICH-GM should still work.

Since you are running 1 thread per CPU (== 2 total), it is possible (likely) that the 1st thread is grabbing all the available registerable memory, leaving not enough for the second thread to even start. I recommend you try the "mpool_rdma_rcache_size_limit" parameter that Gleb mentions - the equivalent setting is used in MPICH-GM in similar situations. Set this to about 180 MB and run with that.

Gleb - I assume that when registration needs exceed "mpool_rdma_rcache_size_limit", previously registered memory is unregistered, much as virtual memory is swapped out?

regards,
-reese
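A sketch of how that limit might be passed on the command line, assuming the patch from Gleb's tmp branch is applied (it is not on the trunk at the time of this thread). The parameter is assumed to take a value in bytes - 188743680 is 180 MB - and the application name is a placeholder:

---
$ mpirun -np 2 --mca mpool_rdma_rcache_size_limit 188743680 ./my_gm_app
---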
Re: [OMPI users] mpool_gm_module error
On Tue, Dec 12, 2006 at 12:58:00PM -0800, Reese Faucette wrote:
> > Well, I have had no luck finding a way to up the amount the system will allow GM to use. What is a recommended solution? Is this even a problem in most cases? Or am I encountering a corner case?
>
> Upping the limit was not what I'm suggesting as a fix - just pointing out that it is kind of low. Even with a low IOMMU limit, a fully working OMPI or MPICH-GM should still work.
>
> Since you are running 1 thread per CPU (== 2 total), it is possible (likely) that the 1st thread is grabbing all the available registerable memory, leaving not enough for the second thread to even start. I recommend you try the "mpool_rdma_rcache_size_limit" parameter that Gleb mentions - the equivalent setting is used in MPICH-GM in similar situations. Set this to about 180 MB and run with that.
>
> Gleb - I assume that when registration needs exceed "mpool_rdma_rcache_size_limit", previously registered memory is unregistered, much as virtual memory is swapped out?
>
If previously registered memory is in use, then registration returns an error to the upper layer and the operation is retried later. Otherwise, unused memory is unregistered.

The code for mpool_rdma_rcache_size_limit is not on the trunk yet. It is on the tmp branch /tmp/gleb-mpool; I don't know if /tmp is open to everyone. If not, I can send the patch.

--
Gleb.
Re: [OMPI users] mpool_gm_module error
On Dec 12, 2006, at 4:24 PM, Gleb Natapov wrote:

> On Tue, Dec 12, 2006 at 12:58:00PM -0800, Reese Faucette wrote:
> > > Well, I have had no luck finding a way to up the amount the system will allow GM to use. What is a recommended solution? Is this even a problem in most cases? Or am I encountering a corner case?
> >
> > Upping the limit was not what I'm suggesting as a fix - just pointing out that it is kind of low. Even with a low IOMMU limit, a fully working OMPI or MPICH-GM should still work.
> >
> > Since you are running 1 thread per CPU (== 2 total), it is possible (likely) that the 1st thread is grabbing all the available registerable memory, leaving not enough for the second thread to even start. I recommend you try the "mpool_rdma_rcache_size_limit" parameter that Gleb mentions - the equivalent setting is used in MPICH-GM in similar situations. Set this to about 180 MB and run with that.
> >
> > Gleb - I assume that when registration needs exceed "mpool_rdma_rcache_size_limit", previously registered memory is unregistered, much as virtual memory is swapped out?
> >
> If previously registered memory is in use, then registration returns an error to the upper layer and the operation is retried later. Otherwise, unused memory is unregistered.
>
> The code for mpool_rdma_rcache_size_limit is not on the trunk yet. It is on the tmp branch /tmp/gleb-mpool; I don't know if /tmp is open to everyone. If not, I can send the patch.

If you could mail me the patch, I'll try it out.

> --
> Gleb.
[OMPI users] CfP Workshops on Virtualization/XEN in HPC Cluster and Grid Computing Environments (XHPC'07, VHPC'07)
Apologies if you received multiple copies of this message.

===

CALL FOR PAPERS

Workshop on Virtualization/Xen in High-Performance Cluster and Grid Computing (XHPC'07)
as part of The 16th IEEE International Symposium on High Performance Distributed Computing (HPDC-16), June 27-29, 2007, Monterey, California.

and

Workshop on Virtualization in High-Performance Cluster and Grid Computing (VHPC'07)
as part of The 13th International European Conference on Parallel and Distributed Computing (Euro-Par 2007), August 28-31, 2007, IRISA, Rennes, France.

===

Date: June 27-29, 2007 and August 28-31, 2007

HPDC-16: http://www.isi.edu/hpdc2007/
Euro-Par 2007: http://europar2007.irisa.fr/
Workshop URL: http://xhpc.wu-wien.ac.at/
(due date for both: January 26, Abstracts Jan 12)

Scope:

Virtual machine monitors are reaching wide-spread adoption in a variety of operating systems as well as scientific, educational and operational usage areas. With their low overhead, hypervisors allow for concurrently running large numbers of virtual machines, providing each with encapsulation, isolation and, in the case of Xen, network-wide CPU migratability. VMMs offer a network-wide abstraction layer of individual machine resources to OS environments, thereby opening up whole new cluster and grid high-performance computing (HPC) architecture and HPC service options. With VMMs finding applications in HPC environments, these workshops aim to bring together researchers and practitioners active on virtualization in distributed and high-performance cluster and grid computing environments.

The workshops will be one day in length, composed of 20 min paper presentations, each followed by a 10 min discussion section. Presentations may be accompanied by interactive demonstrations. The workshop will end with a 30 min panel discussion by presenters.

TOPICS

Topics include, but are not limited to, the following subject matters:

- Virtualization/Xen in cluster and grid environments
- Workload characterizations for VM-based clusters
- VM cluster and grid architectures
- Linux hypervisors
- Cluster reliability, fault-tolerance, and security
- Compute job entry and scheduling
- Compute workload load leveling
- Cluster and grid filesystems for VMs
- VMMs, VMs and QoS guarantees
- Research and education use cases
- VM cluster distribution algorithms
- MPI, PVM on virtual machines
- System sizing
- Hardware support for virtualization
- High-speed interconnects in Xen/hypervisors
- Hypervisor extensions and utilities for cluster and grid computing
- Network architectures for VM-based clusters
- VMMs/hypervisors on large SMP machines
- Performance models
- Performance management and tuning of hosts and guest VMs
- VMM performance tuning on various load types
- Xen/other VMM cluster/grid tools
- Device access from VMs
- Management and deployment of clusters and grid environments with VMs

PAPER SUBMISSION

Papers submitted to each workshop will be reviewed by at least two members of the program committee and external reviewers. Submissions should include abstract, key words, and the e-mail address of the corresponding author, and must not exceed 10 pages, including tables and figures. Submission of a paper should be regarded as a commitment that, should the paper be accepted, at least one of the authors will register and attend the conference to present the work. An award for best student paper will be given. Papers can only be submitted to one workshop, not both at the same time.

XHPC'07: Accepted papers will be included in proceedings published by the IEEE Computer Society Press. The IEEE 8 1/2 x 11 CS format must be used. Submission can be in either PDF or PostScript.
ftp://pubftp.computer.org/Press/Outgoing/proceedings/

VHPC'07: Accepted papers will be included in proceedings published in the Springer LNCS series. The format must be according to the Springer LNCS style and preferably be in LaTeX or FrameMaker, although submissions using the LNCS Word template are also possible. Initial submissions are in PDF; accepted papers will be requested to provide source files.
http://www.springer.de/comp/lncs/authors.html

Submission link: https://edas.info/newPaper.php?c=5189&submit=0
(add either XHPC: or VHPC: in the title of the paper to designate the target workshop)

IMPORTANT DATES

January 12, 2007 - Abstract submissions due
Paper submission due: January 28, 2007
Acceptance notification: March 26, 2007
Camera-ready due: April 27, 2007 (XHPC), May 26, 2007 (VHPC)
Conference: June 27-29, 2007 (XHPC) and August 28-31, 2007 (VHPC)

CHAIRS

Michael Alexander (chair), WU Vienna, Austria
Stephen Childs (co-chair), Trinity College, Dublin, Ireland

PROGRAM COMMITTEE

Jussara Almeida, Federal University of Minas Gerais,