[OMPI users] implementation of a message logging protocol
Dear all, I am currently working on a fault-tolerant protocol for message-passing applications based on message logging. For my experiments, I want to implement my protocol in an MPI library. I know that message logging protocols have already been implemented in MPICH with MPICH-V. I'm wondering whether, in the current state of Open MPI, it is possible to do the same kind of work in this library? Is there somebody currently working on the same subject? Best regards, Thomas Ropars.
Re: [OMPI users] deadlock on barrier
Is this a TCP-based cluster? If so, do you have multiple IP addresses on your frontend machine? Check out these two FAQ entries to see if they help: http://www.open-mpi.org/faq/?category=tcp#tcp-routability http://www.open-mpi.org/faq/?category=tcp#tcp-selection On Mar 21, 2007, at 4:51 PM, tim gunter wrote: i am experiencing some issues w/ openmpi 1.2 running on a rocks 4.2.1 cluster(the issues also appear to occur w/ openmpi 1.1.5 and 1.1.4). when i run my program with the frontend in the list of nodes, they deadlock. when i run my program without the frontend in the list of nodes, they run to completion. the simplest test program that does this(test1.c) does an "MPI_Init", followed by an "MPI_Barrier", and a "MPI_Finalize". so the following deadlocks: /users/gunter $ mpirun -np 3 -H frontend,compute-0-0,compute-0-1 ./test1 host:compute-0-1.local made it past the barrier, ret:0 mpirun: killing job... mpirun noticed that job rank 0 with PID 15384 on node frontend exited on signal 15 (Terminated). 2 additional processes aborted (not shown) this runs to completion: /users/gunter $ mpirun -np 3 -H compute-0-0,compute-0-1,compute-0-2 ./test1 host:compute-0-1.local made it past the barrier, ret:0 host:compute-0-0.local made it past the barrier, ret:0 host:compute-0-2.local made it past the barrier, ret:0 if i have the compute nodes send a message to the frontend prior to the barrier, it runs to completion: /users/gunter $ mpirun -np 3 -H frontend,compute-0-0,compute-0-1 ./test2 0 host: frontend.domain node: 0 is the master host: compute-0-0.local node: 1 sent: 1 to:0 host: compute-0-1.local node: 2 sent: 2 to:0 host: frontend.domain node: 0 recv: 1 from: 1 host: frontend.domain node: 0 recv: 2 from: 2 host: frontend.domain made it past the barrier, ret:0 host: compute-0-1.local made it past the barrier, ret:0 host: compute-0-0.local made it past the barrier, ret:0 if i have a different node function as the master, it deadlocks: /users/gunter $ mpirun -np 3 -H frontend,compute-0-0,compute-0-1 ./test2 1 host: compute-0-0.local node: 1 is the master host: compute-0-1.local node: 2 sent: 2 to:1 mpirun: killing job... mpirun noticed that job rank 0 with PID 15411 on node frontend exited on signal 15 (Terminated). 2 additional processes aborted (not shown) how is it that in the first example, one node makes it past the barrier, and the rest deadlock? these programs both run to completion on two other MPI implementations. is there something mis-configured on my cluster? or is this potentially an openmpi bug? what is the best way to debug this? any help would be appreciated! --tim ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres Cisco Systems
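For readers of the archive: the test1.c discussed above is not attached to the thread. A minimal reconstruction consistent with the reported output would be something along these lines (the hostname printing is added to match the "made it past the barrier" messages):

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      char host[MPI_MAX_PROCESSOR_NAME];
      int len, ret;

      MPI_Init(&argc, &argv);
      MPI_Get_processor_name(host, &len);
      ret = MPI_Barrier(MPI_COMM_WORLD);   /* every rank must reach this point */
      printf("host:%s made it past the barrier, ret:%d\n", host, ret);
      MPI_Finalize();
      return 0;
  }

A hang in a program this small almost always points at the runtime or network configuration (as it did here) rather than at the application code.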
Re: [OMPI users] MPI processes swapping out
Are you using a scheduler on your system? More specifically, does Open MPI know that you have four process slots on each node? If you are using a hostfile and didn't specify "slots=4" for each host, Open MPI will think that it's oversubscribing and will therefore call sched_yield() in the depths of its progress engine. On Mar 21, 2007, at 5:08 PM, Heywood, Todd wrote: P.s. I should have said this this is a pretty course-grained application, and netstat doesn't show much communication going on (except in stages). On 3/21/07 4:21 PM, "Heywood, Todd" wrote: I noticed that my OpenMPI processes are using larger amounts of system time than user time (via vmstat, top). I'm running on dual-core, dual-CPU Opterons, with 4 slots per node, where the program has the nodes to themselves. A closer look showed that they are constantly switching between run and sleep states with 4-8 page faults per second. Why would this be? It doesn't happen with 4 sequential jobs running on a node, where I get 99% user time, maybe 1% system time. The processes have plenty of memory. This behavior occurs whether I use processor/memory affinity or not (there is no oversubscription). Thanks, Todd ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres Cisco Systems
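As a concrete illustration of the "slots=4" remark above, a hostfile for four-slot nodes would look like this (hostnames are placeholders):

  node01 slots=4
  node02 slots=4

If a host is listed without a slots value, Open MPI counts only one slot for it, so running four processes there is treated as oversubscription and the sched_yield() behaviour described above kicks in.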
Re: [OMPI users] threading
Open MPI currently has minimal use of hidden "progress" threads, but we will likely be experimenting with more usage of them over time (previous MPI implementations have shown that progress threads can be a big performance win for large messages, although they do tend to add a bit of latency). To answer your direct question, when you ask Open MPI for N processes (e.g., "mpirun -np N a.out"), you'll get N unix processes. Open MPI will not create N threads (or split threads across nodes without oversubscription such that you get a total of N ranks in MPI_COMM_WORLD). Previous MPI implementations have tried this kind of scheme (launching threads as MPI processes), but (IMHO) it violated the Law of Least Astonishment (see http://www.canonical.org/~kragen/tao-of-programming.html) in that the user's MPI application was then subject to the constraints of multi-threaded programming. So most (all?) modern MPI implementations that I am aware of deal with operating system processes as individual MPI_COMM_WORLD ranks (as opposed to threads). On Mar 21, 2007, at 5:29 PM, David Burns wrote: I have used POSIX threading and Open MPI without problems on our Opteron 2216 Cluster (4 cores per node). Moving to core-level parallelization with multi threading resulted in significant performance gains. Sam Adams wrote: I have been looking, but I haven't really found a good answer about system level threading. We are about to get a new cluster of dual-processor quad-core nodes or 8 cores per node. Traditionally I would just tell MPI to launch two processes per dual processor single core node, but with eight cores on a node, having 8 processes seems inefficient. My question is this: does OpenMPI sense that there are multiple cores on a node and use something like pthreads instead of creating new processes automatically when I request 8 processes for a node, or should I run a single process per node and use OpenMP or pthreads explicitly to get better performance on a per node basis? ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres Cisco Systems
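To make the answer to Sam's question concrete: Open MPI launches one OS process per rank, and any threading is up to the application. A minimal sketch of the hybrid style he asks about (a few MPI processes per node, with the application itself spawning threads via OpenMP; compiled with something like "mpicc -fopenmp hybrid.c" on a GCC-based toolchain) could look like this:

  #include <stdio.h>
  #include <mpi.h>
  #include <omp.h>

  int main(int argc, char **argv)
  {
      int rank;

      MPI_Init(&argc, &argv);                /* one OS process per MPI rank */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      #pragma omp parallel                   /* threads are created by the app, not by MPI */
      {
          printf("rank %d: thread %d of %d\n",
                 rank, omp_get_thread_num(), omp_get_num_threads());
      }

      MPI_Finalize();
      return 0;
  }

No MPI calls are made inside the parallel region, so the default (single-threaded) MPI_Init is sufficient here. Whether this beats eight plain MPI processes per node depends entirely on the application.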
Re: [OMPI users] portability of the executables compiled with OpenMPI
On Mar 15, 2007, at 12:18 PM, Michael wrote: I'm having trouble with the portability of executables compiled with OpenMPI. I suspect the sysadms on the HPC system I'm using changed something because I think it worked previously. Situation: I'm compiling my code locally on a machine with just ethernet interfaces and OpenMPI 1.1.2 that I built. When I attempt to run that executable on a HPC machine with OpenMPI 1.1.2 and InfiniBand interfaces I get messages about "can't find libmosal.so.0.0" -- I'm certain this wasn't happening earlier. I can compile on this machine and run on it, even though there is no libmosal.* in my path. mpif90 --showme on this system gives me: /opt/compiler/intel/compiler91/x86_64/bin/ifort -I/opt/mpi/x86_64/intel/9.1/openmpi-1.1.4/include -pthread -I/opt/mpi/x86_64/intel/9.1/openmpi-1.1.4/lib -L/opt/mpi/x86_64/intel/9.1/openmpi-1.1.4/lib -L/opt/gm/lib64 -lmpi_f90 -lmpi -lorte -lopal -lgm -lvapi -lmosal -lrt -lnuma -ldl -Wl,--export-dynamic -lnsl -lutil -ldl Based on this output, I assume you have configured OMPI with either --enable-static or otherwise including all plugins in libmpi.so, right? I suspect that read access to libmosal.so has been removed and somehow when I link on this machine I'm getting a static library, i.e. libmosal.a Does this make any sense? This would be consistent with what you described above -- that libmosal.so (a VAPI support library) is available on the initial machine, so your MPI executable will have a runtime dependency on it. But then on the second machine, libmosal.so is not available, so the runtime dependency fails. But if you compile manually, libmosal.a is available and therefore the application can be created with compile/link-time resolution (vs. runtime resolution). Is there a flag in this compile line that permits linking an executable even when the person doing the linking does not have access to all the libraries, i.e. export-dynamic? No. All the same Linux/POSIX linking rules apply for creating an executable; we're not doing anything funny in this area. FYI: --export-dynamic tells the linker that symbols in the libraries should be available to plugins that are opened later. It's probably not relevant for the case where you're not opening any plugins at runtime, but we don't differentiate between this case because the decision whether to open plugins or not is a runtime decision, not a compile/link-time decision. -- Jeff Squyres Cisco Systems
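A note for anyone diagnosing a similar problem: Open MPI's wrapper compilers can show what they would link without building anything, which makes it easy to spot dependencies such as libmosal before moving an executable between machines. For example (output will of course differ per installation; "myapp" is a placeholder):

  mpif90 --showme:compile     # include/compile flags only
  mpif90 --showme:link        # library search paths and -l flags only
  ldd ./myapp                 # runtime .so dependencies of an already-built executable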
Re: [OMPI users] portability of the executables compiled with OpenMPI
On Mar 15, 2007, at 5:02 PM, Michael wrote: I would like to hear just how portable an executable compiled against OpenMPI shared libraries should be. This is a hard question to answer: 1. We have not done this explicit kind of testing. 2. Open MPI's libmpi.so itself is plain vanilla C. If you have an application that is already portable, linking it against Open MPI should not cause it to be less portable. 3. Open MPI, however, can use many support libraries (e.g., libmosal in your previous mail). This myriad of extra libraries may create difficulties in creating a truly portable application. The best practices that I have seen have been: - start with an application that itself is already portable (without MPI) - compile everything 100% static But this has drawbacks as well -- consider if you link in libmosal.a to your MPI application and then take it to another system that has a slightly different version of VAPI (e.g., a kernel interface changed). Although your application will load and start running (i.e., no runtime linker resolution failures), it may fail in unpredictable ways later because the libmosal.a in your application calls down to the kernel in ways that are unsupported by the VAPI/ libmosal on the current system. Make sense? This is unfortunately not an MPI (or Open MPI) specific issue; it's a larger problem of creating truly portable software. To have a better chance of success, you probably want to ensure that all relevant points of interaction between your application and the outside system are either the same version or "compatible enough". - high-speed networking support libraries - resource manager support libraries - libc - ...etc. Specifically, even though you won't be looking for .so's at runtime, you need to ensure that the way the .a's compiled into your application interact with the system is either the same way or "close enough" to how the corresponding support libraries work on the target machine. All this being said, Open MPI did try to take steps in its design to be able to effect more portability (e.g., for ISV's). Theoretically -- we have not explicitly tested this -- the following setup may provide a better degree of portability: - have the same version of Open MPI available on each machine, compiled against whatever support libraries are relevant on that machine (using plugins, not --enable-static). - compile your application *dynamically* against Open MPI. Note that some of the upper-level configuration of Open MPI must be either the same or "close enough" between machines such that runtime linking will work properly (e.g., don't use a 32 bit libmpi on one machine and a 64 bit libmpi on another, etc. There's more details here, but you get the general idea) - ensure that other (non-MPI-related) interaction points in your application are the same or "close enough" to be portable By linking dynamically against Open MPI (which is plain vanilla C), the application will only be looking for Open MPI's plain C support libraries -- not the other support libraries (such as libmosal), because those are linked against OMPI's plugins -- not libmpi.so (etc.). This design effectively takes MPI out of the portability equation. That's the theory, anyway. :-) I skipped many nit-picky details, so I'm sure there will be issues to figure out. But *in theory*, it's possible... I'm compiling on a Debian Linux system with dual 1.3 GHz AMD Opterons per node and an internal network of dual gigabit ethernet. 
I'm planning on a SUSE Linux Enterprise Server 9 system with dual 3.6 GHz Intel Xeon EM64T per node and an internal network using Myrinet. I can't speak for how portable Myrinet support libraries are... Myricom? -- Jeff Squyres Cisco Systems
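As an untested sketch of the two build strategies Jeff contrasts above (paths and version numbers are placeholders, not taken from the thread):

  # 1. Fully static build: network support libraries (libmosal, libgm, etc.)
  #    are folded into the executable at link time.
  ./configure --prefix=$HOME/ompi-static --enable-static --disable-shared

  # 2. Default plugin build, done on *each* machine: libmpi.so stays plain C
  #    and the network support is loaded from that machine's own plugins.
  ./configure --prefix=/opt/openmpi-1.1.2
  mpif90 -o myapp myapp.f90   # dynamic link; only Open MPI's own libraries are runtime dependencies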
Re: [OMPI users] portability of the executables compiled with OpenMPI
On Mar 22, 2007, at 7:55 AM, Jeff Squyres wrote: On Mar 15, 2007, at 12:18 PM, Michael wrote: Situation: I'm compiling my code locally on a machine with just ethernet interfaces and OpenMPI 1.1.2 that I built. When I attempt to run that executable on a HPC machine with OpenMPI 1.1.2 and InfiniBand interfaces I get messages about "can't find libmosal.so.0.0" -- I'm certain this wasn't happening earlier. I can compile on this machine and run on it, even though there is no libmosal.* in my path. mpif90 --showme on this system gives me: /opt/compiler/intel/compiler91/x86_64/bin/ifort -I/opt/mpi/x86_64/intel/9.1/openmpi-1.1.4/include -pthread -I/opt/mpi/x86_64/intel/9.1/openmpi-1.1.4/lib -L/opt/mpi/x86_64/intel/9.1/openmpi-1.1.4/lib -L/opt/gm/lib64 -lmpi_f90 -lmpi -lorte -lopal -lgm -lvapi -lmosal -lrt -lnuma -ldl -Wl,--export-dynamic -lnsl -lutil -ldl Based on this output, I assume you have configured OMPI with either --enable-static or otherwise including all plugins in libmpi.so, right? No, I did not configure OpenMPI on this machine. I believe OpenMPI was not configured static by the installers, based on the messages and the dependency on the missing libraries. The issue was that some of the 1000+ nodes on this major HPC machine were missing libraries needed for OpenMPI, but because of the low usage of OpenMPI I'm the first to discover the problem. For whatever reason these libraries are not on the front-end machines that feed the main system. It's always nice running OpenMPI on your own machine but not everyone can always do that. The way I read my experience is that OpenMPI's libmpi.so depends on different libraries on different machines; this means that if you don't compile statically, you can compile on a machine that does not have libraries for expensive interfaces and run on another machine with those expensive interfaces -- that's what I am doing now successfully. Michael
Re: [OMPI users] portability of the executables compiled with OpenMPI
For your reference: The following cross-compile/run combination with OpenMPI 1.1.4 is currently working for me: I'm compiling on a Debian Linux system with dual 1.3 GHz AMD Opterons per node and an internal network of dual gigabit ethernet, with OpenMPI compiled with Intel Fortran 9.1.041 and gcc 3.3.5. I'm running on a SUSE Linux Enterprise Server 9 system with dual 3.6 GHz Intel Xeon EM64T per node and an internal network using Myrinet, with OpenMPI compiled with Intel Fortran 9.1.041 and Intel icc 9.1.046. There is enough compatibility between the two different libmpi.so's that I do not have a problem. I have to periodically check the second system to see if it has been updated, in which case I have to update my system. Michael
Re: [OMPI users] MPI processes swapping out
Yes, I'm using SGE. I also just noticed that when 2 tasks/slots run on a 4-core node, the 2 tasks are still cycling between run and sleep, with higher system time than user time. Ompi_info shows the MCA parameter mpi_yield_when_idle to be 0 (aggressive), so that suggests the tasks aren't swapping out on blocking calls. Still puzzled. Thanks, Todd On 3/22/07 7:36 AM, "Jeff Squyres" wrote: > Are you using a scheduler on your system? > > More specifically, does Open MPI know that you have for process slots > on each node? If you are using a hostfile and didn't specify > "slots=4" for each host, Open MPI will think that it's > oversubscribing and will therefore call sched_yield() in the depths > of its progress engine. > > > On Mar 21, 2007, at 5:08 PM, Heywood, Todd wrote: > >> P.s. I should have said this this is a pretty course-grained >> application, >> and netstat doesn't show much communication going on (except in >> stages). >> >> >> On 3/21/07 4:21 PM, "Heywood, Todd" wrote: >> >>> I noticed that my OpenMPI processes are using larger amounts of >>> system time >>> than user time (via vmstat, top). I'm running on dual-core, dual-CPU >>> Opterons, with 4 slots per node, where the program has the nodes to >>> themselves. A closer look showed that they are constantly >>> switching between >>> run and sleep states with 4-8 page faults per second. >>> >>> Why would this be? It doesn't happen with 4 sequential jobs >>> running on a >>> node, where I get 99% user time, maybe 1% system time. >>> >>> The processes have plenty of memory. This behavior occurs whether >>> I use >>> processor/memory affinity or not (there is no oversubscription). >>> >>> Thanks, >>> >>> Todd >>> >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >
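For reference, the aggressive/yield behaviour discussed in this thread can be forced explicitly; the usual ways of setting the MCA parameter (0 = aggressive, 1 = yield when idle; syntax as commonly documented for Open MPI) are:

  mpirun --mca mpi_yield_when_idle 0 -np 8 ./a.out               # per invocation
  export OMPI_MCA_mpi_yield_when_idle=0                          # via the environment
  echo "mpi_yield_when_idle = 0" >> ~/.openmpi/mca-params.conf   # per-user default file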
Re: [OMPI users] Fault Tolerance
LAM/MPI was able to checkpoint/restart an entire MPI job as you mention. Open MPI is now able to checkpoint/restart as well. In the past week I added to the Open MPI trunk a LAM/MPI-like checkpoint/ restart implementation. In Open MPI we revisited many of the design decisions from the LAM/MPI development and improved on them quite a bit. At the moment there is no documentation on how to use it (egg on my face actually). I'm working on developing the documentation, and I will send a note to the users list once it is available. Cheers, Josh On Mar 21, 2007, at 1:18 PM, Thomas Spraggins wrote: To migrate processes, you need to be able to checkpoint them. I believe that LAM-MPI is the only MPI implementation that allows this, although I have never used LAM-MPI. Good luck. Tom Spraggins t...@virginia.edu On Mar 21, 2007, at 1:09 PM, Mohammad Huwaidi wrote: Hello folks, I am trying to write some fault-tolerance systems with the following criteria: 1) Recover any software/hardware crashes 2) Dynamically Shrink and grow. 3) Migrate processes among machines. Does anyone has examples of code? What MPI platform is recommended to accomplish such requirements? I am using three MPI platforms and each has it own issues: 1) MPICH2 - good multi-threading support, but bad fault-tolerance mechanisms. 2) OpenMPI - Does not support multi-threading properly and cannot have it trap exceptions yet. 3) FT-MPI - Old and does not support multi-threading at all. Any suggestions? -- Regards, Mohammad Huwaidi We can't resolve problems by using the same kind of thinking we used when we created them. --Albert Einstein ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users Josh Hursey jjhur...@open-mpi.org http://www.open-mpi.org/
Re: [OMPI users] deadlock on barrier
On 3/22/07, Jeff Squyres wrote: Is this a TCP-based cluster? yes If so, do you have multiple IP addresses on your frontend machine? Check out these two FAQ entries to see if they help: http://www.open-mpi.org/faq/?category=tcp#tcp-routability http://www.open-mpi.org/faq/?category=tcp#tcp-selection ok, using the internal interfaces only fixed the problem. it is a little confusing that when this happens, one machine would make it past the barrier, and the others would not. thanks Jeff! --tim
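For the archive, the fix tim describes boils down to restricting Open MPI's TCP transport to the cluster-internal interface(s), per the FAQ entries Jeff cites. For example (interface names are placeholders for whatever the cluster actually uses):

  mpirun --mca btl_tcp_if_include eth1 -np 3 -H frontend,compute-0-0,compute-0-1 ./test1
  # or, equivalently, exclude the loopback and public interfaces:
  mpirun --mca btl_tcp_if_exclude lo,eth0 -np 3 -H frontend,compute-0-0,compute-0-1 ./test1

When the frontend has both public and private addresses, letting Open MPI pick interfaces freely can leave some connections half-established, which is one way a run can wedge with only some ranks making progress.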
Re: [OMPI users] MPI processes swapping out
Just for clarification: ompi_info only shows the *default* value of the MCA parameter. In this case, mpi_yield_when_idle defaults to aggressive, but that value is reset internally if the system sees an "oversubscribed" condition. The issue here isn't how many cores are on the node, but rather how many were specifically allocated to this job. If the allocation wasn't at least 2 (in your example), then we would automatically reset mpi_yield_when_idle to be non-aggressive, regardless of how many cores are actually on the node. Ralph On 3/22/07 7:14 AM, "Heywood, Todd" wrote: > Yes, I'm using SGE. I also just noticed that when 2 tasks/slots run on a > 4-core node, the 2 tasks are still cycling between run and sleep, with > higher system time than user time. > > Ompi_info shows the MCA parameter mpi_yield_when_idle to be 0 (aggressive), > so that suggests the tasks aren't swapping out on bloccking calls. > > Still puzzled. > > Thanks, > Todd > > > On 3/22/07 7:36 AM, "Jeff Squyres" wrote: > >> Are you using a scheduler on your system? >> >> More specifically, does Open MPI know that you have for process slots >> on each node? If you are using a hostfile and didn't specify >> "slots=4" for each host, Open MPI will think that it's >> oversubscribing and will therefore call sched_yield() in the depths >> of its progress engine. >> >> >> On Mar 21, 2007, at 5:08 PM, Heywood, Todd wrote: >> >>> P.s. I should have said this this is a pretty course-grained >>> application, >>> and netstat doesn't show much communication going on (except in >>> stages). >>> >>> >>> On 3/21/07 4:21 PM, "Heywood, Todd" wrote: >>> I noticed that my OpenMPI processes are using larger amounts of system time than user time (via vmstat, top). I'm running on dual-core, dual-CPU Opterons, with 4 slots per node, where the program has the nodes to themselves. A closer look showed that they are constantly switching between run and sleep states with 4-8 page faults per second. Why would this be? It doesn't happen with 4 sequential jobs running on a node, where I get 99% user time, maybe 1% system time. The processes have plenty of memory. This behavior occurs whether I use processor/memory affinity or not (there is no oversubscription). Thanks, Todd ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >> > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
[OMPI users] Cell EIB support for OpenMPI
Hi, Has anyone investigated adding intra-chip Cell EIB messaging to OpenMPI? It seems like it ought to work. This paper seems pretty convincing: http://www.cs.fsu.edu/research/reports/TR-061215.pdf
Re: [OMPI users] Cell EIB support for OpenMPI
That's pretty cool. The main issue with this, as addressed at the end of the report, is that the code size is going to be a problem as data and code must live in the same 256KB in each SPE. They mention dynamic overlay loading, which is also how we deal with large code size, but things get tricky and slow with the potentially needed save and restore of registers and LS. It would be interesting to see how much of MPI could be implemented and how much is really needed. Maybe it's time to think about an MPI-ES spec? -Mike Marcus G. Daniels wrote: Hi, Has anyone investigated adding intra chip Cell EIB messaging to OpenMPI? It seems like it ought to work. This paper seems pretty convincing: http://www.cs.fsu.edu/research/reports/TR-061215.pdf ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] MPI processes swapping out
Ralph, Well, according to the FAQ, aggressive mode can be "forced" so I did try setting OMPI_MCA_mpi_yield_when_idle=0 before running. I also tried turning processor/memory affinity on. Effects were minor. The MPI tasks still cycle between run and sleep states, driving up system time well over user time. Mpstat shows SGE is indeed giving 4 or 2 slots per node as appropriate (depending on memory) and the MPI tasks are using 4 or 2 cores, but to be sure, I also tried running directly with a hostfile with slots=4 or slots=2. The same behavior occurs. This behavior is a function of the size of the job. I.e., as I scale from 200 to 800 tasks the run/sleep cycling increases, so that system time grows from maybe half the user time to maybe 5 times user time. This is for TCP/gigE. Todd On 3/22/07 12:19 PM, "Ralph Castain" wrote: > Just for clarification: ompi_info only shows the *default* value of the MCA > parameter. In this case, mpi_yield_when_idle defaults to aggressive, but > that value is reset internally if the system sees an "oversubscribed" > condition. > > The issue here isn't how many cores are on the node, but rather how many > were specifically allocated to this job. If the allocation wasn't at least 2 > (in your example), then we would automatically reset mpi_yield_when_idle to > be non-aggressive, regardless of how many cores are actually on the node. > > Ralph > > > On 3/22/07 7:14 AM, "Heywood, Todd" wrote: > >> Yes, I'm using SGE. I also just noticed that when 2 tasks/slots run on a >> 4-core node, the 2 tasks are still cycling between run and sleep, with >> higher system time than user time. >> >> Ompi_info shows the MCA parameter mpi_yield_when_idle to be 0 (aggressive), >> so that suggests the tasks aren't swapping out on bloccking calls. >> >> Still puzzled. >> >> Thanks, >> Todd >> >> >> On 3/22/07 7:36 AM, "Jeff Squyres" wrote: >> >>> Are you using a scheduler on your system? >>> >>> More specifically, does Open MPI know that you have for process slots >>> on each node? If you are using a hostfile and didn't specify >>> "slots=4" for each host, Open MPI will think that it's >>> oversubscribing and will therefore call sched_yield() in the depths >>> of its progress engine. >>> >>> >>> On Mar 21, 2007, at 5:08 PM, Heywood, Todd wrote: >>> P.s. I should have said this this is a pretty course-grained application, and netstat doesn't show much communication going on (except in stages). On 3/21/07 4:21 PM, "Heywood, Todd" wrote: > I noticed that my OpenMPI processes are using larger amounts of > system time > than user time (via vmstat, top). I'm running on dual-core, dual-CPU > Opterons, with 4 slots per node, where the program has the nodes to > themselves. A closer look showed that they are constantly > switching between > run and sleep states with 4-8 page faults per second. > > Why would this be? It doesn't happen with 4 sequential jobs > running on a > node, where I get 99% user time, maybe 1% system time. > > The processes have plenty of memory. This behavior occurs whether > I use > processor/memory affinity or not (there is no oversubscription). > > Thanks, > > Todd > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
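Incidentally, for anyone reproducing this: at the time of this thread, turning on processor affinity the way Todd describes was typically done with an MCA parameter along these lines (parameter name as documented for Open MPI 1.2; later releases replaced it with explicit binding options):

  mpirun --mca mpi_paffinity_alone 1 -np 8 ./a.out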
Re: [OMPI users] MPI processes swapping out
On 3/22/07 11:30 AM, "Heywood, Todd" wrote: > Ralph, > > Well, according to the FAQ, aggressive mode can be "forced" so I did try > setting OMPI_MCA_mpi_yield_when_idle=0 before running. I also tried turning > processor/memory affinity on. Efffects were minor. The MPI tasks still cycle > bewteen run and sleep states, driving up system time well over user time. Yes, that's true - and we do (should) respect any such directive. > > Mpstat shows SGE is indeed giving 4 or 2 slots per node as approporiate > (depending on memory) and the MPI tasks are using 4 or 2 cores, but to be > sure, I also tried running directly with a hostfile with slots=4 or slots=2. > The same behavior occurs. Okay - thanks for trying that! > > This behavior is a function of the size of the job. I.e. As I scale from 200 > to 800 tasks the run/sleep cycling increases, so that system time grows from > maybe half the user time to maybe 5 times user time. > > This is for TCP/gigE. What version of OpenMPI are you using? This sounds like something we need to investigate. Thanks for the help! Ralph > > Todd > > > On 3/22/07 12:19 PM, "Ralph Castain" wrote: > >> Just for clarification: ompi_info only shows the *default* value of the MCA >> parameter. In this case, mpi_yield_when_idle defaults to aggressive, but >> that value is reset internally if the system sees an "oversubscribed" >> condition. >> >> The issue here isn't how many cores are on the node, but rather how many >> were specifically allocated to this job. If the allocation wasn't at least 2 >> (in your example), then we would automatically reset mpi_yield_when_idle to >> be non-aggressive, regardless of how many cores are actually on the node. >> >> Ralph >> >> >> On 3/22/07 7:14 AM, "Heywood, Todd" wrote: >> >>> Yes, I'm using SGE. I also just noticed that when 2 tasks/slots run on a >>> 4-core node, the 2 tasks are still cycling between run and sleep, with >>> higher system time than user time. >>> >>> Ompi_info shows the MCA parameter mpi_yield_when_idle to be 0 (aggressive), >>> so that suggests the tasks aren't swapping out on bloccking calls. >>> >>> Still puzzled. >>> >>> Thanks, >>> Todd >>> >>> >>> On 3/22/07 7:36 AM, "Jeff Squyres" wrote: >>> Are you using a scheduler on your system? More specifically, does Open MPI know that you have for process slots on each node? If you are using a hostfile and didn't specify "slots=4" for each host, Open MPI will think that it's oversubscribing and will therefore call sched_yield() in the depths of its progress engine. On Mar 21, 2007, at 5:08 PM, Heywood, Todd wrote: > P.s. I should have said this this is a pretty course-grained > application, > and netstat doesn't show much communication going on (except in > stages). > > > On 3/21/07 4:21 PM, "Heywood, Todd" wrote: > >> I noticed that my OpenMPI processes are using larger amounts of >> system time >> than user time (via vmstat, top). I'm running on dual-core, dual-CPU >> Opterons, with 4 slots per node, where the program has the nodes to >> themselves. A closer look showed that they are constantly >> switching between >> run and sleep states with 4-8 page faults per second. >> >> Why would this be? It doesn't happen with 4 sequential jobs >> running on a >> node, where I get 99% user time, maybe 1% system time. >> >> The processes have plenty of memory. This behavior occurs whether >> I use >> processor/memory affinity or not (there is no oversubscription). 
>> >> Thanks, >> >> Todd >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] MPI processes swapping out
Hi, It is v1.2, default configuration. If it matters: OS is RHEL (2.6.9-42.0.3.ELsmp) on x86_64. I have noticed this for 2 apps so far, mpiBLAST and HPL, which are both course grained. Thanks, Todd On 3/22/07 2:38 PM, "Ralph Castain" wrote: > > > > On 3/22/07 11:30 AM, "Heywood, Todd" wrote: > >> Ralph, >> >> Well, according to the FAQ, aggressive mode can be "forced" so I did try >> setting OMPI_MCA_mpi_yield_when_idle=0 before running. I also tried turning >> processor/memory affinity on. Efffects were minor. The MPI tasks still cycle >> bewteen run and sleep states, driving up system time well over user time. > > Yes, that's true - and we do (should) respect any such directive. > >> >> Mpstat shows SGE is indeed giving 4 or 2 slots per node as approporiate >> (depending on memory) and the MPI tasks are using 4 or 2 cores, but to be >> sure, I also tried running directly with a hostfile with slots=4 or slots=2. >> The same behavior occurs. > > Okay - thanks for trying that! > >> >> This behavior is a function of the size of the job. I.e. As I scale from 200 >> to 800 tasks the run/sleep cycling increases, so that system time grows from >> maybe half the user time to maybe 5 times user time. >> >> This is for TCP/gigE. > > What version of OpenMPI are you using? This sounds like something we need to > investigate. > > Thanks for the help! > Ralph > >> >> Todd >> >> >> On 3/22/07 12:19 PM, "Ralph Castain" wrote: >> >>> Just for clarification: ompi_info only shows the *default* value of the MCA >>> parameter. In this case, mpi_yield_when_idle defaults to aggressive, but >>> that value is reset internally if the system sees an "oversubscribed" >>> condition. >>> >>> The issue here isn't how many cores are on the node, but rather how many >>> were specifically allocated to this job. If the allocation wasn't at least 2 >>> (in your example), then we would automatically reset mpi_yield_when_idle to >>> be non-aggressive, regardless of how many cores are actually on the node. >>> >>> Ralph >>> >>> >>> On 3/22/07 7:14 AM, "Heywood, Todd" wrote: >>> Yes, I'm using SGE. I also just noticed that when 2 tasks/slots run on a 4-core node, the 2 tasks are still cycling between run and sleep, with higher system time than user time. Ompi_info shows the MCA parameter mpi_yield_when_idle to be 0 (aggressive), so that suggests the tasks aren't swapping out on bloccking calls. Still puzzled. Thanks, Todd On 3/22/07 7:36 AM, "Jeff Squyres" wrote: > Are you using a scheduler on your system? > > More specifically, does Open MPI know that you have for process slots > on each node? If you are using a hostfile and didn't specify > "slots=4" for each host, Open MPI will think that it's > oversubscribing and will therefore call sched_yield() in the depths > of its progress engine. > > > On Mar 21, 2007, at 5:08 PM, Heywood, Todd wrote: > >> P.s. I should have said this this is a pretty course-grained >> application, >> and netstat doesn't show much communication going on (except in >> stages). >> >> >> On 3/21/07 4:21 PM, "Heywood, Todd" wrote: >> >>> I noticed that my OpenMPI processes are using larger amounts of >>> system time >>> than user time (via vmstat, top). I'm running on dual-core, dual-CPU >>> Opterons, with 4 slots per node, where the program has the nodes to >>> themselves. A closer look showed that they are constantly >>> switching between >>> run and sleep states with 4-8 page faults per second. >>> >>> Why would this be? 
It doesn't happen with 4 sequential jobs >>> running on a >>> node, where I get 99% user time, maybe 1% system time. >>> >>> The processes have plenty of memory. This behavior occurs whether >>> I use >>> processor/memory affinity or not (there is no oversubscription). >>> >>> Thanks, >>> >>> Todd >>> >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >>> >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > ___ > users mailing list > us...@open-mpi.org > http://w
Re: [OMPI users] Cell EIB support for OpenMPI
Mike Houston wrote: The main issue with this, and addressed at the end of the report, is that the code size is going to be a problem as data and code must live in the same 256KB in each SPE. Just for reference, here are the stripped shared library sizes for OpenMPI 1.2 as built on a Mercury Cell system. This is for the PPU, not the SPU.
  -rwxr-xr-x 1 mdaniels world  11216 Mar 22 2007 libmca_common_sm.so.0.0.0
  -rwxr-xr-x 1 mdaniels world 191440 Mar 22 2007 libmpi_cxx.so.0.0.0
  -rwxr-xr-x 1 mdaniels world 827440 Mar 22 2007 libmpi.so.0.0.0
  -rwxr-xr-x 1 mdaniels world 327912 Mar 22 2007 libopen-pal.so.0.0.0
  -rwxr-xr-x 1 mdaniels world 556584 Mar 22 2007 libopen-rte.so.0.0.0
Using -Os instead of -O3:
  -rwxr-xr-x 1 mdaniels world  11232 Mar 22 2007 libmca_common_sm.so.0.0.0
  -rwxr-xr-x 1 mdaniels world 258280 Mar 22 2007 libmpi_cxx.so.0.0.0
  -rwxr-xr-x 1 mdaniels world 749688 Mar 22 2007 libmpi.so.0.0.0
  -rwxr-xr-x 1 mdaniels world 296648 Mar 22 2007 libopen-pal.so.0.0.0
  -rwxr-xr-x 1 mdaniels world 501712 Mar 22 2007 libopen-rte.so.0.0.0
[OMPI users] hostfile syntax
Does the hostfile understand the syntax: mybox cpu=4 I have some legacy code and scripts that I'd like to move without modifying if possible. I understand the syntax is supposed to be: mybox slots=4 but using "cpu" seems to work. Does that achieve the same thing? -geoff
[OMPI users] Buffered sends
Is there a known issue with buffered sends in OpenMPI 1.1.4? I changed a single send which is called thousands of times from MPI_SEND (& MPI_ISEND) to MPI_BSEND (& MPI_IBSEND) and my Fortran 90 code slowed down by a factor of 10. I've looked at several references and I can't see where I'm making a mistake. The MPI_SEND is for MPI_PACKED data, so its first parameter is an allocated character array. I also allocated a character array for the buffer passed to MPI_BUFFER_ATTACH. Looking at the model implementation in one reference, which uses MPI_PACKED inside MPI_BSEND, I was wondering if this could be a problem, i.e. packing packed data? Michael ps. I have to use OpenMPI 1.1.4 to maintain compatibility with a major HPC center.
Re: [OMPI users] Buffered sends
This problem is not related to Open MPI; it is related to the way you use MPI. In fact there are 2 problems: 1. Buffered sends will copy the data into the attached buffer. In your case, I think this only adds one more memcpy operation to the critical path, which might partially explain the impressive slow-down (but I don't think this is the main reason). Buffering MPI_PACKED data seems like a non-optimal solution. You want to keep the critical path as short as possible and avoid any extra/useless memcpy. Using a double buffering technique (which will effectively double the amount of memory required for your communications) can give you some benefit. 2. Once the data is buffered, the Bsend (and the Ibsend) return to the user application without progressing the communication. With few exceptions (based on the available networks, which is not the case for TCP nor shared memory) the point-to-point communication will only be progressed on the next MPI call. If you look in the MPI standard to see what it exactly means to return from a blocking send, you will realize that the only requirement is that the user can touch the send buffer. From this perspective, the major difference between an MPI_Send and an MPI_Bsend operation is that the MPI_Send will return once the data is delivered to the NIC (which can then complete the communication asynchronously), while at the end of the MPI_Bsend the data is still in the application memory. The only way to get any benefit from the MPI_Bsend is to have a progress thread which takes care of the pending communications in the background. Such a thread is not enabled by default in Open MPI. Thanks, george. On Mar 22, 2007, at 5:18 PM, Michael wrote: Is there known issue with buffered sends in OpenMPI 1.1.4? I changed a single send which is called thousands of times from MPI_SEND (& MPI_ISEND) to MPI_BSEND (& MPI_IBSEND) and my Fortran 90 code slowed down by a factor of 10. I've looked at several references and I can't see where I'm making a mistake. The MPI_SEND is for MPI_PACKED data, so it's first parameter is an allocated character array. I also allocated a character array for the buffer passed to MPI_BUFFER_ATTACH. Looking at the model implementation in a reference they give a model of using MPI_PACKED inside MPI_BSEND, I was wondering if this could be a problem, i.e. packing packed data? Michael ps. I have to use OpenMPI 1.1.4 to maintain compatibility with a major HPC center. ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
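To illustrate George's point 1 (the extra copy) and the usual buffer-sizing rule, here is a minimal C sketch of correct MPI_Bsend usage -- not Michael's actual code, just the standard attach/send/detach pattern; run with at least two ranks:

  #include <stdlib.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, i, bufsize, data[1000];
      void *buf;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* Size the attached buffer for the largest outstanding Bsend,
       * plus MPI's per-message bookkeeping overhead. */
      MPI_Pack_size(1000, MPI_INT, MPI_COMM_WORLD, &bufsize);
      bufsize += MPI_BSEND_OVERHEAD;
      buf = malloc(bufsize);
      MPI_Buffer_attach(buf, bufsize);

      if (rank == 0) {
          for (i = 0; i < 1000; i++) data[i] = i;
          /* Copies data into the attached buffer and returns immediately;
           * the actual transfer is progressed by later MPI calls. */
          MPI_Bsend(data, 1000, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          MPI_Recv(data, 1000, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      }

      /* Blocks until all buffered messages have been delivered. */
      MPI_Buffer_detach(&buf, &bufsize);
      free(buf);
      MPI_Finalize();
      return 0;
  }

The copy into the attached buffer is exactly the extra memcpy George describes, which is why Bsend rarely beats a plain Send on the critical path.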
[OMPI users] Fwd: [Allinea #6458] message queues
We use OpenMPI as our default MPI lib on our clusters. We are starting to do some work with parallel debuggers (ddt to be exact) and was wondering what the time line for message queue debugging was. Just curious! Thanks. Brock Palen Center for Advanced Computing bro...@umich.edu (734)936-1985 Begin forwarded message: From: "David Lecomber" Date: March 22, 2007 6:44:35 PM GMT-04:00 To: bro...@umich.edu Cc: jacq...@allinea.com Subject: Re: [Allinea #6458] message queues Reply-To: supp...@allinea.com Hi Brock This question has just become "frequently asked" -- no-one asked in all of the last 12 months, and I think you're the third person this month, the second today! OpenMPI does not (yet?) support message queue debugging - this means the interface just isn't there for a debugger to get the information, sadly. Open-MPI's own FAQ does mention this lack of support, but I'm not sure of an ETA or whether they are actively developing it. Best wishes David On Thu, 2007-03-22 at 20:08 +, Brock Palen wrote: Thu Mar 22 20:08:41 2007: Request 6458 was acted upon. Transaction: Ticket created by bro...@umich.edu Queue: support Subject: message queues Owner: Nobody Requestors: bro...@umich.edu Status: new Ticket http://swtracker//Ticket/Display.html?id=6458 > Hello, According the the manual if you get the message "unable to load message queue library" to look at the FAQ, but i can not find a faq anyplace. We are new users and our mpi lib is openmpi-1.0.2 Brock Palen Center for Advanced Computing bro...@umich.edu (734)936-1985 -- David Lecomber CTO Allinea Software Tel: +44 1926 623231 Fax: +44 1926 623232
Re: [OMPI users] implementation of a message logging protocol
Thomas, We are working on this topic at the University of Tennessee. In fact, 2 of the MPICH-V guys are now working on Open MPI on fault tolerant aspects. With all the expertise we gathered doing MPICH-V, we decided to take a different approach and to take advantage of the modular architecture offered by Open MPI. We don't focus on any specific message logging protocol right now (but we expect to have at least all those present in MPICH-V3). Instead what we target is a generic framework which will allow researchers to implement in a simple and straightforward way any message logging protocols they want, as well as providing all tools required to make their life easier. The code is not yet in the Open MPI trunk but it will get there soon. We expect to be able to start moving the message logging framework in the trunk over the next month. Thanks, george. On Mar 22, 2007, at 4:48 AM, Thomas Ropars wrote: Dear all, I am currently working on a fault tolerant protocol for message passing applications based on message logging. For my experimentations, I want to implement my protocol in a MPI library. I know that message logging protocols have already been implemented in MPICH with MPICH-V. I'm wondering if in the actual state of Open MPI it is possible to do the same kind of work in this library ? Is there somebody currently working on the same subject ? Best regards, Thomas Ropars. ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Fwd: [Allinea #6458] message queues
Open MPI has support for one parallel debugger: TotalView. I don't know how DDT interacts with the MPI library in order to get access to the message queues, but we provide a library which allows TotalView to get access to the internal representation of the message queues in Open MPI. The access to the message queues is complete; you can see all pending sends and receives as well as unexpected messages. Thanks, george. On Mar 22, 2007, at 7:10 PM, Brock Palen wrote: We use OpenMPI as our default MPI lib on our clusters. We are starting to do some work with parallel debuggers (ddt to be exact) and was wondering what the time line for message queue debugging was. Just curious! Thanks. Brock Palen Center for Advanced Computing bro...@umich.edu (734)936-1985 Begin forwarded message: From: "David Lecomber" Date: March 22, 2007 6:44:35 PM GMT-04:00 To: bro...@umich.edu Cc: jacq...@allinea.com Subject: Re: [Allinea #6458] message queues Reply-To: supp...@allinea.com Hi Brock This question has just become "frequently asked" -- no-one asked in all of the last 12 months, and I think you're the third person this month, the second today! OpenMPI does not (yet?) support message queue debugging - this means the interface just isn't there for a debugger to get the information, sadly. Open-MPI's own FAQ does mention this lack of support, but I'm not sure of an ETA or whether they are actively developing it. Best wishes David On Thu, 2007-03-22 at 20:08 +, Brock Palen wrote: Thu Mar 22 20:08:41 2007: Request 6458 was acted upon. Transaction: Ticket created by bro...@umich.edu Queue: support Subject: message queues Owner: Nobody Requestors: bro...@umich.edu Status: new Ticket http://swtracker//Ticket/Display.html?id=6458 > Hello, According the the manual if you get the message "unable to load message queue library" to look at the FAQ, but i can not find a faq anyplace. We are new users and our mpi lib is openmpi-1.0.2 Brock Palen Center for Advanced Computing bro...@umich.edu (734)936-1985 -- David Lecomber CTO Allinea Software Tel: +44 1926 623231 Fax: +44 1926 623232 ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] hostfile syntax
Geoff, 'cpu', 'slots', and 'count' all do exactly the same thing. Tim On Thursday 22 March 2007 03:03 pm, Geoff Galitz wrote: > Does the hostfile understand the syntax: > > mybox cpu=4 > > I have some legacy code and scripts that I'd like to move without > modifying if possible. I understand the syntax is supposed to be: > > mybox slots=4 > > but using "cpu" seems to work. Does that achieve the same thing? > > -geoff > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
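So, per Tim's answer, all three of these hostfile lines are parsed identically by Open MPI ("mybox" is of course a placeholder hostname):

  mybox cpu=4
  mybox slots=4
  mybox count=4

The legacy scripts using "cpu=" can therefore stay as they are.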