Ralph,

Well, according to the FAQ, aggressive mode can be "forced", so I did try
setting OMPI_MCA_mpi_yield_when_idle=0 before running. I also tried turning
processor/memory affinity on. Effects were minor. The MPI tasks still cycle
between run and sleep states, driving up system time well over user time.
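For reference, the forcing can be done either through the environment or on
the mpirun command line. A rough sketch (the affinity parameter name,
mpi_paffinity_alone, and the application name/task counts are placeholders,
not verbatim from my runs):

    # Force "aggressive" mode via the environment before launching:
    export OMPI_MCA_mpi_yield_when_idle=0
    mpirun -np 800 ./myapp

    # Equivalent, passing the MCA parameter to mpirun directly,
    # here with processor affinity turned on as well:
    mpirun --mca mpi_yield_when_idle 0 --mca mpi_paffinity_alone 1 -np 800 ./myapp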
Mpstat shows SGE is indeed giving 4 or 2 slots per node as appropriate
(depending on memory), and the MPI tasks are using 4 or 2 cores, but to be
sure, I also tried running directly with a hostfile with slots=4 or slots=2
(a sketch of that setup is at the bottom of this message). The same behavior
occurs.

This behavior is a function of the size of the job, i.e., as I scale from
200 to 800 tasks, the run/sleep cycling increases, so that system time grows
from maybe half the user time to maybe 5 times user time. This is for
TCP/gigE.

Todd


On 3/22/07 12:19 PM, "Ralph Castain" <r...@lanl.gov> wrote:

> Just for clarification: ompi_info only shows the *default* value of the MCA
> parameter. In this case, mpi_yield_when_idle defaults to aggressive, but
> that value is reset internally if the system sees an "oversubscribed"
> condition.
>
> The issue here isn't how many cores are on the node, but rather how many
> were specifically allocated to this job. If the allocation wasn't at least 2
> (in your example), then we would automatically reset mpi_yield_when_idle to
> be non-aggressive, regardless of how many cores are actually on the node.
>
> Ralph
>
>
> On 3/22/07 7:14 AM, "Heywood, Todd" <heyw...@cshl.edu> wrote:
>
>> Yes, I'm using SGE. I also just noticed that when 2 tasks/slots run on a
>> 4-core node, the 2 tasks are still cycling between run and sleep, with
>> higher system time than user time.
>>
>> Ompi_info shows the MCA parameter mpi_yield_when_idle to be 0 (aggressive),
>> so that suggests the tasks aren't swapping out on blocking calls.
>>
>> Still puzzled.
>>
>> Thanks,
>> Todd
>>
>>
>> On 3/22/07 7:36 AM, "Jeff Squyres" <jsquy...@cisco.com> wrote:
>>
>>> Are you using a scheduler on your system?
>>>
>>> More specifically, does Open MPI know that you have four process slots
>>> on each node? If you are using a hostfile and didn't specify
>>> "slots=4" for each host, Open MPI will think that it's
>>> oversubscribing and will therefore call sched_yield() in the depths
>>> of its progress engine.
>>>
>>>
>>> On Mar 21, 2007, at 5:08 PM, Heywood, Todd wrote:
>>>
>>>> P.S. I should have said that this is a pretty coarse-grained
>>>> application, and netstat doesn't show much communication going on
>>>> (except in stages).
>>>>
>>>>
>>>> On 3/21/07 4:21 PM, "Heywood, Todd" <heyw...@cshl.edu> wrote:
>>>>
>>>>> I noticed that my OpenMPI processes are using larger amounts of
>>>>> system time than user time (via vmstat, top). I'm running on
>>>>> dual-core, dual-CPU Opterons, with 4 slots per node, where the
>>>>> program has the nodes to themselves. A closer look showed that they
>>>>> are constantly switching between run and sleep states with 4-8 page
>>>>> faults per second.
>>>>>
>>>>> Why would this be? It doesn't happen with 4 sequential jobs running
>>>>> on a node, where I get 99% user time, maybe 1% system time.
>>>>>
>>>>> The processes have plenty of memory. This behavior occurs whether
>>>>> I use processor/memory affinity or not (there is no
>>>>> oversubscription).
>>>>>
>>>>> Thanks,
>>>>> Todd
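For completeness, the direct hostfile runs mentioned above looked roughly
like this (a sketch; hostnames, slot counts, and the application name are
placeholders):

    # myhosts -- give Open MPI the full allocation explicitly, so it
    # does not assume oversubscription and drop into sched_yield():
    node001 slots=4
    node002 slots=4
    node003 slots=2

    mpirun -np 10 --hostfile myhosts ./myapp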