On 3/22/07 11:30 AM, "Heywood, Todd" <heyw...@cshl.edu> wrote:
> Ralph,
>
> Well, according to the FAQ, aggressive mode can be "forced" so I did try
> setting OMPI_MCA_mpi_yield_when_idle=0 before running. I also tried turning
> processor/memory affinity on. Effects were minor. The MPI tasks still cycle
> between run and sleep states, driving up system time well over user time.
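> For reference, a minimal sketch of the two equivalent ways I applied the
> setting (./my_app is a placeholder for the actual executable):
>
>   export OMPI_MCA_mpi_yield_when_idle=0
>   mpirun -np 800 ./my_app
>
>   # or equivalently, on the command line:
>   mpirun --mca mpi_yield_when_idle 0 -np 800 ./my_app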
Yes, that's true - and we do (should) respect any such directive.
>
> Mpstat shows SGE is indeed giving 4 or 2 slots per node as appropriate
> (depending on memory) and the MPI tasks are using 4 or 2 cores, but to be
> sure, I also tried running directly with a hostfile with slots=4 or slots=2.
> The same behavior occurs.
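> For reference, the hostfile looked along these lines (hostnames are
> placeholders):
>
>   node01 slots=4
>   node02 slots=4
>   node03 slots=2
>
> launched with something like:
>
>   mpirun --hostfile hosts -np 10 ./my_app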
Okay - thanks for trying that!
>
> This behavior is a function of the size of the job: as I scale from 200
> to 800 tasks, the run/sleep cycling increases, so that system time grows
> from maybe half the user time to maybe 5 times the user time.
>
> This is for TCP/gigE.
What version of Open MPI are you using? This sounds like something we need to
investigate.
Thanks for the help!
Ralph
>
> Todd
>
>
> On 3/22/07 12:19 PM, "Ralph Castain" <r...@lanl.gov> wrote:
>
>> Just for clarification: ompi_info only shows the *default* value of the MCA
>> parameter. In this case, mpi_yield_when_idle defaults to aggressive, but
>> that value is reset internally if the system sees an "oversubscribed"
>> condition.
>>
>> The issue here isn't how many cores are on the node, but rather how many
>> were specifically allocated to this job. If the allocation wasn't at least 2
>> (in your example), then we would automatically reset mpi_yield_when_idle to
>> be non-aggressive, regardless of how many cores are actually on the node.
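>>
>> Assuming your build supports the mpi_show_mca_params parameter, one way
>> to see the value a job actually ran with (rather than the default that
>> ompi_info reports) is:
>>
>>   mpirun --mca mpi_show_mca_params 1 -np 4 ./my_app
>>
>> which prints the effective MCA settings during MPI_Init.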
>>
>> Ralph
>>
>>
>> On 3/22/07 7:14 AM, "Heywood, Todd" <heyw...@cshl.edu> wrote:
>>
>>> Yes, I'm using SGE. I also just noticed that when 2 tasks/slots run on a
>>> 4-core node, the 2 tasks are still cycling between run and sleep, with
>>> higher system time than user time.
>>>
>>> Ompi_info shows the MCA parameter mpi_yield_when_idle to be 0 (aggressive),
>>> so that suggests the tasks aren't swapping out on blocking calls.
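>>>
>>> For reference, that can be checked with something like:
>>>
>>>   ompi_info --param mpi all | grep yield_when_idle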
>>>
>>> Still puzzled.
>>>
>>> Thanks,
>>> Todd
>>>
>>>
>>> On 3/22/07 7:36 AM, "Jeff Squyres" <jsquy...@cisco.com> wrote:
>>>
>>>> Are you using a scheduler on your system?
>>>>
>>>> More specifically, does Open MPI know that you have four process slots
>>>> on each node? If you are using a hostfile and didn't specify
>>>> "slots=4" for each host, Open MPI will think that it's
>>>> oversubscribing and will therefore call sched_yield() in the depths
>>>> of its progress engine.
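>>>>
>>>> One way to confirm that from the outside, assuming strace is available
>>>> on the nodes, is to attach to one of the running MPI tasks and count
>>>> its system calls for a few seconds:
>>>>
>>>>   strace -c -p <pid>    # interrupt with Ctrl-C after a few seconds
>>>>
>>>> A process spinning in the yielding progress loop will show a very large
>>>> sched_yield count, which would also account for the high system time.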
>>>>
>>>>
>>>> On Mar 21, 2007, at 5:08 PM, Heywood, Todd wrote:
>>>>
>>>>> P.S. I should have said that this is a pretty coarse-grained
>>>>> application, and netstat doesn't show much communication going on
>>>>> (except in stages).
>>>>>
>>>>>
>>>>> On 3/21/07 4:21 PM, "Heywood, Todd" <heyw...@cshl.edu> wrote:
>>>>>
>>>>>> I noticed that my Open MPI processes are using larger amounts of
>>>>>> system time than user time (via vmstat, top). I'm running on
>>>>>> dual-core, dual-CPU Opterons, with 4 slots per node, where the
>>>>>> program has the nodes to themselves. A closer look showed that they
>>>>>> are constantly switching between run and sleep states, with 4-8 page
>>>>>> faults per second.
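>>>>>>
>>>>>> For reference, this was observed with the standard tools, e.g.:
>>>>>>
>>>>>>   vmstat 1    # compare the "us" (user) and "sy" (system) columns
>>>>>>   top         # process states flip between R (running) and S (sleeping)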
>>>>>>
>>>>>> Why would this be? It doesn't happen with 4 sequential jobs running
>>>>>> on a node, where I get 99% user time and maybe 1% system time.
>>>>>>
>>>>>> The processes have plenty of memory. This behavior occurs whether I
>>>>>> use processor/memory affinity or not (there is no oversubscription).
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Todd
>>>>>>
>>>>
>>>
>>
>>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users