Each node has two processors (not dual-core).
2011/3/28 Michele Marena <michelemar...@gmail.com>

> However, I thank you Tim, Ralph and Jeff.
> My sequential application runs in 24s (wall-clock time).
> My parallel application runs in 13s with two processes on different nodes.
> With shared memory, when the two processes are on the same node, my app
> runs in 23s.
> I don't understand why.
>
> 2011/3/28 Jeff Squyres <jsquy...@cisco.com>
>
>> If your program runs faster across 3 processes, 2 of which are local to
>> each other, with --mca btl tcp,self compared to --mca btl tcp,sm,self,
>> then something is very, very strange.
>>
>> Tim cites all kinds of things that can cause slowdowns, but it's still
>> very, very odd that simply enabling the shared-memory communications
>> channel in Open MPI *slows your overall application down*.
>>
>> How much does your application slow down in wall-clock time? Seconds?
>> Minutes? Hours? (Anything less than 1 second is in the noise.)
>>
>> On Mar 27, 2011, at 10:33 AM, Ralph Castain wrote:
>>
>>> On Mar 27, 2011, at 7:37 AM, Tim Prince wrote:
>>>
>>>> On 3/27/2011 2:26 AM, Michele Marena wrote:
>>>>> Hi,
>>>>> My application performs well without shared memory, but with shared
>>>>> memory I get worse performance than without it.
>>>>> Am I making a mistake? Is there something I'm overlooking?
>>>>> I know Open MPI uses the /tmp directory to allocate shared memory,
>>>>> and it is on the local filesystem.
>>>>
>>>> I guess you mean shared-memory message passing. Among the relevant
>>>> parameters may be the message size at which your implementation
>>>> switches from cached copy to non-temporal (if you are on a platform
>>>> where that terminology is used). If built with Intel compilers, for
>>>> example, the copy may be performed by intel_fast_memcpy, with a
>>>> default setting that uses non-temporal stores when the message exceeds
>>>> some preset size, e.g. 50% of the smallest L2 cache for that
>>>> architecture.
>>>> A quick search of past posts seems to indicate that Open MPI doesn't
>>>> itself invoke non-temporal stores, but there appear to be several
>>>> useful articles not connected with Open MPI.
>>>> In case guesses aren't sufficient, it's often necessary to profile
>>>> (gprof, oprofile, VTune, ...) to pin this down.
>>>> If shared-memory message passing slows your application down, the
>>>> question is whether this is due to excessive eviction of data from
>>>> cache; not a simple question, as most recent CPUs have 3 levels of
>>>> cache, your application may require more or less of the data that was
>>>> in use prior to the message receipt, and it may immediately use only a
>>>> small piece of a large message.
>>>
>>> There were several papers published in earlier years about shared-memory
>>> performance in the 1.2 series. There were known problems with that
>>> implementation, which is why it was heavily revised for the 1.3/1.4
>>> series.
>>>
>>> You might also look at the following links, though much of the material
>>> has been updated for the 1.3/1.4 series, as we don't really support 1.2
>>> any more:
>>>
>>> http://www.open-mpi.org/faq/?category=sm
>>>
>>> http://www.open-mpi.org/faq/?category=perftools
>>>
>>>> --
>>>> Tim Prince
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
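For anyone wanting to reproduce the comparison discussed above, here is a rough sketch: it times the same job with the shared-memory BTL disabled (`--mca btl tcp,self`) and enabled (`--mca btl tcp,sm,self`), and first checks that /tmp, where the sm BTL keeps its backing files, is on a local filesystem (sm over an NFS-backed /tmp is a known cause of slowdowns). The program name `./myapp` and the `-np 2` job size are placeholders for your own application.

```shell
#!/bin/sh
# Sketch: A/B-test the Open MPI shared-memory BTL. "./myapp" and "-np 2"
# are placeholders; substitute your own application and process count.

# The sm BTL places its backing files under /tmp by default; if /tmp is
# on NFS rather than a local disk, shared-memory transfers can be very slow.
df -T /tmp

if command -v mpirun >/dev/null 2>&1 && [ -x ./myapp ]; then
    # Shared memory off: on-node process pairs fall back to TCP loopback.
    time mpirun -np 2 --mca btl tcp,self ./myapp
    # Shared memory on: on-node process pairs use the sm BTL.
    time mpirun -np 2 --mca btl tcp,sm,self ./myapp
else
    # Open MPI or the application is not present; show the commands instead.
    echo "mpirun or ./myapp not found; commands to compare:"
    echo "  mpirun -np 2 --mca btl tcp,self ./myapp"
    echo "  mpirun -np 2 --mca btl tcp,sm,self ./myapp"
fi
```

If both wall-clock times are close, or the sm run is slower, comparing the `df -T /tmp` output against a known-local mount is the first thing to rule out.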