Each node has two processors (not dual-core).
2011/3/28 Michele Marena <michelemar...@gmail.com>

> However, I thank you Tim, Ralph and Jeff.
> My sequential application runs in 24s (wall-clock time).
> My parallel application runs in 13s with two processes on different nodes.
> With shared memory, when the two processes are on the same node, my app
> runs in 23s.
> I don't understand why.
>
> 2011/3/28 Jeff Squyres <jsquy...@cisco.com>
>
>> If your program runs faster across 3 processes, 2 of which are local to
>> each other, with --mca btl tcp,self compared to --mca btl tcp,sm,self,
>> then something is very, very strange.
>>
>> Tim cites all kinds of things that can cause slowdowns, but it's still
>> very, very odd that simply enabling the shared-memory communications
>> channel in Open MPI *slows your overall application down*.
>>
>> How much does your application slow down in wall-clock time? Seconds?
>> Minutes? Hours? (Anything less than 1 second is in the noise.)
>>
>> On Mar 27, 2011, at 10:33 AM, Ralph Castain wrote:
>>
>>> On Mar 27, 2011, at 7:37 AM, Tim Prince wrote:
>>>
>>>> On 3/27/2011 2:26 AM, Michele Marena wrote:
>>>>> Hi,
>>>>> My application performs well without shared memory, but with shared
>>>>> memory I get worse performance than without it.
>>>>> Am I making a mistake? Is there something I'm overlooking?
>>>>> I know Open MPI uses the /tmp directory to allocate shared memory,
>>>>> and it is on the local filesystem.
>>>>
>>>> I guess you mean shared-memory message passing. Among the relevant
>>>> parameters may be the message size at which your implementation
>>>> switches from cached copy to non-temporal (if you are on a platform
>>>> where that terminology is used). If built with Intel compilers, for
>>>> example, the copy may be performed by intel_fast_memcpy, with a
>>>> default setting that uses non-temporal stores when the message exceeds
>>>> some preset size, e.g. 50% of the smallest L2 cache for that
>>>> architecture.
>>>> A quick search of past posts seems to indicate that Open MPI doesn't
>>>> itself invoke non-temporal stores, but there appear to be several
>>>> useful articles not connected with Open MPI.
>>>> In case guesses aren't sufficient, it's often necessary to profile
>>>> (gprof, oprofile, VTune, ...) to pin this down.
>>>> If shared-memory message passing slows your application down, the
>>>> question is whether this is due to excessive eviction of data from
>>>> cache; not a simple question, as most recent CPUs have 3 levels of
>>>> cache, your application may require more or less of the data that was
>>>> in use prior to the message receipt, and it may immediately use only a
>>>> small piece of a large message.
>>>
>>> There were several papers published in earlier years about shared-memory
>>> performance in the 1.2 series. There were known problems with that
>>> implementation, which is why it was heavily revised for the 1.3/1.4
>>> series.
>>>
>>> You might also look at the following links, though much of the material
>>> has been updated for the 1.3/1.4 series, as we don't really support 1.2
>>> any more:
>>>
>>> http://www.open-mpi.org/faq/?category=sm
>>>
>>> http://www.open-mpi.org/faq/?category=perftools
>>>
>>>> --
>>>> Tim Prince
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
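For anyone wanting to reproduce the comparison discussed above, here is a rough sketch: it times the same job with the shared-memory BTL disabled (`--mca btl tcp,self`) and enabled (`--mca btl tcp,sm,self`), and first checks that /tmp, where the sm BTL keeps its backing files, is on a local filesystem (sm over an NFS-backed /tmp is a known cause of slowdowns). The program name `./myapp` and the `-np 2` job size are placeholders for your own application.

```shell
#!/bin/sh
# Sketch: A/B-test the Open MPI shared-memory BTL. "./myapp" and "-np 2"
# are placeholders; substitute your own application and process count.

# The sm BTL places its backing files under /tmp by default; if /tmp is
# on NFS rather than a local disk, shared-memory transfers can be very slow.
df -T /tmp

if command -v mpirun >/dev/null 2>&1 && [ -x ./myapp ]; then
    # Shared memory off: on-node process pairs fall back to TCP loopback.
    time mpirun -np 2 --mca btl tcp,self ./myapp
    # Shared memory on: on-node process pairs use the sm BTL.
    time mpirun -np 2 --mca btl tcp,sm,self ./myapp
else
    # Open MPI or the application is not present; show the commands instead.
    echo "mpirun or ./myapp not found; commands to compare:"
    echo "  mpirun -np 2 --mca btl tcp,self ./myapp"
    echo "  mpirun -np 2 --mca btl tcp,sm,self ./myapp"
fi
```

If both wall-clock times are close, or the sm run is slower, comparing the `df -T /tmp` output against a known-local mount is the first thing to rule out.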