Hi Jeff,
Thank you for your help.
I've run my app under mpiP both when the two worker processes are on
different nodes and when they are on the same node.

Process 0 is the manager (it only gathers the results); processes 1 and 2
are the workers (they do the computation).
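
In outline, my code is structured like this (a simplified sketch only; the
names are illustrative and the real code differs in the details):

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* manager (rank 0): only gathers the results the workers send */
        /* ... loop of MPI_Recv from ranks 1 and 2 ... */
    } else {
        /* workers (ranks 1 and 2): compute, then post MPI_Isend of the
           result chunks to rank 0, with MPI_Barrier synchronizations
           between phases (the Barrier sites in the mpiP output below) */
    }

    MPI_Finalize();
    return 0;
}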

This is the case where processes 1 and 2 are on different nodes (the run
takes 162s).

---------------------------------------------------------------------------
@--- MPI Time (seconds) ---------------------------------------------------
---------------------------------------------------------------------------
Task    AppTime    MPITime     MPI%
   0        162        162    99.99
   1        162       30.2    18.66
   2        162       14.7     9.04
   *        486        207    42.56
---------------------------------------------------------------------------
@--- Aggregate Time (top twenty, descending, milliseconds) ----------------
---------------------------------------------------------------------------
Call                 Site       Time    App%    MPI%     COV
Barrier                 5   1.28e+05   26.24   61.64    0.00
Barrier                14    2.3e+04    4.74   11.13    0.00
Barrier                 6   2.29e+04    4.72   11.08    0.00
Barrier                17   1.77e+04    3.65    8.58    1.41
Recv                    3   1.15e+04    2.37    5.58    0.00
Recv                   30   2.26e+03    0.47    1.09    0.00
Recv                   12        308    0.06    0.15    0.00
Recv                   26        286    0.06    0.14    0.00
Recv                   28        252    0.05    0.12    0.00
Recv                   31        246    0.05    0.12    0.00
Isend                  35        111    0.02    0.05    0.00
Isend                  34        108    0.02    0.05    0.00
Isend                  18        107    0.02    0.05    0.00
Isend                  19        105    0.02    0.05    0.00
Isend                   9       57.7    0.01    0.03    0.05
Isend                  32       39.7    0.01    0.02    0.00
Barrier                25       38.7    0.01    0.02    1.39
Isend                  11       38.6    0.01    0.02    0.00
Recv                   16       34.1    0.01    0.02    0.00
Recv                   27       26.5    0.01    0.01    0.00
---------------------------------------------------------------------------
@--- Aggregate Sent Message Size (top twenty, descending, bytes) ----------
---------------------------------------------------------------------------
Call                 Site      Count      Total       Avrg  Sent%
Isend                   9       4096   1.34e+08   3.28e+04  58.73
Isend                  34       1200   1.85e+07   1.54e+04   8.07
Isend                  35       1200   1.85e+07   1.54e+04   8.07
Isend                  18       1200   1.85e+07   1.54e+04   8.07
Isend                  19       1200   1.85e+07   1.54e+04   8.07
Isend                  32        240   3.69e+06   1.54e+04   1.61
Isend                  11        240   3.69e+06   1.54e+04   1.61
Isend                  15        180   3.44e+06   1.91e+04   1.51
Isend                  33         61      2e+06   3.28e+04   0.87
Isend                  10         61      2e+06   3.28e+04   0.87
Isend                  29         61      2e+06   3.28e+04   0.87
Isend                  22         61      2e+06   3.28e+04   0.87
Isend                  37        180   1.72e+06   9.57e+03   0.75
Isend                  24          2         16          8   0.00
Isend                  20          2         16          8   0.00
Send                    8          1          4          4   0.00
Send                    1          1          4          4   0.00

This is the case where processes 1 and 2 are on the same node (the run
takes 260s).

---------------------------------------------------------------------------
@--- MPI Time (seconds) ---------------------------------------------------
---------------------------------------------------------------------------
Task    AppTime    MPITime     MPI%
   0        260        260    99.99
   1        260       39.7    15.29
   2        260       26.4    10.17
   *        779        326    41.82

---------------------------------------------------------------------------
@--- Aggregate Time (top twenty, descending, milliseconds) ----------------
---------------------------------------------------------------------------
Call                 Site       Time    App%    MPI%     COV
Barrier                 5   2.23e+05   28.64   68.50    0.00
Barrier                 6   2.49e+04    3.20    7.66    0.00
Barrier                14   2.31e+04    2.96    7.09    0.00
Recv                   28   1.35e+04    1.73    4.14    0.00
Recv                   16   1.32e+04    1.70    4.06    0.00
Barrier                17   1.22e+04    1.56    3.74    1.41
Recv                    3   1.16e+04    1.48    3.55    0.00
Recv                   26   1.67e+03    0.21    0.51    0.00
Recv                   30        940    0.12    0.29    0.00
Recv                   12        674    0.09    0.21    0.00
Recv                   21        288    0.04    0.09    0.00
Recv                   31        259    0.03    0.08    0.00
Isend                   9       62.1    0.01    0.02    0.04
Recv                   27       39.5    0.01    0.01    0.00
Isend                  35       31.2    0.00    0.01    0.00
Isend                  19         31    0.00    0.01    0.00
Isend                  34         30    0.00    0.01    0.00
Isend                  18       29.4    0.00    0.01    0.00
Isend                  32       14.6    0.00    0.00    0.00
Isend                  11       14.4    0.00    0.00    0.00
---------------------------------------------------------------------------
@--- Aggregate Sent Message Size (top twenty, descending, bytes) ----------
---------------------------------------------------------------------------
Call                 Site      Count      Total       Avrg  Sent%
Isend                   9       4096   1.34e+08   3.28e+04  58.73
Isend                  34       1200   1.85e+07   1.54e+04   8.07
Isend                  35       1200   1.85e+07   1.54e+04   8.07
Isend                  18       1200   1.85e+07   1.54e+04   8.07
Isend                  19       1200   1.85e+07   1.54e+04   8.07
Isend                  32        240   3.69e+06   1.54e+04   1.61
Isend                  11        240   3.69e+06   1.54e+04   1.61
Isend                  15        180   3.44e+06   1.91e+04   1.51
Isend                  33         61      2e+06   3.28e+04   0.87
Isend                  10         61      2e+06   3.28e+04   0.87
Isend                  29         61      2e+06   3.28e+04   0.87
Isend                  22         61      2e+06   3.28e+04   0.87
Isend                  37        180   1.72e+06   9.57e+03   0.75
Isend                  24          2         16          8   0.00
Isend                  20          2         16          8   0.00
Send                    8          1          4          4   0.00
Send                    1          1          4          4   0.00

I think there is contention on the memory bus, assuming the shared memory
transport itself is working correctly. The largest messages are 4096 *
sizeof(double) = 32768 bytes each. Maybe I'm wrong on this point: is that
message size too large for shared memory?
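
For reference, the worker-side send looks roughly like this (a simplified
sketch; the buffer and function names are made up, the real code differs in
the details):

#include <mpi.h>

#define CHUNK 4096                        /* doubles per message */

/* 4096 * sizeof(double) = 32768 bytes per message, i.e. above
   btl_sm_eager_limit (4096 bytes) and equal to btl_sm_max_frag_size
   (32768 bytes) in the ompi_info output quoted below. */
void send_chunk(double *buf, MPI_Request *req)
{
    MPI_Isend(buf, CHUNK, MPI_DOUBLE, 0 /* manager */, 0 /* tag */,
              MPI_COMM_WORLD, req);
}

Site 9 alone accounts for 4096 such messages (1.34e+08 bytes total), as
shown in the sent-message-size table above.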

2011/3/30 Jeff Squyres <jsquy...@cisco.com>

> How many messages are you sending, and how large are they?  I.e., if your
> message passing is tiny, then the network transport may not be the
> bottleneck here.
>
>
> On Mar 28, 2011, at 9:41 AM, Michele Marena wrote:
>
> > I run ompi_info --param btl sm and this is the output
> >
> >                  MCA btl: parameter "btl_base_debug" (current value: "0")
> >                           If btl_base_debug is 1 standard debug is
> output, if > 1 verbose debug is output
> >                  MCA btl: parameter "btl" (current value: <none>)
> >                           Default selection set of components for the btl
> framework (<none> means "use all components that can be found")
> >                  MCA btl: parameter "btl_base_verbose" (current value:
> "0")
> >                           Verbosity level for the btl framework (0 = no
> verbosity)
> >                  MCA btl: parameter "btl_sm_free_list_num" (current
> value: "8")
> >                  MCA btl: parameter "btl_sm_free_list_max" (current
> value: "-1")
> >                  MCA btl: parameter "btl_sm_free_list_inc" (current
> value: "64")
> >                  MCA btl: parameter "btl_sm_exclusivity" (current value:
> "65535")
> >                  MCA btl: parameter "btl_sm_latency" (current value:
> "100")
> >                  MCA btl: parameter "btl_sm_max_procs" (current value:
> "-1")
> >                  MCA btl: parameter "btl_sm_sm_extra_procs" (current
> value: "2")
> >                  MCA btl: parameter "btl_sm_mpool" (current value: "sm")
> >                  MCA btl: parameter "btl_sm_eager_limit" (current value:
> "4096")
> >                  MCA btl: parameter "btl_sm_max_frag_size" (current
> value: "32768")
> >                  MCA btl: parameter "btl_sm_size_of_cb_queue" (current
> value: "128")
> >                  MCA btl: parameter "btl_sm_cb_lazy_free_freq" (current
> value: "120")
> >                  MCA btl: parameter "btl_sm_priority" (current value:
> "0")
> >                  MCA btl: parameter "btl_base_warn_component_unused"
> (current value: "1")
> >                           This parameter is used to turn on warning
> messages when certain NICs are not used
> >
> >
> > 2011/3/28 Ralph Castain <r...@open-mpi.org>
> > The fact that this exactly matches the time you measured with shared
> memory is suspicious. My guess is that you aren't actually using shared
> memory at all.
> >
> > Does your "ompi_info" output show shared memory as being available? Jeff
> or others may be able to give you some params that would let you check to
> see if sm is actually being used between those procs.
> >
> >
> >
> > On Mar 28, 2011, at 7:51 AM, Michele Marena wrote:
> >
> >> What happens with 2 processes on the same node with tcp?
> >> With --mca btl self,tcp my app runs in 23s.
> >>
> >> 2011/3/28 Jeff Squyres (jsquyres) <jsquy...@cisco.com>
> >> Ah, I didn't catch before that there were more variables than just tcp
> vs. shmem.
> >>
> >> What happens with 2 processes on the same node with tcp?
> >>
> >> Eg, when both procs are on the same node, are you thrashing caches or
> memory?
> >>
> >> Sent from my phone. No type good.
> >>
> >> On Mar 28, 2011, at 6:27 AM, "Michele Marena" <michelemar...@gmail.com>
> wrote:
> >>
> >>> In any case, thank you Tim, Ralph, and Jeff.
> >>> My sequential application runs in 24s (wall clock time).
> >>> My parallel application runs in 13s with two processes on different
> >>> nodes.
> >>> With shared memory, when two processes are on the same node, my app
> >>> runs in 23s.
> >>> I don't understand why.
> >>>
> >>> 2011/3/28 Jeff Squyres <jsquy...@cisco.com>
> >>> If your program runs faster across 3 processes, 2 of which are local to
> each other, with --mca btl tcp,self compared to --mca btl tcp,sm,self, then
> something is very, very strange.
> >>>
> >>> Tim cites all kinds of things that can cause slowdowns, but it's still
> very, very odd that simply enabling using the shared memory communications
> channel in Open MPI *slows your overall application down*.
> >>>
> >>> How much does your application slow down in wall clock time?  Seconds?
>  Minutes?  Hours?  (anything less than 1 second is in the noise)
> >>>
> >>>
> >>>
> >>> On Mar 27, 2011, at 10:33 AM, Ralph Castain wrote:
> >>>
> >>> >
> >>> > On Mar 27, 2011, at 7:37 AM, Tim Prince wrote:
> >>> >
> >>> >> On 3/27/2011 2:26 AM, Michele Marena wrote:
> >>> >>> Hi,
> >>> >>> My application performs well without shared memory, but with shared
> >>> >>> memory I get worse performance than without it.
> >>> >>> Am I making a mistake? Is there something I am overlooking?
> >>> >>> I know Open MPI uses the /tmp directory to allocate shared memory,
> >>> >>> and /tmp is on the local filesystem.
> >>> >>>
> >>> >>
> >>> >> I guess you mean shared memory message passing.   Among relevant
> parameters may be the message size where your implementation switches from
> cached copy to non-temporal (if you are on a platform where that terminology
> is used).  If built with Intel compilers, for example, the copy may be
> performed by intel_fast_memcpy, with a default setting that uses
> non-temporal copies when the message exceeds some preset size, e.g. 50% of
> the smallest L2 cache for that architecture.
> >>> >> A quick search for past posts seems to indicate that OpenMPI doesn't
> itself invoke non-temporal, but there appear to be several useful articles
> not connected with OpenMPI.
> >>> >> In case guesses aren't sufficient, it's often necessary to profile
> (gprof, oprofile, Vtune, ....) to pin this down.
> >>> >> If shared-memory message passing slows your application down, the question is
> whether this is due to excessive eviction of data from cache; not a simple
> question, as most recent CPUs have 3 levels of cache, and your application
> may require more or less data which was in use prior to the message receipt,
> and may use immediately only a small piece of a large message.
> >>> >
> >>> > There were several papers published in earlier years about shared
> memory performance in the 1.2 series. There were known problems with that
> implementation, which is why it was heavily revised for the 1.3/4 series.
> >>> >
> >>> > You might also look at the following links, though much of it has
> been updated to the 1.3/4 series as we don't really support 1.2 any more:
> >>> >
> >>> > http://www.open-mpi.org/faq/?category=sm
> >>> >
> >>> > http://www.open-mpi.org/faq/?category=perftools
> >>> >
> >>> >
> >>> >>
> >>> >> --
> >>> >> Tim Prince
> >>> >
> >>> >
> >>>
> >>>
> >>> --
> >>> Jeff Squyres
> >>> jsquy...@cisco.com
> >>> For corporate legal information go to:
> >>> http://www.cisco.com/web/about/doing_business/legal/cri/
> >>>
> >>>
> >>
> >
> >
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
