Re: Optimized kernel memcpy/memset

Nicolas Pitre Thu, 05 May 2011 11:02:38 -0700

On Thu, 5 May 2011, Måns Rullgård wrote:

> David Gilbert <david.gilb...@linaro.org> writes:
> 
> > On 5 May 2011 17:45, Deepak Saxena <dsax...@plexity.net> wrote:
> >> On May 05 2011, at 16:46, David Gilbert was caught saying:
> >>> On 5 May 2011 16:08, Måns Rullgård <m...@mansr.com> wrote:
> >>> > David Gilbert <david.gilb...@linaro.org> writes:
> >>> >> Not quite:
> >>> >>   a) Neon memcpy/memset is worse on A9 than non-neon versions (better
> >>> >> on A8 typically)
> >>> >
> >>> > That is not my experience at all.  On the contrary, I've seen memcpy
> >>> > throughput on A9 roughly double with use of NEON for large copies.
> >>> > For small copies, plain ARM might be faster since the overhead of
> >>> > preparing for a properly aligned NEON loop is avoided.
> >>> >
> >>> > What do you base your claims on?
> >>>
> >>> My tests here:
> >>> https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemcpy
> >>>
> >>> at the bottom of the page are sets of graphs for A9 (left) and A8
> >>> (right); on A9 the Neon memcpy's (red and green) top out much lower
> >>> than their non-neon best equivalents (black and cyan).  I've seen
> >>> different results for very non-aligned copies, where the vld/vst on
> >>> Neon work very well.
> >>
> >> Looking at the top part of the page, it looks like when doing large size
> >> copies, NEON has an obvious advantage; however, I'm not sure how often
> >> we do copies of that magnitude in the kernel (I would hope rarely) but
> >> I don't know that we have numbers tracking average copy sizes for
> >> different workloads. I don't think going for a one-size-fits-all
> >> approach is the ideal and instead we should provide both build
> >> and runtime configurability (something similar to the RAID
> >> code's boot-up performance tests) to allow for selection of the
> >> appropriate memcpy implementation.
> >
> > The top part of the page is A8.  The graphs at the bottom of the page
> > go up to 256k (log scale) so do have the large case and you can see
> > after the cliff where it drops off the cache the non-neon is still
> > winning for A9.
> 
> That is still well within the OMAP4 L2 cache (1MB) and the same size as
> the OMAP3 L2.  It would have been interesting to extend the graphs up to
> 8MB or so to ensure the caches become mostly irrelevant.


Please look at the subject line above again.

If you do perform 8MB memcpy calls in kernel space you have a 
bigger problem.


Nicolas
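
For reference, the runtime-selection idea quoted above (timing several
memcpy implementations at start-up, in the spirit of the RAID code's
boot-up tests) might be sketched roughly as follows. This is a user-space
illustration only, not kernel code and not anything posted in this thread:
the names copy_impl, candidates, time_copy and select_copy are
hypothetical, the two candidates are stand-ins for real ARM and NEON
routines, and clock() is a coarse timer chosen purely for portability.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature shared by every candidate implementation. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

struct copy_impl {
    const char *name;
    copy_fn fn;
};

/* Stand-in candidate 1: naive byte-at-a-time copy. */
static void *copy_bytewise(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n--)
        *d++ = *s++;
    return dst;
}

/* Stand-in candidate 2: defer to the C library's memcpy. */
static void *copy_libc(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

static const struct copy_impl candidates[] = {
    { "bytewise", copy_bytewise },
    { "libc",     copy_libc },
};

#define BENCH_BUF_SIZE (64 * 1024)
static unsigned char bench_src[BENCH_BUF_SIZE];
static unsigned char bench_dst[BENCH_BUF_SIZE];
static volatile unsigned char bench_sink; /* keeps the copies observable */

/* Time `iters` copies of `size` bytes (clamped to the buffer size). */
static double time_copy(copy_fn fn, size_t size, int iters)
{
    clock_t start, end;
    int i;

    if (size > BENCH_BUF_SIZE)
        size = BENCH_BUF_SIZE;

    start = clock();
    for (i = 0; i < iters; i++) {
        fn(bench_dst, bench_src, size);
        bench_sink = bench_dst[size ? size - 1 : 0];
    }
    end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Return the candidate with the lowest measured time for this size. */
static const struct copy_impl *select_copy(size_t size, int iters)
{
    const struct copy_impl *best = &candidates[0];
    double best_time = time_copy(best->fn, size, iters);
    size_t i;

    for (i = 1; i < sizeof(candidates) / sizeof(candidates[0]); i++) {
        double t = time_copy(candidates[i].fn, size, iters);

        if (t < best_time) {
            best_time = t;
            best = &candidates[i];
        }
    }
    return best;
}
```

A caller would run select_copy() once at start-up with a size that is
representative of its workload and keep the returned function pointer.
Given the point made above about copy sizes in kernel space, a small size
(a few hundred bytes up to a page) is probably the more relevant one to
measure, though that choice is itself an assumption.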

_______________________________________________
linaro-dev mailing list
linaro-dev@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-dev
