Re: Optimized kernel memcpy/memset

David Gilbert Thu, 05 May 2011 08:47:03 -0700

On 5 May 2011 16:08, Måns Rullgård <m...@mansr.com> wrote:
> David Gilbert <david.gilb...@linaro.org> writes:
>> Not quite:
>>   a) Neon memcpy/memset is worse on A9 than non-neon versions (better
>> on A8 typically)
>
> That is not my experience at all.  On the contrary, I've seen memcpy
> throughput on A9 roughly double with use of NEON for large copies.
> For small copies, plain ARM might be faster since the overhead of
> preparing for a properly aligned NEON loop is avoided.
>
> What do you base your claims on?


My tests here:
https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemcpy

at the bottom of the page are sets of graphs for A9 (left) and A8 (right);
on A9 the Neon memcpys (red and green) top out much lower than the best
non-Neon equivalents (black and cyan).  I've seen different results for
very non-aligned copies, where the vld/vst on Neon work very well.
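
For anyone wanting to try this kind of comparison themselves, here is a
minimal sketch of a throughput harness in C.  It is not the code behind the
graphs on the wiki page; the names copy_fn and copy_mbps are made up for
illustration, and clock() is assumed to be a good enough timer for a rough
figure.  It times repeated copies of a given size with the source buffer
misaligned by a chosen number of bytes, which is the case where the results
above differ most.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical type: any routine with memcpy's signature can be measured. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/*
 * Hypothetical helper: returns the throughput of `fn` in MB/s for copies
 * of `size` bytes whose source starts `src_offset` bytes past the start of
 * its allocation.  Returns a negative value on bad arguments, allocation
 * failure, or if the copied data does not match the source.  Returns 0.0
 * if the run was too short for clock() to register any elapsed time.
 */
double copy_mbps(copy_fn fn, size_t size, size_t src_offset,
                 unsigned iterations)
{
    if (fn == NULL || size == 0 || iterations == 0)
        return -1.0;

    unsigned char *src_base = malloc(size + src_offset);
    unsigned char *dst = malloc(size);
    if (src_base == NULL || dst == NULL) {
        free(src_base);
        free(dst);
        return -1.0;
    }

    unsigned char *src = src_base + src_offset;
    memset(src, 0xA5, size);
    memset(dst, 0, size);

    /* Call through a volatile pointer so the compiler cannot inline or
     * drop the copies inside the timed loop. */
    copy_fn volatile call = fn;

    clock_t start = clock();
    for (unsigned i = 0; i < iterations; i++)
        call(dst, src, size);
    clock_t end = clock();

    /* Check that the routine really copied the data. */
    int ok = (memcmp(dst, src, size) == 0);
    free(src_base);
    free(dst);
    if (!ok)
        return -1.0;

    double secs = (double)(end - start) / CLOCKS_PER_SEC;
    if (secs <= 0.0)
        return 0.0;

    return ((double)size * (double)iterations) / secs / 1e6;
}
```

Calling copy_mbps(memcpy, 4096, 0, 100000) and then
copy_mbps(memcpy, 4096, 3, 100000) gives an aligned and a misaligned figure
for the C library's memcpy; a Neon or plain ARM routine with the same
signature can be passed in place of memcpy.  Sweeping the size from a few
bytes up to well beyond the cache size is what produces curves of the kind
shown on the wiki page.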

Also, when I showed those numbers to the guys at ARM they all said it was
a bad idea to use Neon on A9 for memory manipulation workloads.

What code do you base your claims on? :-)


> I don't see the connection between Thumb2 and memcpy performance.
> Thumb2 can do anything 32-bit ARM can.

There are the purists who say write everything in Thumb2 now; however,
there is an interesting question of which is faster, and IMHO the ARM code
is likely to be a bit faster in most cases.

Dave

_______________________________________________
linaro-dev mailing list
linaro-dev@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-dev
