On 5 May 2011 16:08, Måns Rullgård <m...@mansr.com> wrote:
> David Gilbert <david.gilb...@linaro.org> writes:
>> Not quite:
>>  a) Neon memcpy/memset is worse on A9 than non-neon versions (better
>>     on A8 typically)
>
> That is not my experience at all.  On the contrary, I've seen memcpy
> throughput on A9 roughly double with use of NEON for large copies.
> For small copies, plain ARM might be faster since the overhead of
> preparing for a properly aligned NEON loop is avoided.
>
> What do you base your claims on?

My tests here:

  https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemcpy

At the bottom of the page are sets of graphs for A9 (left) and A8 (right); on A9 the Neon memcpys (red and green) top out much lower than the best non-Neon equivalents (black and cyan). I've seen different results for very non-aligned copies, where the vld/vst on Neon work very well.

Also, when I showed those numbers to the guys at ARM, they all said it was a bad idea to use Neon on A9 for memory manipulation workloads.

What code do you base your claims on? :-)

> I don't see the connection between Thumb2 and memcpy performance.
> Thumb2 can do anything 32-bit ARM can.

There are the purists who say to write everything in Thumb2 now; however, there is an interesting question of which is faster, and IMHO the ARM code is likely to be a bit faster in most cases.

Dave
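
For anyone wanting to reproduce this kind of aligned versus non-aligned comparison, here is a minimal sketch of a throughput harness in portable C. It is not the benchmark behind the graphs on the wiki page. The names `copy_fn`, `align_up` and `copy_throughput_mbps` are hypothetical, and the timing uses the standard `clock()` (CPU time), which is an assumption that only holds up for runs long enough to span many clock ticks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical type: any copy routine with memcpy's signature. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Round a pointer up to the next multiple of `align` bytes. */
static unsigned char *align_up(unsigned char *p, size_t align)
{
    assert(align != 0);
    uintptr_t addr = (uintptr_t)p;
    return p + ((align - (addr % align)) % align);
}

/*
 * Copy `size` bytes `iters` times (iters >= 1) using `fn`. The source and
 * destination start `src_off` / `dst_off` bytes past a 64-byte boundary,
 * so offsets of 0 give aligned copies and small odd offsets give
 * misaligned ones.
 *
 * Returns MB/s (1 MB = 1e6 bytes), 0.0 if the run was too short for
 * clock() to measure, or -1.0 on allocation failure or a bad copy.
 */
static double copy_throughput_mbps(copy_fn fn, size_t size,
                                   size_t src_off, size_t dst_off,
                                   unsigned iters)
{
    /* Over-allocate: up to 63 bytes of alignment padding plus the offset. */
    unsigned char *src_raw = malloc(size + src_off + 64);
    unsigned char *dst_raw = malloc(size + dst_off + 64);
    if (src_raw == NULL || dst_raw == NULL) {
        free(src_raw);
        free(dst_raw);
        return -1.0;
    }

    unsigned char *src = align_up(src_raw, 64) + src_off;
    unsigned char *dst = align_up(dst_raw, 64) + dst_off;

    /* Fill the source with a simple pattern and clear the destination. */
    for (size_t i = 0; i < size; i++)
        src[i] = (unsigned char)(i * 31u + 7u);
    memset(dst, 0, size);

    /* Calling through a volatile pointer keeps the compiler from
     * optimising the copies away. */
    copy_fn volatile call = fn;

    clock_t start = clock();
    for (unsigned i = 0; i < iters; i++)
        call(dst, src, size);
    clock_t end = clock();

    /* Check that the routine under test really copied the data. */
    int ok = (memcmp(dst, src, size) == 0);
    free(src_raw);
    free(dst_raw);
    if (!ok)
        return -1.0;

    double secs = (double)(end - start) / (double)CLOCKS_PER_SEC;
    if (secs <= 0.0)
        return 0.0;
    return ((double)size * (double)iters) / secs / 1e6;
}
```

Calling `copy_throughput_mbps(memcpy, 1u << 20, 0, 0, 200)` and then again with offsets such as 1 and 3 gives one aligned and one misaligned data point. Passing a Neon or plain-ARM routine with the same signature in place of `memcpy` lets the implementations be compared under identical conditions. Sweeping `size` from a few bytes up to several megabytes is what would show where small-copy overhead and cache effects start to matter.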