On 11.08.21 16:28, Stefan Roese wrote:
On 11.08.21 16:25, Tom Rini wrote:
On Wed, Aug 11, 2021 at 04:02:39PM +0200, Stefan Roese wrote:
On an NXP LX2160 based platform it has been noticed that the currently
implemented memset/memcpy functions for aarch64 are suboptimal.
Especially the memset() for clearing the NXP MC firmware memory is very
expensive (time-wise).
By using optimized functions, a speedup of roughly factor 6 has been
measured.
To be clear, you re-measured with the cache check code added, and this
is the speed up?
I forgot to do this. BTW: I was wrong about the factor of ~6. According
to my notes, it is a factor of ~4 with the optimized memset() version.
I'll follow up on this mail with some measurements for all affected
functions, using small and large sizes. Hopefully tomorrow.
Here are the numbers:
Current original version:
-------------------------
memset() 32 Bytes, 16M times:
time: 0.446 seconds
memset() 16MiB, 256 times:
time: 1.076 seconds
memcpy() 512MiB:
time: 0.224 seconds
New optimized version:
----------------------
memset() 32 Bytes, 16M times:
time: 0.287 seconds
memset() 16MiB, 256 times:
time: 0.292 seconds
memcpy() 512MiB:
time: 0.222 seconds
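For reference, timings like the ones above can be collected with a
small loop around the function under test. The following is only a
minimal host-side sketch, not the code used on the board: the helper
names time_memset() and time_memcpy() are hypothetical, and it assumes
the standard C clock() as time source, whereas a U-Boot test would use
the board's timer instead.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper (not part of the patch): run memset() `reps` times
 * over `len` bytes and return the elapsed CPU time in seconds.
 */
static double time_memset(void *buf, size_t len, unsigned long reps)
{
	/* Call through a volatile pointer so the calls are not optimized away. */
	void *(*volatile fill)(void *, int, size_t) = memset;
	clock_t start = clock();

	for (unsigned long i = 0; i < reps; i++)
		fill(buf, 0, len);

	return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Same idea for one large memcpy() of `len` bytes. */
static double time_memcpy(void *dst, const void *src, size_t len)
{
	void *(*volatile copy)(void *, const void *, size_t) = memcpy;
	clock_t start = clock();

	copy(dst, src, len);

	return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_memset(buf, 32, 16UL << 20) would mirror the small-size
case, and time_memset(buf, 16UL << 20, 256) the large-size case.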
Summary:
The optimized memcpy() performs nearly identically to the original one.
But the optimized memset() is much faster, for both small and big sizes:
a factor of ~1.6 for small sizes and ~3.7 for big sizes.
Note: These measurements were done on the NXP LX2160ARDB board.
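The factors in the summary are simply the old time divided by the new
time. As a rough sketch of that arithmetic (the helper names speedup()
and mib_per_sec() are made up for illustration, and the throughput
figure assumes the whole buffer is written on every iteration):

```c
#include <stdio.h>

/* Speedup factor: time of the original version over the optimized one. */
static double speedup(double t_old, double t_new)
{
	return t_old / t_new;
}

/* Effective throughput in MiB/s for `mib` MiB processed in `seconds`. */
static double mib_per_sec(double mib, double seconds)
{
	return mib / seconds;
}

/* Print the factors for the three measurements quoted in this mail. */
static void print_summary(void)
{
	printf("memset 32 B:   factor %.2f\n", speedup(0.446, 0.287));
	printf("memset 16 MiB: factor %.2f\n", speedup(1.076, 0.292));
	printf("memcpy 512 MiB: factor %.2f\n", speedup(0.224, 0.222));
	/* 16 MiB x 256 iterations = 4096 MiB written in total. */
	printf("memset 16 MiB: %.0f -> %.0f MiB/s\n",
	       mib_per_sec(16.0 * 256, 1.076), mib_per_sec(16.0 * 256, 0.292));
}
```

Under that assumption the large memset() goes from roughly 3.8 GiB/s to
roughly 13.7 GiB/s.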
Thanks,
Stefan