Re: Optimized kernel memcpy/memset

David Gilbert Thu, 05 May 2011 11:19:24 -0700

On 5 May 2011 18:59, Nicolas Pitre <nicolas.pi...@linaro.org> wrote:
> On Thu, 5 May 2011, David Gilbert wrote:
>
>> If people believe it's worth breaking the context-switching taboo and
>> putting a neon version into the kernel then yes I agree it's something
>> you'd want to do as a build and/or runtime selection - but that's
>> quite a big taboo to break.
>
> There is no taboo.  Only numbers.
>
> The cost of using Neon in the kernel is non negligible.  It is also
> hard to measure as it depends on the actual Neon usage simultaneously
> happening in user space or in other concurrent kernel contexts.  This is
> not something that a dedicated benchmark can evaluate.


Agreed, and that's also why it's partly a taboo; if it's only numbers, but
numbers based on some set of benchmarks that no two people are going to
agree on, it's very difficult. It would have to show a good win on
something complex and well agreed upon.

> There _are_ cases for Neon to be used in the kernel i.e. those where the
> initial cost is offset by the gain.  The first that comes to mind is
> crypto of course.  But there is also simple things like CRC32 which is
> used all over the place by BTRFS for example.  And that is the actual
> test case I think we should focus our efforts on, given that BTRFS is
> going to be the next major filesystem on Linux.  Last time I tried BTRFS
> on ARM, the CRC32 computation was dominating CPU usage big time. CRC32
> is easy to understand, easy to validate, and will provide the right
> reason for creating the needed infrastructure to manipulate the Neon
> context in kernel space.  Once that's in place we could move to other
> targets such as crypto which is already complex enough without having to
> bother with the Neon context handling.

Yes; while I've not actually looked at coding CRC32 or the crypto routines,
I agree that they feel like they offer much more room to work with. They are
outside the scope of what I was asked to look at, however.
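
As a rough illustration of the "easy to understand, easy to validate" point,
here is a minimal bitwise reference sketch in C. It assumes the CRC32C
(Castagnoli) variant, which as far as I know is the one BTRFS uses; the
function names are hypothetical and this is not the kernel's implementation.
A Neon or table-driven version could be checked against it byte for byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
 * Assumption: the conventional form with initial value and final XOR
 * of 0xFFFFFFFF. Pass crc = 0 to start; pass a previous result to
 * continue over the next chunk of data. */
static uint32_t crc32c_bitwise(uint32_t crc, const void *buf, size_t len)
{
    const unsigned char *p = buf;

    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++) {
            /* If the low bit is set, shift and XOR in the polynomial. */
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/* Self-test against the standard CRC32C check value for "123456789".
 * Returns 1 on success, 0 on failure. */
static int crc32c_selftest(void)
{
    return crc32c_bitwise(0, "123456789", 9) == 0xE3069283u;
}
```

Any optimised routine can be validated by feeding both it and this reference
the same buffers (including odd lengths and split chunks) and comparing the
results.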

> The memcpy case is not interesting.  Not at all.  Most kernel memcpy
> calls are for small size copies.  The large copy instances are just bad
> and misdesigned in the first place if they rely on memcpy (maybe they
> should simply have a custom copy function, maybe implemented with Neon).

Even outside the kernel, very large memcpys are fairly rare as far as I can
tell; everyone knows they're going to hurt, so people try to avoid them.
The other thing is that people have been optimising ARM memcpy for decades,
and it appears to me to be hitting cache/bus bandwidth limits somewhere
(although I don't have any figures for what those bandwidths are).

There may be some scope for optimising the smaller memcpy cases (e.g. taking
advantage of things like the newer cbz to cut a few instructions out). From
my graphs, the slope up to the point at which the non-Neon code plateaus is
quite gradual, which suggests it might be possible to optimise it a bit.
(Oddly, the one case where my graph shows Neon winning is the small,
~32-byte cases, where it's almost certainly not worth the pain in the kernel
of protecting the context switch.)
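
For anyone wanting to reproduce that kind of size-versus-throughput curve in
user space, a minimal sketch in C follows. It is not the benchmark behind the
graphs mentioned above: the function name, the use of clock(), and the fill
pattern are all assumptions, and the numbers it gives say nothing about the
cost of saving and restoring the Neon context in the kernel.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Time `iters` copies of `size` bytes and return throughput in MB/s.
 * Returns 0.0 if size is 0, allocation fails, or the run was too short
 * for clock() to measure. Hypothetical helper, for illustration only. */
static double copy_mb_per_s(size_t size, unsigned long iters)
{
    /* Calling through a volatile pointer keeps the compiler from
     * inlining or eliding the memcpy calls being timed. */
    void *(*volatile copy)(void *, const void *, size_t) = memcpy;
    volatile unsigned char sink = 0;
    unsigned char *src, *dst;
    double secs, mbps = 0.0;
    clock_t t0, t1;

    if (size == 0)
        return 0.0;
    src = malloc(size);
    dst = malloc(size);
    if (!src || !dst) {
        free(src);
        free(dst);
        return 0.0;
    }
    memset(src, 0xA5, size);

    t0 = clock();
    for (unsigned long i = 0; i < iters; i++) {
        src[0] = (unsigned char)i;      /* vary the input each pass */
        copy(dst, src, size);
        sink ^= dst[size - 1];          /* consume the result */
    }
    t1 = clock();

    secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    if (secs > 0.0)
        mbps = ((double)size * (double)iters) / secs / 1e6;

    free(src);
    free(dst);
    return mbps;
}
```

Calling copy_mb_per_s() for sizes from 8 bytes up to a few megabytes, with
the iteration count scaled down as the size grows, gives the raw points for
a plot; where the curve flattens is the plateau referred to above.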

> And I doubt the small memcpy's are going to gain anything from Neon.
> Even on X86 they don't do it, while they do have a CRC32 function using
> SSE2.  Maybe we could use Neon for copy_page() which is one of those
> custom bulk copy functions, but I've never seen memcpy() in kernel space
> show up on any profile.

Dave

_______________________________________________
linaro-dev mailing list
linaro-dev@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-dev
