On 5 May 2011 18:59, Nicolas Pitre <nicolas.pi...@linaro.org> wrote:
> On Thu, 5 May 2011, David Gilbert wrote:
>
>> If people believe it's worth breaking the context-switching taboo and
>> putting a neon version into the kernel then yes I agree it's something
>> you'd want to do as a build and/or runtime selection - but that's
>> quite a big taboo to break.
>
> There is no taboo. Only numbers.
>
> The cost of using Neon in the kernel is non negligible. It is also
> hard to measure as it depends on the actual Neon usage simultaneously
> happening in user space or in other concurrent kernel contexts. This is
> not something that a dedicated benchmark can evaluate.

Agreed, and that's also why it's partly a taboo: if it's only numbers,
but numbers based on some set of benchmarks that no two people are going
to agree on, it's very difficult. It would have to show a good win on
something complex and well agreed upon.

> There _are_ cases for Neon to be used in the kernel i.e. those where the
> initial cost is offset by the gain. The first that comes to mind is
> crypto of course. But there is also simple things like CRC32 which is
> used all over the place by BTRFS for example. And that is the actual
> test case I think we should focus our efforts on, given that BTRFS is
> going to be the next major filesystem on Linux. Last time I tried BTRFS
> on ARM, the CRC32 computation was dominating CPU usage big time. CRC32
> is easy to understand, easy to validate, and will provide the right
> reason for creating the needed infrastructure to manipulate the Neon
> context in kernel space. Once that's in place we could move to other
> targets such as crypto which is already complex enough without having to
> bother with the Neon context handling.

Yes. While I've not actually looked at coding CRC32 or the crypto
things, I agree that they feel like they have much more room to work
with; that's outside the scope of what I was asked to look at, however.

> The memcpy case is not interesting. Not at all. Most kernel memcpy
> calls are for small size copies. The large copy instances are just bad
> and misdesigned in the first place if they rely on memcpy (maybe they
> should simply have a custom copy function, maybe implemented with Neon).

Even outside the kernel, vast memcpys are fairly rare as far as I can
tell: everyone knows they're going to hurt, so people try to avoid them.
The other thing is that people have been optimising ARM memcpy for
decades, and it appears to me to be hitting cache/bus bandwidth limits
somewhere (although I don't have any figures for what those bandwidths
are). There may be some scope for optimising the smaller memcpy cases
(e.g. taking advantage of things like the newer cbz instruction to cut a
few instructions out): from my graphs, the slope up to the point at
which the non-Neon code plateaus is quite gradual, which suggests it
might be possible to optimise it a bit. (Oddly, the one case where my
graph shows Neon winning is the small, ~32-byte case, where it's almost
certainly not worth the pain in the kernel of protecting the context
switch.)

> And I doubt the small memcpy's are going to gain anything from Neon.
> Even on X86 they don't do it, while they do have a CRC32 function using
> SSE2. Maybe we could use Neon for copy_page() which is one of those
> custom bulk copy functions, but I've never seen memcpy() in kernel space
> show up on any profile.

Dave

_______________________________________________
linaro-dev mailing list
linaro-dev@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-dev
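
For reference, on the "easy to validate" point about CRC32 made in the thread: a rough sketch of a bit-at-a-time reference checksum that any accelerated (e.g. Neon) version could be compared against. This assumes the Castagnoli polynomial (CRC-32C), which as far as I know is the variant BTRFS uses for its checksums. The function names are made up for illustration and are not the kernel's API, and the seed/inversion convention here is the textbook one, not necessarily what any in-kernel caller expects.

```c
#include <stddef.h>
#include <stdint.h>

/* Reflected form of the CRC-32C (Castagnoli) polynomial 0x1EDC6F41.
 * Assumption: this is the variant of interest; plain CRC-32 (IEEE)
 * would use 0xEDB88320 here instead. */
#define CRC32C_POLY_REFLECTED 0x82F63B78u

/* Hypothetical helper: fold len bytes into a running CRC state,
 * one bit at a time. Slow, but simple enough to check by eye. */
static uint32_t crc32c_update_ref(uint32_t crc, const unsigned char *buf,
                                  size_t len)
{
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++) {
            /* mask is all ones when the low bit is set, else zero */
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED & mask);
        }
    }
    return crc;
}

/* Hypothetical one-shot wrapper: initial value of all ones,
 * final result inverted (the textbook convention). */
static uint32_t crc32c_ref(const unsigned char *buf, size_t len)
{
    return crc32c_update_ref(0xFFFFFFFFu, buf, len) ^ 0xFFFFFFFFu;
}
```

A faster implementation would be validated by feeding both the same buffers (including odd lengths and split updates) and comparing results; the usual check string "123456789" should give 0xE3069283 for CRC-32C.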