On Fri, 06 Nov 2020, Nick Desaulniers <ndesaulni...@google.com>
wrote:
On Fri, Nov 6, 2020 at 3:50 AM Adrian Ratiu
<adrian.ra...@collabora.com> wrote:
Hi Nathan,
On Fri, 06 Nov 2020, Nathan Chancellor
<natechancel...@gmail.com> wrote:
> + Ard, who wrote this code.
>
> On Fri, Nov 06, 2020 at 07:14:36AM +0200, Adrian Ratiu wrote:
>> Due to a Clang bug [1], NEON auto-loop vectorization either does
>> not happen or produces no gains, and given previous experience
>> with GCC, which generated unoptimized code worse than the default
>> asm implementation, it is safer to default Clang builds to the
>> known-good generic implementation. The kernel currently supports
>> a minimum Clang version of v10.0.1, see commit 1f7a44f63e6c
>> ("compiler-clang: add build check for clang 10.0.1"). When the
>> bug eventually gets fixed, this commit could be reverted or, if
>> the minimum Clang version bump takes a long time, a warning could
>> be added asking users to upgrade their compilers, as was done for
>> GCC.
>>
>> [1] https://bugs.llvm.org/show_bug.cgi?id=40976
>>
>> Signed-off-by: Adrian Ratiu <adrian.ra...@collabora.com>
>
> Thank you for the patch! We are also tracking this here:
>
> https://github.com/ClangBuiltLinux/linux/issues/496
>
> It was on my TODO list to revisit getting the warning eliminated,
> which likely would have involved a patch like this as well.
>
> I am curious whether it is worth revisiting or dusting off Arnd's
> patch in the LLVM bug tracker first. I have not tried it
> personally. If that is not a worthwhile option, I am fine with
> this for now. It would be nice to try and get a fix pinned down
> on the LLVM side at some point but alas, finite amount of
> resources and people :(
I tested Arnd's kernel patch from the LLVM bug tracker [1], but
with Clang v10.0.1 I still get warnings like the following, even
though the __restrict workaround does affect the generated
instructions:
./include/asm-generic/xor.h:15:2: remark: the cost-model indicates
    that interleaving is not beneficial [-Rpass-missed=loop-vectorize]
./include/asm-generic/xor.h:11:1: remark: List vectorization was
    possible but not beneficial with cost 0 >= 0
    [-Rpass-missed=slp-vectorizer]
xor_8regs_2(unsigned long bytes, unsigned long *__restrict p1,
            unsigned long *__restrict p2)
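For reference, Arnd's workaround boils down to adding __restrict
qualifiers so the compiler may assume the buffers never alias; a
minimal sketch of one routine (reduced from
include/asm-generic/xor.h, the real patch covers all xor_8regs_*
variants):

```
/* Sketch of the __restrict workaround, reduced to one routine.
 * __restrict promises the compiler that p1 and p2 never alias,
 * which is what normally blocks auto-vectorization of this loop. */
static void
xor_8regs_2(unsigned long bytes, unsigned long *__restrict p1,
	    unsigned long *__restrict p2)
{
	long lines = bytes / (sizeof (long)) / 8;

	do {
		p1[0] ^= p2[0];
		p1[1] ^= p2[1];
		p1[2] ^= p2[2];
		p1[3] ^= p2[3];
		p1[4] ^= p2[4];
		p1[5] ^= p2[5];
		p1[6] ^= p2[6];
		p1[7] ^= p2[7];
		p1 += 8;
		p2 += 8;
	} while (--lines > 0);
}
```

The remarks quoted above were presumably produced by building with
-Rpass-missed=loop-vectorize and -Rpass-missed=slp-vectorizer (the
flags named in the brackets), which make Clang report which loops it
considered but declined to vectorize, and why.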
If it's just a matter of overruling the cost model, then
#pragma clang loop vectorize(enable) will do the trick.
Indeed,

```
diff --git a/include/asm-generic/xor.h b/include/asm-generic/xor.h
index b62a2a56a4d4..8796955498b7 100644
--- a/include/asm-generic/xor.h
+++ b/include/asm-generic/xor.h
@@ -12,6 +12,7 @@ xor_8regs_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
 {
 	long lines = bytes / (sizeof (long)) / 8;
 
+#pragma clang loop vectorize(enable)
 	do {
 		p1[0] ^= p2[0];
 		p1[1] ^= p2[1];
@@ -32,6 +33,7 @@ xor_8regs_3(unsigned long bytes, unsigned long *p1, unsigned long *p2,
 {
 	long lines = bytes / (sizeof (long)) / 8;
 
+#pragma clang loop vectorize(enable)
 	do {
 		p1[0] ^= p2[0] ^ p3[0];
 		p1[1] ^= p2[1] ^ p3[1];
@@ -53,6 +55,7 @@ xor_8regs_4(unsigned long bytes, unsigned long *p1, unsigned long *p2,
 {
 	long lines = bytes / (sizeof (long)) / 8;
 
+#pragma clang loop vectorize(enable)
 	do {
 		p1[0] ^= p2[0] ^ p3[0] ^ p4[0];
 		p1[1] ^= p2[1] ^ p3[1] ^ p4[1];
@@ -75,6 +78,7 @@ xor_8regs_5(unsigned long bytes, unsigned long *p1, unsigned long *p2,
 {
 	long lines = bytes / (sizeof (long)) / 8;
 
+#pragma clang loop vectorize(enable)
 	do {
 		p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0];
 		p1[1] ^= p2[1] ^ p3[1] ^ p4[1] ^ p5[1];
```

seems to generate the vectorized code.
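(A quick way to double-check the result; the command lines below are
an illustrative sketch, not from this thread, so adjust the target
triple, flags, and file names to the actual build:)

```
$ clang --target=arm-linux-gnueabihf -mfpu=neon -O2 \
        -Rpass=loop-vectorize -c xor_test.c
$ llvm-objdump -d xor_test.o | grep veor
```

-Rpass=loop-vectorize is the positive counterpart of -Rpass-missed
and prints a remark for every loop Clang did vectorize; veor is the
NEON exclusive-or instruction the vectorized loops should compile
down to.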
Why don't we find a way to make those pragmas more
toolchain-portable, rather than open coding them like I have done
above, instead of taking this series?
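(One possible shape for such a portability wrapper, as a sketch;
VECTORIZE_LOOP is a hypothetical name, not an existing kernel macro,
and GCC has no exact equivalent of vectorize(enable), so the GCC
branch uses ivdep, which merely asserts the loop has no loop-carried
dependencies:)

```
/* Hypothetical sketch of a toolchain-portable vectorization hint.
 * VECTORIZE_LOOP is an invented name; _Pragma() lets a macro expand
 * to a #pragma directive placed immediately before the loop. */
#ifdef __clang__
/* Overrule Clang's loop-vectorizer cost model. */
#define VECTORIZE_LOOP	_Pragma("clang loop vectorize(enable)")
#elif defined(__GNUC__)
/* GCC has no direct equivalent; ivdep asserts there are no
 * loop-carried dependencies, which also unblocks the vectorizer. */
#define VECTORIZE_LOOP	_Pragma("GCC ivdep")
#else
#define VECTORIZE_LOOP
#endif

static void
xor_8regs_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
{
	long lines = bytes / (sizeof (long)) / 8;

	VECTORIZE_LOOP
	do {
		p1[0] ^= p2[0];
		p1[1] ^= p2[1];
		p1[2] ^= p2[2];
		p1[3] ^= p2[3];
		p1[4] ^= p2[4];
		p1[5] ^= p2[5];
		p1[6] ^= p2[6];
		p1[7] ^= p2[7];
		p1 += 8;
		p2 += 8;
	} while (--lines > 0);
}
```

The catch is that the two pragmas are not semantically identical, so
each compiler's output would still need to be benchmarked separately.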
Hi Nick,
Thank you very much for the suggestion.
I agree. If a toolchain-portable way can be found to reliably
trigger the optimization, I will gladly replace this patch. :)
I will work on it starting Monday and then report back my findings
or, if I can get it to work in a satisfying manner, send a v2
series directly.
The first patch is still needed because it's more of a general
cleanup, as Nathan correctly observed.
Regards,
Adrian