On Tue, Jul 6, 2021 at 9:01 PM Roger Sayle <ro...@nextmovesoftware.com> wrote:
>
>
> All of the optimizations/transformations mentioned in bugzilla for
> PR tree-optimization/40210 are already implemented in mainline GCC,
> with one exception.  In comment #5, there's a suggestion that
> (bswap64(x)>>56)&0xff can be implemented without the bswap as
> (unsigned char)x, or equivalently x&0xff.
>
> This patch implements the above optimization, and closely related
> variants.  For any single bit, (bswap(X)>>C1)&1 can be simplified
> to (X>>C2)&1, where bit position C2 is the appropriate permutation
> of C1.  Similarly, the bswap can be eliminated if the desired set of
> bits all lie within the same byte, hence (bswap(x)>>8)&255 can
> always be optimized, as can (bswap(x)>>8)&123.
>
> Previously,
>
> int foo(long long x) {
>   return (__builtin_bswap64(x) >> 56) & 0xff;
> }
>
> compiled with -O2 to
> foo:    movq    %rdi, %rax
>         bswap   %rax
>         shrq    $56, %rax
>         ret
>
> With this patch, it now compiles to
> foo:    movzbl  %dil, %eax
>         ret
>
> This patch has been tested on x86_64-pc-linux-gnu with a "make
> bootstrap" and "make -k check" with no new failures.
>
> Ok for mainline?

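The bit-position permutation described in the quoted message can be sketched in C as below. This is only an illustration, not code from the patch: the helper names (permute_bit64, mask_in_one_byte, check_bswap_masks) are hypothetical, the sketch covers only the 64-bit case, and it assumes masks no wider than one byte.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the patch): map bit position c1 of
   bswap64(x) to the corresponding bit position in x.  Byte k of the
   result comes from byte 7-k of x; the offset inside the byte is
   unchanged.  Expects 0 <= c1 <= 63.  */
static unsigned permute_bit64(unsigned c1)
{
    return (7u - c1 / 8u) * 8u + c1 % 8u;
}

/* Hypothetical helper: nonzero if every bit selected by
   (value >> c1) & mask lies in a single byte of value.
   Assumes mask <= 0xff.  */
static int mask_in_one_byte(uint64_t mask, unsigned c1)
{
    return (mask << (c1 % 8u)) <= 0xffu;
}

/* Check both identities for one input value: the single-bit form and
   the form where all mask bits fall within the same byte.  */
static void check_bswap_masks(uint64_t x)
{
    for (unsigned c1 = 0; c1 < 64; c1++) {
        unsigned c2 = permute_bit64(c1);

        /* (bswap(x) >> C1) & 1 == (x >> C2) & 1 */
        assert(((__builtin_bswap64(x) >> c1) & 1u) == ((x >> c2) & 1u));

        /* (bswap(x) >> C1) & M == (x >> C2) & M when M stays in one byte */
        for (uint64_t mask = 0; mask <= 0xffu; mask++)
            if (mask_in_one_byte(mask, c1))
                assert(((__builtin_bswap64(x) >> c1) & mask)
                       == ((x >> c2) & mask));
    }
}
```

Calling check_bswap_masks with a few arbitrary values exercises every shift count; for the example in the quoted message, permute_bit64(56) is 0, which corresponds to x & 0xff.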
I don't like get_builtin_precision too much.  Did you consider
simply using

+  (bit_and (convert1? (rshift@0 (convert2? (bswap@3 @1)) INTEGER_CST@2))

and TYPE_PRECISION (TREE_TYPE (@3))?  Although we'll see argument
promotion and thus cannot use @1 to derive the type, I think the
return value will have the original type.

Now, I see '8' being used which likely should be CHAR_TYPE_SIZE
since you also use char_type_node.

I wonder whether

+ /* (bswap(x) >> C1) & C2 can sometimes be simplified to (x >> C3) & C2.  */
+ (simplify
+  (bit_and (convert1? (rshift@0 (convert2? (bswap @1)) INTEGER_CST@2))
+          INTEGER_CST@3)

and

+ /* bswap(x) >> C1 can sometimes be simplified to (T)x >> C2.  */
+ (simplify
+  (rshift (convert? (bswap @0)) INTEGER_CST@1)

can build upon each other, for example by extending the latter
to handle more cases, transforming to ((T)x >> C2) & C3?
That might of course be only profitable when the bswap goes away.
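As a rough illustration of the second pattern, here is one reading of "bswap(x) >> C1 as (T)x >> C2" for the unsigned 32-bit case: when the shift count leaves only the top byte of the swapped value, the result depends only on the low byte of x. The function names below are hypothetical, and the exact conditions the patch uses may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the patch): compute
   __builtin_bswap32(x) >> c1 without a byte swap.
   Assumes 24 <= c1 <= 31, so only the top byte of the swapped value
   survives the shift, and that byte is the low byte of x.  */
static uint32_t bswap32_shift_high(uint32_t x, unsigned c1)
{
    return (uint32_t)(uint8_t)x >> (c1 - 24u);
}

/* Compare against the builtin for every shift count in range.  */
static void check_bswap32_shift(uint32_t x)
{
    for (unsigned c1 = 24; c1 < 32; c1++)
        assert((__builtin_bswap32(x) >> c1) == bswap32_shift_high(x, c1));
}
```

For smaller shift counts more than one byte of the swapped value survives, so a narrowing conversion alone is not enough; that is where an additional mask, as in ((T)x >> C2) & C3, would come in.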

Thanks,
Richard.

>
>
> 2021-07-06  Roger Sayle  <ro...@nextmovesoftware.com>
>
> gcc/ChangeLog
>         PR tree-optimization/40210
>         * builtins.c (get_builtin_precision): Helper function to determine
>         the precision in bits of a built-in function.
>         * builtins.h (get_builtin_precision): Prototype here.
>         * match.pd (bswap optimizations): Simplify (bswap(x)>>C1)&C2 as
>         (x>>C3)&C2 when possible.  Simplify bswap(x)>>C1 as ((T)x)>>C2
>         when possible.  Simplify bswap(x)&C1 as (x>>C2)&C1 when 0<=C1<=255.
>
> gcc/testsuite/ChangeLog
>         PR tree-optimization/40210
>         * gcc.dg/builtin-bswap-13.c: New test.
>         * gcc.dg/builtin-bswap-14.c: New test.
>
> Roger
> --
> Roger Sayle
> NextMove Software
> Cambridge, UK
>
