[Bug target/89057] [8/9/10/11 Regression] AArch64 ld3 st4 less optimized

abhiraj.garakapati at gmail dot com via Gcc-bugs Mon, 30 Nov 2020 07:09:46 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89057


Abhiraj Garakapati <abhiraj.garakapati at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |abhiraj.garakapati at gmail dot com

--- Comment #7 from Abhiraj Garakapati <abhiraj.garakapati at gmail dot com> ---
This issue appears during GIMPLE-to-RTL expansion (it is visible in the
test1.cpp.234r.expand dump) when compiling with -O1. It is seen at -O1, -O2
and -O3, but not at -O0.

Each of the three GIMPLE statements listed below (after the patch link) is
expanded to two move instructions during GIMPLE-to-RTL conversion. This does
not happen with GCC-7.3.0; it is seen from GCC-8.1.0 onwards because of the
patch:
https://gcc.gnu.org/git/?p=gcc.git;a=patch;h=a977dc0c5e069bf198f78ed4767deac369904301
  _68 = __builtin_aarch64_combinev8qi (_67, { 0, 0, 0, 0, 0, 0, 0, 0 });
  _69 = __builtin_aarch64_combinev8qi (_66, { 0, 0, 0, 0, 0, 0, 0, 0 });
  _70 = __builtin_aarch64_combinev8qi (_65, { 0, 0, 0, 0, 0, 0, 0, 0 });
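
As a rough illustration of what each of these statements computes, here is a
minimal scalar sketch in C. It assumes that the builtin places its first
operand in the low 64 bits of the result and the all-zero second operand in
the high 64 bits (little-endian lane order). The type v16qi_model and the
function combine_with_zero are hypothetical names used only for this sketch;
they are not part of GCC.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar stand-in for a 128-bit vector of 16 bytes. */
typedef struct { uint8_t lane[16]; } v16qi_model;

/* Models combining a 64-bit vector with an all-zero 64-bit vector:
   lanes 0-7 come from `low`, lanes 8-15 are zero.
   Lane order is an assumption (little-endian). */
v16qi_model combine_with_zero(const uint8_t low[8])
{
    v16qi_model r;
    assert(low != NULL);
    memset(r.lane, 0, sizeof r.lane);  /* zero all 16 lanes, high half stays 0 */
    memcpy(r.lane, low, 8);            /* then fill the low half from the input */
    return r;
}
```

The result has two independently written halves, which is one plausible
reading of why a combine with a constant zero half might be expanded into two
moves; this is an interpretation, not something confirmed by the dump.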

As a workaround, the issue can be avoided by adding
"-fno-move-loop-invariants".

The issue can be fixed on GCC-8.1.0 by reverting the "aarch64-simd.md"
changes from the patch:
https://gcc.gnu.org/git/?p=gcc.git;a=patch;h=a977dc0c5e069bf198f78ed4767deac369904301

I also rebuilt the toolchain with the "aarch64-simd.md" changes reverted and
checked it against the above-mentioned test case; the output matches the
expected output from GCC-7.3.0.

With GCC 8.1 and the "aarch64-simd.md" changes reverted, the inner loop is:
        .L5:
                ld3     {v4.8b-v6.8b}, [x1]
                add     x1, x1, #0x18
                mov     v0.8b, v6.8b
                mov     v1.8b, v5.8b
                mov     v2.8b, v4.8b
                mov     v3.16b, v7.16b
                st4     {v0.8b-v3.8b}, [x0]
                add     x0, x0, 32
                cmp     x3, x0
                bhi     .L5
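
For readers less familiar with ld3/st4, the following is a scalar sketch in C
of what the loop above appears to do: ld3 de-interleaves eight 3-byte elements
(24 bytes) into v4-v6, the moves reverse the channel order, and st4 interleaves
four registers back into eight 4-byte elements (32 bytes). The function name
rgb_to_4ch_model and the `fill` parameter are hypothetical; in particular, the
contents of the loop-invariant register v7 are not shown in the listing, so a
single fill byte for the fourth channel is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the ld3/st4 loop. `pixels` must be a multiple of 8,
   because each ld3/st4 pair handles 8 elements at a time.
   `fill` stands in for the loop-invariant v7 (assumed value). */
void rgb_to_4ch_model(uint8_t *dst, const uint8_t *src,
                      size_t pixels, uint8_t fill)
{
    assert(pixels % 8 == 0);
    for (size_t i = 0; i < pixels; i++) {
        dst[4 * i + 0] = src[3 * i + 2];  /* mov v0.8b, v6.8b */
        dst[4 * i + 1] = src[3 * i + 1];  /* mov v1.8b, v5.8b */
        dst[4 * i + 2] = src[3 * i + 0];  /* mov v2.8b, v4.8b */
        dst[4 * i + 3] = fill;            /* mov v3.16b, v7.16b (invariant) */
    }
}
```

Per 8 elements the source pointer advances by 24 bytes and the destination by
32 bytes, matching the two add instructions in the listing (#0x18 and 32).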

I also cross-checked the revert against the test case given in the patch:
https://gcc.gnu.org/git/?p=gcc.git;a=patch;h=a977dc0c5e069bf198f78ed4767deac369904301
The patch description says that it improves code generation for literal
vector construction by expanding and exposing the pattern to RTL optimization
earlier, and that the current implementation delays splitting the pattern
until after reload, which results in poor code generation for the following
code.

Test case from the patch:

        #include "arm_neon.h"
        int16x8_t
        foo ()
        {
          return vcombine_s16 (vdup_n_s16 (0), vdup_n_s16 (8));
        }

GCC-8.1.0 at -O1 with the "aarch64-simd.md" changes reverted:

        foo():
                adrp    x0, 0 <_Z3foov>
                ldr     q0, [x0]
                ret
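
The adrp/ldr pair above loads a single 128-bit constant, which is enough
because the vector that foo() builds is fully known at compile time. A scalar
sketch in C of that constant, assuming the usual vcombine semantics (first
argument in the low half, second in the high half); int16x8_model and
foo_model are hypothetical names used only here:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar stand-in for int16x8_t (8 lanes of 16 bits). */
typedef struct { int16_t lane[8]; } int16x8_model;

/* The model is 128 bits wide, the size of the q0 register loaded above. */
static_assert(sizeof(int16x8_model) == 16, "model must be 128 bits");

/* Scalar model of vcombine_s16 (vdup_n_s16 (0), vdup_n_s16 (8)). */
int16x8_model foo_model(void)
{
    int16x8_model r;
    for (int i = 0; i < 4; i++) {
        r.lane[i] = 0;      /* low half:  vdup_n_s16 (0) */
        r.lane[i + 4] = 8;  /* high half: vdup_n_s16 (8) */
    }
    return r;
}
```

The lanes are therefore {0, 0, 0, 0, 8, 8, 8, 8}, a literal that can be loaded
in one instruction rather than constructed at run time.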

So, reverting the "aarch64-simd.md" changes does not result in poor code
generation for this test case.
I also cross-checked this with the latest GCC release, GCC-10.2.0.

[Bug target/89057] [8/9/10/11 Regression] AArch64 ld3 st4 less optimized
