https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178

--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #9)
>     1703 :   401cb1: vmovq  %xmm1,%r9   (*)
>      834 :   401cb6: vmovq  %r8,%xmm1
>     1719 :   401cbb: vmovq  %r9,%xmm0   (*)
> 
> Looks like %r9 is dead after the second (*), and it can be optimized to
> 
>     1703 :   401cb1: vmovq  %xmm1,%xmm0
>      834 :   401cb6: vmovq  %r8,%xmm1

Yep, we also have code like

-       movabsq $0x3ff03db8fde2ef4e, %r8
...
-       vmovq   %r8, %xmm11

or

        movq    .LC11(%rip), %rax
        vmovq   %rax, %xmm14

which is extremely odd to see (I didn't check how we arrive at that).
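
For reference, here is a minimal C sketch of the kind of code I'd expect to
provoke this; the function and the constants are made up for illustration and
I have not verified that it actually reproduces the movabsq/vmovq pattern:

/* Illustration only: several double constants are live at the same time,
   which under -mtune=znver2 can tempt the allocator to keep FP values and
   constant bit patterns in GPRs instead of spilling them or reloading
   them from the constant pool.  */
void
step (double *restrict dst, const double *restrict src, long n)
{
  for (long i = 0; i < n; i++)
    {
      double a = src[i] * 1.25 + 0.3333333333333333;
      double b = src[i] * 0.0277777777777778 - 0.0833333333333333;
      double c = a * b + 1.1666666666666667;
      dst[i] = a * 0.5 + b * 0.25 + c * 0.125;
    }
}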

When I do

diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index 017ffa69958..4c51358d7b6 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1585,7 +1585,7 @@ struct processor_costs znver2_cost = {
                                           in 32,64,128,256 and 512-bit.  */
   {8, 8, 8, 8, 16},                    /* cost of storing SSE registers
                                           in 32,64,128,256 and 512-bit.  */
-  6, 6,                                        /* SSE->integer and integer->SSE
+  8, 8,                                        /* SSE->integer and integer->SSE
                                           moves.  */
   8, 8,                                /* mask->integer and integer->mask moves */
   {6, 6, 6},                           /* cost of loading mask register

performance improves from 128 seconds to 115 seconds.  The result is
a lot more stack spilling in the code (a rough cost comparison further
below sketches why), but there are still cases like

        movq    .LC8(%rip), %rax
        vmovsd  .LC13(%rip), %xmm6
        vmovsd  .LC16(%rip), %xmm11
        vmovsd  .LC14(%rip), %xmm3
        vmovsd  .LC12(%rip), %xmm14
        vmovq   %rax, %xmm2
        vmovq   %rax, %xmm0
        movq    .LC9(%rip), %rax

See how we load .LC8 into %rax just to move it to %xmm2 and %xmm0, instead
of at least copying %xmm2 to %xmm0 (maybe that's a job for cprop_hardreg)
or loading .LC8 directly into %xmm0.
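
Coming back to the extra spilling: a rough back-of-the-envelope (a sketch
only, assuming the znver2 cost-table entries are directly comparable and
that a 64-bit SSE reload costs about the same as the store, i.e. 8) shows
why the cost bump makes the allocator spill instead of bouncing values
through GPRs:

#include <stdio.h>

int
main (void)
{
  int sse_store = 8, sse_load = 8;     /* 64-bit SSE spill store/reload    */
  int old_roundtrip = 6 + 6;           /* SSE->int + int->SSE before patch */
  int new_roundtrip = 8 + 8;           /* SSE->int + int->SSE after patch  */

  printf ("before: GPR round-trip %d vs. spill %d\n",
          old_roundtrip, sse_store + sse_load);
  printf ("after:  GPR round-trip %d vs. spill %d\n",
          new_roundtrip, sse_store + sse_load);
  return 0;
}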

In the end register pressure is the main issue, but how we deal with it
is bad.  It's likely caused by a combination of PRE, hoisting and sinking,
which together exploit

  if( ((*((unsigned int*) ((void*) (&((((srcGrid)[((FLAGS)+N_CELL_ENTRIES*((0)+
(0)*(1*(100))+(0)*(1*(100))*(1*(100))))+(i)]))))))) & (ACCEL))) {
   ux = 0.005;
   uy = 0.002;
   uz = 0.000;
  }

which makes the subsequent computations partly resolvable at compile time.
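
To illustrate the shape of this (a reduced sketch, not the actual benchmark
source; names and constants are made up):

/* On the ACCEL-like path ux/uy/uz become compile-time constants, so
   PRE / hoisting / sinking can specialize the follow-up computation for
   that path and partly fold it, while the generic path stays live as
   well; that is where the register pressure comes from.  */
double
collide (const double *cell, unsigned flags)
{
  double ux = cell[1], uy = cell[2], uz = cell[3];

  if (flags & 2)                /* stands in for the (... & ACCEL) test */
    {
      ux = 0.005;
      uy = 0.002;
      uz = 0.000;
    }

  double u2 = 1.5 * (ux * ux + uy * uy + uz * uz);
  return cell[0] * (1.0 - u2) * (1.0 + 3.0 * ux);   /* simplified update */
}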

I still think the above code generation issues need to be analyzed and
we should figure out why we emit this weird code under register pressure.
I'll attach a testcase that has the function in question split out for
easier analysis.
