[Bug target/119900] [16 regression] imagick slowdown with -Ofast -march=native -fprofile-use since r16-39-gf6859fb621179e (interaction of rpad and late-combine)

hubicka at gcc dot gnu.org via Gcc-bugs Tue, 29 Apr 2025 06:33:21 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119900


Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org
             Status|UNCONFIRMED                 |ASSIGNED
            Summary|[16 regression] imagick     |[16 regression] imagick
                   |slowdown with -Ofast        |slowdown with -Ofast
                   |-march=native -fprofile-use |-march=native -fprofile-use
                   |since                       |since
                   |r16-39-gf6859fb621179e      |r16-39-gf6859fb621179e
                   |                            |(interaction of rpad and
                   |                            |late-combine)
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2025-04-29

--- Comment #4 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
The problem is an interaction with the size_costs change. Either this patch,
which reverts that change:

diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index cddcf617304..a2512c7209a 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -121,17 +121,17 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
   COSTS_N_BYTES (2),                   /* cost of FCHS instruction.  */
   COSTS_N_BYTES (2),                   /* cost of FSQRT instruction.  */

-  COSTS_N_BYTES (4),                   /* cost of cheap SSE instruction.  */
-  COSTS_N_BYTES (4),                   /* cost of ADDSS/SD SUBSS/SD insns.  */
-  COSTS_N_BYTES (4),                   /* cost of MULSS instruction.  */
-  COSTS_N_BYTES (4),                   /* cost of MULSD instruction.  */
-  COSTS_N_BYTES (4),                   /* cost of FMA SS instruction.  */
-  COSTS_N_BYTES (4),                   /* cost of FMA SD instruction.  */
-  COSTS_N_BYTES (4),                   /* cost of DIVSS instruction.  */
-  COSTS_N_BYTES (4),                   /* cost of DIVSD instruction.  */
-  COSTS_N_BYTES (4),                   /* cost of SQRTSS instruction.  */
-  COSTS_N_BYTES (4),                   /* cost of SQRTSD instruction.  */
-  COSTS_N_BYTES (4),                   /* cost of CVTSS2SD etc.  */
+  COSTS_N_BYTES (2),                   /* cost of cheap SSE instruction.  */
+  COSTS_N_BYTES (2),                   /* cost of ADDSS/SD SUBSS/SD insns.  */
+  COSTS_N_BYTES (2),                   /* cost of MULSS instruction.  */
+  COSTS_N_BYTES (2),                   /* cost of MULSD instruction.  */
+  COSTS_N_BYTES (2),                   /* cost of FMA SS instruction.  */
+  COSTS_N_BYTES (2),                   /* cost of FMA SD instruction.  */
+  COSTS_N_BYTES (2),                   /* cost of DIVSS instruction.  */
+  COSTS_N_BYTES (2),                   /* cost of DIVSD instruction.  */
+  COSTS_N_BYTES (2),                   /* cost of SQRTSS instruction.  */
+  COSTS_N_BYTES (2),                   /* cost of SQRTSD instruction.  */
+  COSTS_N_BYTES (2),                   /* cost of CVTSS2SD etc.  */
   COSTS_N_BYTES (4),                   /* cost of 256bit VCVTPS2PD etc.  */
   COSTS_N_BYTES (6),                   /* cost of 512bit VCVTPS2PD etc.  */
   1, 1, 1, 1,                          /* reassoc int, fp, vec_int, vec_fp.  */

or -fno-late-combine-instructions avoids the performance regression.

The internal loop of imagick is considered cold with FDO since the train run
does not exercise it at all.  The generated code changes as follows:

@@ -1156,9 +1156,8 @@
        call    ParseGeometry
        vmovsd  16(%rsp), %xmm7
        vmovsd  .LC30(%rip), %xmm4
-       vxorps  %xmm3, %xmm3, %xmm3
-       testb   $4, %al
        vmovsd  %xmm7, 8(%rsp)
+       testb   $4, %al
        jne     .L210
 .L166: 
        vmovapd %xmm4, %xmm2

....

        jne     .L211
 .L167: 
        vmovsd  .LC31(%rip), %xmm5
-       vcvtusi2sdq     %r13, %xmm3, %xmm1
+       vcvtusi2sdq     %r13, %xmm1, %xmm1
        vmulsd  %xmm5, %xmm1, %xmm0

So previously we cleared xmm3 and used it in integer->fp conversions, while
after the patch we don't, resulting in a false dependency on xmm1.
The clear is introduced by the i386-specific rpad pass, which disables itself
for functions optimized for size, but in this case the function contains hot
and cold regions, so it inserts xors also into the cold part of the program.

bool
ix86_rpad_gate ()
{   
  return (TARGET_AVX
          && TARGET_SSE_PARTIAL_REG_DEPENDENCY
          && TARGET_SSE_MATH
          && optimize
          && optimize_function_for_speed_p (cfun));
}                                      
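
The false dependency can be illustrated with a small plain-C model of the
merging conversion. This is only a sketch, not GCC or hardware code: v2df,
cvtusi2sd_model and check_merge_model are hypothetical names, and the struct
merely stands in for an xmm register. The point is that the upper lane of the
result is copied from the source operand, so the instruction has to wait for
whatever last wrote that register unless it was freshly zeroed.

```c
#include <assert.h>
#include <stdint.h>

/* Two-lane double vector standing in for an xmm register (V2DF).
   Hypothetical type used only for this illustration.  */
typedef struct { double lane[2]; } v2df;

/* Model of vec_merge (vec_duplicate (uns_float (value)), src, 0x1):
   lane 0 receives the converted integer, lane 1 is copied from SRC.
   The copy of lane 1 is what creates the dependency on SRC.  */
v2df
cvtusi2sd_model (v2df src, uint64_t value)
{
  v2df out = src;
  out.lane[0] = (double) value;
  return out;
}

/* Small self-check showing how the choice of SRC leaks into the result.  */
void
check_merge_model (void)
{
  v2df stale = { { 1.5, -7.0 } };  /* register with unrelated old contents */
  v2df zero = { { 0.0, 0.0 } };    /* register cleared by an xor */

  v2df from_stale = cvtusi2sd_model (stale, 42);
  v2df from_zero = cvtusi2sd_model (zero, 42);

  /* Lane 0 is the same either way...  */
  assert (from_stale.lane[0] == 42.0);
  assert (from_zero.lane[0] == 42.0);
  /* ...but lane 1 depends on the source operand.  */
  assert (from_stale.lane[1] == -7.0);
  assert (from_zero.lane[1] == 0.0);
}
```

In this model the scalar result in lane 0 never needs the old contents, which
is why the dependency is called false; the xor only ensures that the source
operand has no long-running producer.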

I suppose it would make sense to disable RPAD for cold regions of the program
(which would make the situation worse for imagick though).

This is now undone by the late-combine pass:

trying to combine definition of r24 in:
  388: xmm4:V2DF=vec_merge(vec_duplicate(uns_float(r14:DI)),xmm3:V2DF,0x1)
into:
  486: xmm4:DF=vec_select(xmm4:V2DF,parallel)
successfully matched this instruction to *floatunsdidf2_avx512:
(set (reg:DF 24 xmm4 [orig:168 _357 ] [168])
    (unsigned_float:DF (reg/v:DI 42 r14 [orig:151 former_height ] [151])))
original cost = 8 + 8, replacement cost = 16; keeping replacement

A cost of 8 + 8 makes sense to me: simple SSE instructions are 4 bytes and
COSTS_N_BYTES (N) is N*2.
The cost of floatunsdidf2_avx512 seems over-estimated, since
i386.c::ix86_rtx_costs does not seem to consider it, but it should be 8 as
well, I guess.

Before the change we did:

trying to combine definition of r24 in:
  388: xmm4:V2DF=vec_merge(vec_duplicate(uns_float(r14:DI)),xmm3:V2DF,0x1)
into:
  486: xmm4:DF=vec_select(xmm4:V2DF,parallel)
successfully matched this instruction to *floatunsdidf2_avx512:
(set (reg:DF 24 xmm4 [orig:168 _357 ] [168])
    (unsigned_float:DF (reg/v:DI 42 r14 [orig:151 former_height ] [151])))
original cost = 4 + 4, replacement cost = 16; rejecting replacement

So the replacement was rejected more or less by accident, since
floatunsdidf2_avx512 is over-estimated while vec_merge+vec_select was
under-estimated.
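
The arithmetic behind the two dump excerpts can be replayed with a short
sketch. The COSTS_N_BYTES scaling is the one stated above (N*2). The
acceptance rule is an assumption inferred only from the two log lines (16
against 8 + 8 is kept, 16 against 4 + 4 is rejected); keep_replacement and
check_cost_examples are hypothetical names, not late-combine functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Scaling described in the comment above: COSTS_N_BYTES (N) is N * 2.  */
#define COSTS_N_BYTES(N) ((N) * 2)

/* Hypothetical helper, not a GCC function.  Assumed rule: the combined
   instruction is kept when it is no more expensive than the definition
   and the use it replaces.  */
bool
keep_replacement (int def_cost, int use_cost, int replacement_cost)
{
  return replacement_cost <= def_cost + use_cost;
}

/* Replays the two cases from the dumps; the replacement cost of 16 is
   taken from the log, not computed here.  */
void
check_cost_examples (void)
{
  /* Simple SSE insn costed at 4 bytes: 8 + 8 against 16, kept.  */
  assert (keep_replacement (COSTS_N_BYTES (4), COSTS_N_BYTES (4), 16));
  /* Simple SSE insn costed at 2 bytes: 4 + 4 against 16, rejected.  */
  assert (!keep_replacement (COSTS_N_BYTES (2), COSTS_N_BYTES (2), 16));
}
```

Neither side of this comparison has a term for the dependency on the
destination register, so under this rule the outcome depends only on the byte
estimates.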

I think the cost model is kind of broken here, since the extra dependency is
not modelled at all.
