https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119900
Jan Hubicka <hubicka at gcc dot gnu.org> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |rsandifo at gcc dot gnu.org Status|UNCONFIRMED |ASSIGNED Summary|[16 regression] imagick |[16 regression] imagick |slowdown with -Ofast |slowdown with -Ofast |-march=native -fprofile-use |-march=native -fprofile-use |since |since |r16-39-gf6859fb621179e |r16-39-gf6859fb621179e | |(interaction of rpad and | |late-combine) Ever confirmed|0 |1 Last reconfirmed| |2025-04-29 --- Comment #4 from Jan Hubicka <hubicka at gcc dot gnu.org> --- The problem is interaction of the size_costs change. Either this patch reverting this change: diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h index cddcf617304..a2512c7209a 100644 --- a/gcc/config/i386/x86-tune-costs.h +++ b/gcc/config/i386/x86-tune-costs.h @@ -121,17 +121,17 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */ COSTS_N_BYTES (2), /* cost of FCHS instruction. */ COSTS_N_BYTES (2), /* cost of FSQRT instruction. */ - COSTS_N_BYTES (4), /* cost of cheap SSE instruction. */ - COSTS_N_BYTES (4), /* cost of ADDSS/SD SUBSS/SD insns. */ - COSTS_N_BYTES (4), /* cost of MULSS instruction. */ - COSTS_N_BYTES (4), /* cost of MULSD instruction. */ - COSTS_N_BYTES (4), /* cost of FMA SS instruction. */ - COSTS_N_BYTES (4), /* cost of FMA SD instruction. */ - COSTS_N_BYTES (4), /* cost of DIVSS instruction. */ - COSTS_N_BYTES (4), /* cost of DIVSD instruction. */ - COSTS_N_BYTES (4), /* cost of SQRTSS instruction. */ - COSTS_N_BYTES (4), /* cost of SQRTSD instruction. */ - COSTS_N_BYTES (4), /* cost of CVTSS2SD etc. */ + COSTS_N_BYTES (2), /* cost of cheap SSE instruction. */ + COSTS_N_BYTES (2), /* cost of ADDSS/SD SUBSS/SD insns. */ + COSTS_N_BYTES (2), /* cost of MULSS instruction. */ + COSTS_N_BYTES (2), /* cost of MULSD instruction. */ + COSTS_N_BYTES (2), /* cost of FMA SS instruction. */ + COSTS_N_BYTES (2), /* cost of FMA SD instruction. */ + COSTS_N_BYTES (2), /* cost of DIVSS instruction. */ + COSTS_N_BYTES (2), /* cost of DIVSD instruction. */ + COSTS_N_BYTES (2), /* cost of SQRTSS instruction. */ + COSTS_N_BYTES (2), /* cost of SQRTSD instruction. */ + COSTS_N_BYTES (2), /* cost of CVTSS2SD etc. */ COSTS_N_BYTES (4), /* cost of 256bit VCVTPS2PD etc. */ COSTS_N_BYTES (6), /* cost of 512bit VCVTPS2PD etc. */ 1, 1, 1, 1, /* reassoc int, fp, vec_int, vec_fp. */ or -fno-late-combine-instructions avoids the performance regression. The internal loop of imagick is considered cold with FDO since train run does not train it at all. We end up changing: @@ -1156,9 +1156,8 @@ call ParseGeometry vmovsd 16(%rsp), %xmm7 vmovsd .LC30(%rip), %xmm4 - vxorps %xmm3, %xmm3, %xmm3 - testb $4, %al vmovsd %xmm7, 8(%rsp) + testb $4, %al jne .L210 .L166: vmovapd %xmm4, %xmm2 .... jne .L211 .L167: vmovsd .LC31(%rip), %xmm5 - vcvtusi2sdq %r13, %xmm3, %xmm1 + vcvtusi2sdq %r13, %xmm1, %xmm1 vmulsd %xmm5, %xmm1, %xmm0 So previously we clared xmm3 and used it in integer->fp conversions while after the patch we don't do this resulting in false dependency on xmm1. The clear is introduced by the i386 specific rpad pass that is disabling itself for functions optimized for size, but in this case the function contains hold and cold region, so it inserts xors also into cold part of the program. bool ix86_rpad_gate () { return (TARGET_AVX && TARGET_SSE_PARTIAL_REG_DEPENDENCY && TARGET_SSE_MATH && optimize && optimize_function_for_speed_p (cfun)); } I suppose it would make sense to disable RPAD for cold regions of the program (which would make situation worse for imagick though). This is now undone by late combine pass: trying to combine definition of r24 in: 388: xmm4:V2DF=vec_merge(vec_duplicate(uns_float(r14:DI)),xmm3:V2DF,0x1) into: 486: xmm4:DF=vec_select(xmm4:V2DF,parallel) successfully matched this instruction to *floatunsdidf2_avx512: (set (reg:DF 24 xmm4 [orig:168 _357 ] [168]) (unsigned_float:DF (reg/v:DI 42 r14 [orig:151 former_height ] [151]))) original cost = 8 + 8, replacement cost = 16; keeping replacement cost of 8 + 8 makes sense to me. Simple SSE instruction are 4 bytes and COST_N_BYTES(N) is N*2. cost of floatunsdidf2_avx512 seems over-estimated since i386.c::ix86_rtx_costs does not seem to consider it, but it should be 8 as well I guess. before we did: trying to combine definition of r24 in: 388: xmm4:V2DF=vec_merge(vec_duplicate(uns_float(r14:DI)),xmm3:V2DF,0x1) into: 486: xmm4:DF=vec_select(xmm4:V2DF,parallel) successfully matched this instruction to *floatunsdidf2_avx512: (set (reg:DF 24 xmm4 [orig:168 _357 ] [168]) (unsigned_float:DF (reg/v:DI 42 r14 [orig:151 former_height ] [151]))) original cost = 4 + 4, replacement cost = 16; rejecting replacement So replacement was cancelled kind of by accident since floatunsdidf2_avx512 is over-estimated while vec_merge+vec_select under-estimated. I think the cost-model is kind of broken here, since extra dependency is not modelled at all.