Hi,
this patch (partly) solves the problem in PR119900 where changing ix86_size_cost
of cheap SSE instructions from 2 bytes to 4 bytes regresses imagemagick with PGO
(119% on Core and 54% on Zen).
There is an interesting chain of problems:

 1) The train run of the SPEC2017 imagick is wrong: it does not train the
    innermost loop of MorphologyApply used in the ref run (another loop in
    the same function is trained instead).

 2) The rpad pass introduces an XMM register holding 0 which is used to
    break dependency chains in int->double conversions such as:

      vcvtusi2sdq %r14, %xmm4, %xmm1

    xmm1 is the 0 register.  The pass turns itself off if
    optimize_function_for_speed_p is false, so it only matches in the cold
    region, because MorphologyApply contains other hot code.  I think the
    pass should avoid introducing 0 registers for cold code, but that would
    only make the imagick situation worse since it is mistrained.

    The code produced is:

      tmp_reg:V2DF=vec_merge(vec_duplicate(uns_float(input_reg:DI)),input_reg2:V2DF,0x1)
      output_reg:DF=subreg(tmp_reg:V2DF, 0)

 3) We have a splitter translating the subreg to vec_select, introduced by
    Roger in
    https://gcc.gnu.org/cgit/gcc/commit/?id=faa2202ee7fcf039b2016ce5766a2927526c5f78
    so after register allocation we end up with:

      xmm4:DF=vec_select(xmm4:V2DF,parallel[0])

 4) The late combine pass undoes the optimization:

      trying to combine definition of r24 in:
        388: xmm4:V2DF=vec_merge(vec_duplicate(uns_float(r14:DI)),xmm3:V2DF,0x1)
      into:
        486: xmm4:DF=vec_select(xmm4:V2DF,parallel)
      successfully matched this instruction to *floatunsdidf2_avx512:
      (set (reg:DF 24 xmm4 [orig:168 _357 ] [168])
          (unsigned_float:DF (reg/v:DI 42 r14 [orig:151 former_height ] [151])))
      original cost = 8 + 8, replacement cost = 16; keeping replacement

    Here the original cost is computed from cost->sse_op, and at -Os it used
    to be 4 + 4 (since sse_op was incorrectly cheaper).

There are multiple problems in this chain of events, but I think the first
one is costing

  xmm4:DF=vec_select(xmm4:V2DF,parallel[0])

as a real SSE operation when it will often translate to no code at all.
Since the VEC_SELECT is doing the job of a subreg, I think its rtx cost
should be the same as that of the subreg, which is 0.
Also I wonder:

 1) Why don't we also translate SUBREGs of other types (SF, SI, DI etc.) of
    a vector register the same way as we do for DF?

 2) I think

      (define_insn "sse2_storelpd"
        [(set (match_operand:DF 0 "nonimmediate_operand"     "=m,x,x,*f,r")
              (vec_select:DF
                (match_operand:V2DF 1 "nonimmediate_operand" " v,x,m,m,m")
                (parallel [(const_int 0)])))]
        "TARGET_SSE2 && !(MEM_P (operands[0]) && MEM_P (operands[1]))"
        "@
         %vmovlpd\t{%1, %0|%0, %1}
         #
         #
         #
         #"
        [(set_attr "type" "ssemov,ssemov,ssemov,fmov,imov")
         (set (attr "prefix_data16")
           (if_then_else (eq_attr "alternative" "0")
                         (const_string "1")
                         (const_string "*")))
         (set_attr "prefix" "maybe_vex")
         (set_attr "mode" "V1DF,DF,DF,DF,DF")])

      (define_split
        [(set (match_operand:DF 0 "register_operand")
              (vec_select:DF (match_operand:V2DF 1 "nonimmediate_operand")
                             (parallel [(const_int 0)])))]
        "TARGET_SSE2 && reload_completed"
        [(set (match_dup 0) (match_dup 1))]
        "operands[1] = gen_lowpart (DFmode, operands[1]);")

    effectively hides from the register allocator the fact that if operand 1
    is the same as operand 0 the splitter will lead to elimination of the
    instruction.  Perhaps we should add an extra alternative "=mx" "0" and
    increase the cost of the others?

 3) The cost of

      (set (reg:DF 24 xmm4 [orig:168 _357 ] [168])
          (unsigned_float:DF (reg/v:DI 42 r14 [orig:151 former_height ] [151])))

    is computed as 16 (i.e. 8 bytes), which is not quite correct, and I
    guess that is because ix86_rtx_costs does not handle unsigned_float at
    all.  We end up using the generic rtx_cost:

      factor = mode_size > UNITS_PER_WORD ? mode_size / UNITS_PER_WORD : 2;
      total = factor * COSTS_N_INSNS (1);

    This is lucky here since it prevents late combine from undoing the rpad
    optimization.  I wonder how to model it correctly, though.
Making the RTX cost of

  (uns_float(r14:DI))

and the RTX cost of

  vec_merge(vec_duplicate(uns_float(r14:DI)),xmm3:V2DF,0x1)

cheap would prevent late combine from doing the replacement, but it would
also confuse optimizations running before rpad, such as loop invariant
motion, as all conversions would look expensive...

Bootstrapped/regtested x86_64-linux.  Does it make sense?

	PR target/119900
	* i386.cc (ix86_rtx_costs): Set cost of VEC_SELECT selecting
	first lane to 0.

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 3171d6e0ad4..15220ca6647 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -22571,13 +22571,23 @@ ix86_rtx_costs (rtx x, machine_mode mode, int outer_code_i, int opno,
 	}
       return false;
 
-    case VEC_SELECT:
     case VEC_CONCAT:
       /* ??? Assume all of these vector manipulation patterns are
 	 recognizable.  In which case they all pretty much have the
 	 same cost.  */
       *total = cost->sse_op;
       return true;
+    case VEC_SELECT:
+      gcc_checking_assert (GET_CODE (XEXP (x, 1)) == PARALLEL);
+      /* Special case extracting first lane from the vector which often
+	 translates to no code since most of SSE/AVX instructions have
+	 packed and single forms.  */
+      if (XVECLEN (XEXP (x, 1), 0) == 1
+	  && XVECEXP (XEXP (x, 1), 0, 0) == const0_rtx)
+	*total = 0;
+      else
+	*total = cost->sse_op;
+      return true;
 
     case VEC_DUPLICATE:
       *total = rtx_cost (XEXP (x, 0), GET_MODE (XEXP (x, 0)),