> -----Original Message-----
> From: Jan Hubicka <hubi...@ucw.cz>
> Sent: Wednesday, April 30, 2025 4:11 AM
> To: gcc-patches@gcc.gnu.org; Liu, Hongtao <hongtao....@intel.com>;
> ro...@nextmovesoftware.com; ubiz...@gmail.com
> Subject: Make ix86 cost of VEC_SELECT equivalent to SUBREG same as of
> SUBREG
>
> Hi,
> this patch (partly) solves problem in PR119900 where changing ix86_size_cost
> of chap SSE instruction from 2 bytes to 4 bytes regresses imagemagick with
> PGO (119% on core and 54% on Zen)
>
> There is an interesting chain of problems
> 1) the train run of the SPEC2017 imagick is wrong and it does not train the
> innermost
> loop of morphology apply used in ref run (other loop in the same function
> is
> trained
> instead)
> 2) rpad pass introduces XMM register with 0 which is used to break
> dependency chains
> in int->double conversion such as:
>
> vcvtusi2sdq %r14, %xmm4, %xmm1
>
> xmm1 is the 0 register. The pass turns itself off if
> optimize_funcition_for_speed
> is false, so it only matches in the cold region because MorphologyApply
> contains
> other hot code. I think the pass should prevent introducing 0 registers
> for
> cold code, but that would make imagick situation only worse since it is
> mistrained.
>
> The code produced is
>
>
> tmp_reg:V2DF=vec_merge(vec_duplicate(uns_float(input_reg:DI)),input_reg2
> :V2DF,0x1)
> output_reg:DF=subreg (tmp_reg:V2DF, 0)
>
> 3) We have splitter translating subreg to vec_select introduced by Roger
>
> https://gcc.gnu.org/cgit/gcc/commit/?id=faa2202ee7fcf039b2016ce5766a2
> 927526c5f78
>
> So after register allocation we end up with:
> xmm4:DF=vec_select(xmm4:V2DF,parallel[0])
> 4) late combined pass undoes the optimization:
>
> trying to combine definition of r24 in:
> 388:
> xmm4:V2DF=vec_merge(vec_duplicate(uns_float(r14:DI)),xmm3:V2DF,0x1)
> into:
> 486: xmm4:DF=vec_select(xmm4:V2DF,parallel)
> successfully matched this instruction to *floatunsdidf2_avx512:
> (set (reg:DF 24 xmm4 [orig:168 _357 ] [168])
> (unsigned_float:DF (reg/v:DI 42 r14 [orig:151 former_height ] [151])))
> original cost = 8 + 8, replacement cost = 16; keeping replacement
>
> Here the original cost is computed from cost->sse_op and at -Os it used to be
> 4+4 (since sse_op incorrecty cheaper).
>
> There are multiple problems in this chain of events, but I think first
> problem is
> costing
>
> xmm4:DF=vec_select(xmm4:V2DF,parallel[0])
>
> as real SSE operation when it will often translate to nothing. Since
> VEC_SELECT
> is doing a job of a subreg, I think the rtx cost should be same as of the
> subreg
> which is 0.
>
target_insn_cost is used to prevent rpad optimization to be restored by
late_combine1, looks like it's not sufficient for size_cost.
21804static int
21805ix86_insn_cost (rtx_insn *insn, bool speed)
21806{
21807 int insn_cost = 0;
21808 /* Add extra cost to avoid post_reload late_combine revert
21809 the optimization did in pass_rpad. */
21810 if (reload_completed
21811 && ix86_rpad_gate ()
21812 && recog_memoized (insn) >= 0
21813 && get_attr_avx_partial_xmm_update (insn)
21814 == AVX_PARTIAL_XMM_UPDATE_TRUE)
21815 insn_cost += COSTS_N_INSNS (3);
21816
21817 return insn_cost + pattern_cost (PATTERN (insn), speed);
21818}
> Also I wonder
> 1) why we don't also translate SUBREG of other types (SF, SI, DI etc.) of a
> vector register
> same was as we do DF
> 2) I think
>
> (define_insn "sse2_storelpd"
> [(set (match_operand:DF 0 "nonimmediate_operand"
> "=m,x,x,*f,r")
> (vec_select:DF
> (match_operand:V2DF 1 "nonimmediate_operand" " v,x,m,m,m")
> (parallel [(const_int 0)])))]
> "TARGET_SSE2 && !(MEM_P (operands[0]) && MEM_P
> (operands[1]))"
> "@
> %vmovlpd\t{%1, %0|%0, %1}
> #
> #
> #
> #"
> [(set_attr "type" "ssemov,ssemov,ssemov,fmov,imov")
> (set (attr "prefix_data16")
> (if_then_else (eq_attr "alternative" "0")
> (const_string "1")
> (const_string "*")))
> (set_attr "prefix" "maybe_vex")
> (set_attr "mode" "V1DF,DF,DF,DF,DF")])
>
> (define_split
> [(set (match_operand:DF 0 "register_operand")
> (vec_select:DF
> (match_operand:V2DF 1 "nonimmediate_operand")
> (parallel [(const_int 0)])))]
> "TARGET_SSE2 && reload_completed"
> [(set (match_dup 0) (match_dup 1))]
> "operands[1] = gen_lowpart (DFmode, operands[1]);")
>
> effectively hides from the register-allocation the fact that if operand1
> is same
> as operand0
> the splitter will lead to ellimination of the instruction. Pehraps we
> should add
> extra
> alternative "=mx" "0" and increase cost of the others?
Sounds reasonable, add ? to other alternatives?
> 3) cost of
>
> (set (reg:DF 24 xmm4 [orig:168 _357 ] [168])
> (unsigned_float:DF (reg/v:DI 42 r14 [orig:151 former_height ] [151])))
>
> is computed as 16 (i.e. 8 bytes) which is not quite corect and I guess it
> is
> because
> does ix86_rtx_costs does not handle unsigned_float at all. We end up
> using
> generic rtx_cost
>
> factor = mode_size > UNITS_PER_WORD ? mode_size / UNITS_PER_WORD :
> 2;
> total = factor * COSTS_N_INSNS (1);
>
> This is lucky
> here since it prevents late combine from unding the rpad optimization. I
> wonder how to model it correctly though.
>
> Making RTX cost of
>
> (uns_float(r14:DI))
>
> and RTX cost of
>
> vec_merge(vec_duplicate(uns_float(r14:DI)),xmm3:V2DF,0x1)
>
> cheap would prevent late combine from doing the replacement, but it would
> also confuse
> optimizations running before rapd, such as loop invariant motion as all
> conversions
> would look expensive...
>
> Bootstrapped/regtested x86_64-linux. Does it make sense?
>
> PR target/119900
> * i386.cc (ix86_rtx_costs): Set cost of VEC_SELECT slecting first lane
> to
> 0.
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc index
> 3171d6e0ad4..15220ca6647 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -22571,13 +22571,23 @@ ix86_rtx_costs (rtx x, machine_mode mode,
> int outer_code_i, int opno,
> }
> return false;
>
> - case VEC_SELECT:
> case VEC_CONCAT:
> /* ??? Assume all of these vector manipulation patterns are
> recognizable. In which case they all pretty much have the
> same cost. */
> *total = cost->sse_op;
> return true;
> + case VEC_SELECT:
> + gcc_checking_assert (GET_CODE (XEXP (x, 1)) == PARALLEL);
> + /* Special case extracting first lane from the vector which often
> + translates to no code since most of SSE/AVX instructions have packed
> + and single forms. */
> + if (XVECLEN (XEXP (x, 1), 0) == 1
> + && XVECEXP (XEXP (x, 1), 0, 0) == const0_rtx)
> + *total = 0;
For the cost of subreg, rtx_cost also checks modes_tieable_p.
4599 case SUBREG:
4600 total = 0;
4601 /* If we can't tie these modes, make this expensive. The larger
4602 the mode, the more expensive it is. */
4603 if (!targetm.modes_tieable_p (mode, GET_MODE (SUBREG_REG (x))))
4604 return COSTS_N_INSNS (2 + factor);
4605 break;
And we can use vec_series_lowpart_p instead of only check 1 element.
> + else
> + *total = cost->sse_op;
> + return true;
> case VEC_DUPLICATE:
> *total = rtx_cost (XEXP (x, 0),
> GET_MODE (XEXP (x, 0)),