> -----Original Message-----
> From: Jan Hubicka <hubi...@ucw.cz>
> Sent: Wednesday, April 30, 2025 4:11 AM
> To: gcc-patches@gcc.gnu.org; Liu, Hongtao <hongtao....@intel.com>;
> ro...@nextmovesoftware.com; ubiz...@gmail.com
> Subject: Make ix86 cost of VEC_SELECT equivalent to SUBREG same as of
> SUBREG
> 
> Hi,
> this patch (partly) solves a problem in PR119900 where changing the
> ix86_size_cost of cheap SSE instructions from 2 bytes to 4 bytes regresses
> imagemagick with PGO (119% on Core and 54% on Zen).
> 
> There is an interesting chain of problems:
>  1) the train run of the SPEC2017 imagick is wrong and does not train the
>     innermost loop of MorphologyApply used in the ref run (another loop in
>     the same function is trained instead)
>  2) the rpad pass introduces an XMM register holding 0, which is used to
>     break dependency chains in int->double conversions such as:
> 
>     vcvtusi2sdq     %r14, %xmm4, %xmm1
> 
>     xmm4 is the 0 register.  The pass turns itself off if
>     optimize_function_for_speed is false, so it only matches in the cold
>     region because MorphologyApply contains other hot code.  I think the
>     pass should avoid introducing 0 registers for cold code, but that would
>     only make the imagick situation worse since it is mistrained.
> 
>     The code produced is:
> 
>     tmp_reg:V2DF=vec_merge(vec_duplicate(uns_float(input_reg:DI)),input_reg2:V2DF,0x1)
>     output_reg:DF=subreg (tmp_reg:V2DF, 0)
> 
>  3) We have a splitter translating the subreg to a vec_select, introduced by Roger in
> 
>     https://gcc.gnu.org/cgit/gcc/commit/?id=faa2202ee7fcf039b2016ce5766a2927526c5f78
> 
>     So after register allocation we end up with:
>     xmm4:DF=vec_select(xmm4:V2DF,parallel[0])
>  4) the late_combine pass undoes the optimization:
> 
>     trying to combine definition of r24 in:
>       388: xmm4:V2DF=vec_merge(vec_duplicate(uns_float(r14:DI)),xmm3:V2DF,0x1)
>     into:
>       486: xmm4:DF=vec_select(xmm4:V2DF,parallel)
>     successfully matched this instruction to *floatunsdidf2_avx512:
>     (set (reg:DF 24 xmm4 [orig:168 _357 ] [168])
>         (unsigned_float:DF (reg/v:DI 42 r14 [orig:151 former_height ] [151])))
>     original cost = 8 + 8, replacement cost = 16; keeping replacement
> 
> Here the original cost is computed from cost->sse_op, and at -Os it used to be
> 4+4 (since sse_op was incorrectly cheaper).
> 
> There are multiple problems in this chain of events, but I think the first
> problem is costing
> 
>       xmm4:DF=vec_select(xmm4:V2DF,parallel[0])
> 
> as a real SSE operation when it will often translate to nothing.  Since
> VEC_SELECT is doing the job of a subreg, I think the rtx cost should be the
> same as that of a subreg, which is 0.
> 
The target insn_cost hook is used to prevent the rpad optimization from being
undone by late_combine1, but it looks like that is not sufficient for the size
cost case.

static int
ix86_insn_cost (rtx_insn *insn, bool speed)
{
  int insn_cost = 0;
  /* Add extra cost to avoid post_reload late_combine revert
     the optimization did in pass_rpad.  */
  if (reload_completed
      && ix86_rpad_gate ()
      && recog_memoized (insn) >= 0
      && get_attr_avx_partial_xmm_update (insn)
         == AVX_PARTIAL_XMM_UPDATE_TRUE)
    insn_cost += COSTS_N_INSNS (3);

  return insn_cost + pattern_cost (PATTERN (insn), speed);
}
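
Presumably ix86_rpad_gate () includes the optimize-for-speed check that the
rpad pass itself uses, so the extra cost above never fires when the insn is
costed for size.  Just to illustrate what being sufficient for size_cost could
look like (a rough sketch, not tested, and it is not obvious we want the
penalty when rpad has not actually run):

static int
ix86_insn_cost (rtx_insn *insn, bool speed)
{
  int insn_cost = 0;
  /* Keep the extra cost for partial XMM updates even when the insn is
     costed for size, so post_reload late_combine does not revert the
     rpad transform in cold blocks either.  */
  if (reload_completed
      && recog_memoized (insn) >= 0
      && get_attr_avx_partial_xmm_update (insn)
         == AVX_PARTIAL_XMM_UPDATE_TRUE)
    insn_cost += COSTS_N_INSNS (3);

  return insn_cost + pattern_cost (PATTERN (insn), speed);
}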

> Also I wonder:
>  1) why we don't also translate SUBREGs of other types (SF, SI, DI, etc.) of a
>     vector register the same way as we do for DF
>  2) I think
> 
>       (define_insn "sse2_storelpd"
>         [(set (match_operand:DF 0 "nonimmediate_operand" "=m,x,x,*f,r")
>           (vec_select:DF
>             (match_operand:V2DF 1 "nonimmediate_operand" " v,x,m,m,m")
>             (parallel [(const_int 0)])))]
>         "TARGET_SSE2 && !(MEM_P (operands[0]) && MEM_P (operands[1]))"
>         "@
>          %vmovlpd\t{%1, %0|%0, %1}
>          #
>          #
>          #
>          #"
>         [(set_attr "type" "ssemov,ssemov,ssemov,fmov,imov")
>          (set (attr "prefix_data16")
>            (if_then_else (eq_attr "alternative" "0")
>                  (const_string "1")
>                  (const_string "*")))
>          (set_attr "prefix" "maybe_vex")
>          (set_attr "mode" "V1DF,DF,DF,DF,DF")])
> 
>       (define_split
>         [(set (match_operand:DF 0 "register_operand")
>           (vec_select:DF
>             (match_operand:V2DF 1 "nonimmediate_operand")
>             (parallel [(const_int 0)])))]
>         "TARGET_SSE2 && reload_completed"
>         [(set (match_dup 0) (match_dup 1))]
>         "operands[1] = gen_lowpart (DFmode, operands[1]);")
> 
>    effectively hides from the register allocator the fact that if operand 1 is
>    the same as operand 0, the splitter will lead to elimination of the
>    instruction.  Perhaps we should add an extra alternative "=mx" "0" and
>    increase the cost of the others?
Sounds reasonable.  Should we also add '?' to the other alternatives?
>  3) cost of
> 
>     (set (reg:DF 24 xmm4 [orig:168 _357 ] [168])
>         (unsigned_float:DF (reg/v:DI 42 r14 [orig:151 former_height ] [151])))
> 
>     is computed as 16 (i.e. 8 bytes), which is not quite correct, and I guess
>     that is because ix86_rtx_costs does not handle unsigned_float at all.  We
>     end up using the generic rtx_cost:
> 
>     factor = mode_size > UNITS_PER_WORD ? mode_size / UNITS_PER_WORD : 2;
>     total = factor * COSTS_N_INSNS (1);
> 
>     This is lucky here since it prevents late_combine from undoing the rpad
>     optimization.  I wonder how to model it correctly though.
> 
>     Making RTX cost of
> 
>       (uns_float(r14:DI))
> 
>     and RTX cost of
> 
>       vec_merge(vec_duplicate(uns_float(r14:DI)),xmm3:V2DF,0x1)
> 
>     cheap would prevent late combine from doing the replacement, but it would
>     also confuse optimizations running before rpad, such as loop invariant
>     motion, as all conversions would look expensive...
> 
> Bootstrapped/regtested x86_64-linux. Does it make sense?
> 
>       PR target/119900
>       * config/i386/i386.cc (ix86_rtx_costs): Set cost of VEC_SELECT selecting
>       the first lane to 0.
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 3171d6e0ad4..15220ca6647 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -22571,13 +22571,23 @@ ix86_rtx_costs (rtx x, machine_mode mode, int outer_code_i, int opno,
>        }
>       return false;
> 
> -    case VEC_SELECT:
>      case VEC_CONCAT:
>        /* ??? Assume all of these vector manipulation patterns are
>           recognizable.  In which case they all pretty much have the
>           same cost.  */
>        *total = cost->sse_op;
>        return true;
> +    case VEC_SELECT:
> +      gcc_checking_assert (GET_CODE (XEXP (x, 1)) == PARALLEL);
> +      /* Special case extracting the first lane from the vector, which often
> +         translates to no code since most SSE/AVX instructions have packed
> +         and scalar forms.  */
> +      if (XVECLEN (XEXP (x, 1), 0) == 1
> +          && XVECEXP (XEXP (x, 1), 0, 0) == const0_rtx)
> +        *total = 0;
For the cost of a subreg, rtx_cost also checks modes_tieable_p:

    case SUBREG:
      total = 0;
      /* If we can't tie these modes, make this expensive.  The larger
         the mode, the more expensive it is.  */
      if (!targetm.modes_tieable_p (mode, GET_MODE (SUBREG_REG (x))))
        return COSTS_N_INSNS (2 + factor);
      break;

And we could use vec_series_lowpart_p instead of only checking for a single
element.
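
Something along these lines on top of your patch (just a sketch, not tested;
combining the lowpart check with modes_tieable_p is my guess at how the two
suggestions fit together):

    case VEC_SELECT:
      gcc_checking_assert (GET_CODE (XEXP (x, 1)) == PARALLEL);
      /* Selecting the low part of a vector register usually translates
         to no code, like a lowpart subreg, as long as the two modes can
         be tied; otherwise treat it as a regular SSE operation.  */
      if (vec_series_lowpart_p (mode, GET_MODE (XEXP (x, 0)), XEXP (x, 1))
          && targetm.modes_tieable_p (mode, GET_MODE (XEXP (x, 0))))
        *total = 0;
      else
        *total = cost->sse_op;
      return true;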

> +      else
> +        *total = cost->sse_op;
> +      return true;
>      case VEC_DUPLICATE:
>        *total = rtx_cost (XEXP (x, 0),
>                           GET_MODE (XEXP (x, 0)),
