Hi,
this patch (partly) solves the problem in PR119900 where changing ix86_size_cost
of cheap SSE instructions from 2 bytes to 4 bytes regresses imagemagick with PGO
(119% on Core and 54% on Zen).

There is an interesting chain of problems:
 1) the train run of the SPEC2017 imagick benchmark is wrong: it does not train
    the innermost loop of MorphologyApply used in the ref run (another loop in
    the same function is trained instead)
 2) the rpad pass introduces an XMM register holding 0 which is used to break
    dependency chains in int->double conversions such as:

    vcvtusi2sdq     %r14, %xmm4, %xmm1

    xmm4 is the 0 register.  The pass turns itself off if
    optimize_function_for_speed_p is false, so it only matches in the cold
    region because MorphologyApply contains other hot code.  I think the pass
    should avoid introducing 0 registers for cold code, but that would make the
    imagick situation only worse since it is mistrained.

    The code produced is:

    tmp_reg:V2DF=vec_merge(vec_duplicate(uns_float(input_reg:DI)),input_reg2:V2DF,0x1)
    output_reg:DF=subreg (tmp_reg:V2DF, 0)
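
    For illustration (hand-written by me, not taken from the actual dump), the
    dependency-breaking sequence rpad is after would look roughly like:

        vxorps      %xmm4, %xmm4, %xmm4    # xmm4 = 0; no dependency on old xmm4
        vcvtusi2sdq %r14, %xmm4, %xmm1     # upper lanes merged from zeroed xmm4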

 3) We have a splitter translating the subreg to a vec_select, introduced by Roger:

    https://gcc.gnu.org/cgit/gcc/commit/?id=faa2202ee7fcf039b2016ce5766a2927526c5f78

    So after register allocation we end up with:
    xmm4:DF=vec_select(xmm4:V2DF,parallel[0])
 4) the late-combine pass undoes the optimization:

    trying to combine definition of r24 in:
      388: xmm4:V2DF=vec_merge(vec_duplicate(uns_float(r14:DI)),xmm3:V2DF,0x1)
    into:
      486: xmm4:DF=vec_select(xmm4:V2DF,parallel)
    successfully matched this instruction to *floatunsdidf2_avx512:
    (set (reg:DF 24 xmm4 [orig:168 _357 ] [168])
        (unsigned_float:DF (reg/v:DI 42 r14 [orig:151 former_height ] [151])))
    original cost = 8 + 8, replacement cost = 16; keeping replacement

Here the original cost is computed from cost->sse_op, and at -Os it used to be
4 + 4 (since sse_op was incorrectly cheaper).

There are multiple problems in this chain of events, but I think the first
problem is costing

      xmm4:DF=vec_select(xmm4:V2DF,parallel[0])

as a real SSE operation when it will often translate to no code at all.  Since
the VEC_SELECT is doing the job of a subreg, I think its rtx cost should be the
same as that of the subreg, which is 0.

Also I wonder:
 1) why we don't also translate SUBREGs of other types (SF, SI, DI, etc.) of a
    vector register the same way as we do for DF
 2) I think

        (define_insn "sse2_storelpd"
          [(set (match_operand:DF 0 "nonimmediate_operand"     "=m,x,x,*f,r")
            (vec_select:DF
              (match_operand:V2DF 1 "nonimmediate_operand" " v,x,m,m,m")
              (parallel [(const_int 0)])))]
          "TARGET_SSE2 && !(MEM_P (operands[0]) && MEM_P (operands[1]))"
          "@
           %vmovlpd\t{%1, %0|%0, %1}
           #
           #
           #
           #"
          [(set_attr "type" "ssemov,ssemov,ssemov,fmov,imov")
           (set (attr "prefix_data16")
             (if_then_else (eq_attr "alternative" "0")
                   (const_string "1")
                   (const_string "*")))
           (set_attr "prefix" "maybe_vex")
           (set_attr "mode" "V1DF,DF,DF,DF,DF")])

        (define_split
          [(set (match_operand:DF 0 "register_operand")
            (vec_select:DF
              (match_operand:V2DF 1 "nonimmediate_operand")
              (parallel [(const_int 0)])))]
          "TARGET_SSE2 && reload_completed"
          [(set (match_dup 0) (match_dup 1))]
          "operands[1] = gen_lowpart (DFmode, operands[1]);")

   effectively hides from the register allocator the fact that if operand 1 is
   the same as operand 0, the splitter will lead to elimination of the
   instruction.  Perhaps we should add an extra alternative "=mx" "0" and
   increase the cost of the others?
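
   A hypothetical sketch of that extra alternative (untested, written by me,
   not part of this patch) would let operand 1 match operand 0 so that the
   no-op case becomes visible to the allocator:

        (define_insn "sse2_storelpd"
          [(set (match_operand:DF 0 "nonimmediate_operand"     "=m,x,x,*f,r,x")
            (vec_select:DF
              (match_operand:V2DF 1 "nonimmediate_operand" " v,x,m,m,m,0")
              (parallel [(const_int 0)])))]
          ...)

   where the new last alternative would split to nothing after reload and the
   remaining alternatives could be made more expensive.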
 3) the cost of

    (set (reg:DF 24 xmm4 [orig:168 _357 ] [168])
        (unsigned_float:DF (reg/v:DI 42 r14 [orig:151 former_height ] [151])))

    is computed as 16 (i.e. 8 bytes), which is not quite correct, and I guess
    that is because ix86_rtx_costs does not handle unsigned_float at all.  We
    end up using the generic rtx_cost:

    factor = mode_size > UNITS_PER_WORD ? mode_size / UNITS_PER_WORD : 2;
    total = factor * COSTS_N_INSNS (1);

    This is lucky here since it prevents late combine from undoing the rpad
    optimization.  I wonder how to model it correctly, though.

    Making RTX cost of

      (uns_float(r14:DI))

    and RTX cost of

      vec_merge(vec_duplicate(uns_float(r14:DI)),xmm3:V2DF,0x1)

    cheap would prevent late combine from doing the replacement, but it would
    also confuse optimizations running before rpad, such as loop invariant
    motion, as all conversions would look cheap...

Bootstrapped/regtested on x86_64-linux. Does it make sense?

        PR target/119900
        * config/i386/i386.cc (ix86_rtx_costs): Set cost of VEC_SELECT
        selecting the first lane to 0.
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 3171d6e0ad4..15220ca6647 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -22571,13 +22571,23 @@ ix86_rtx_costs (rtx x, machine_mode mode, int outer_code_i, int opno,
        }
       return false;
 
-    case VEC_SELECT:
     case VEC_CONCAT:
       /* ??? Assume all of these vector manipulation patterns are
         recognizable.  In which case they all pretty much have the
         same cost.  */
      *total = cost->sse_op;
      return true;
+    case VEC_SELECT:
+     gcc_checking_assert (GET_CODE (XEXP (x, 1)) == PARALLEL);
+     /* Special case extracting the first lane from the vector, which often
+       translates to no code since most SSE/AVX instructions have packed
+       and scalar forms.  */
+     if (XVECLEN (XEXP (x, 1), 0) == 1
+        && XVECEXP (XEXP (x, 1), 0, 0) == const0_rtx)
+       *total = 0;
+     else
+       *total = cost->sse_op;
+     return true;
     case VEC_DUPLICATE:
       *total = rtx_cost (XEXP (x, 0),
                         GET_MODE (XEXP (x, 0)),
