Re: [RFC] middle-end: Extend CSE to understand vector extracts.

Richard Biener Mon, 04 Jan 2021 05:33:31 -0800

On Mon, 4 Jan 2021, Tamar Christina wrote:

> Hi All,
> 
> I am trying to get CSE to re-use constants already inside a vector rather than
> re-materializing the constant again.
> 
> Basically consider the following case:
> 
> #include <stdint.h>
> #include <arm_neon.h>
> 
> uint64_t
> test (uint64_t a, uint64x2_t b, uint64x2_t* rt)
> {
>   uint64_t arr[2] = { 0x0942430810234076UL, 0x0942430810234076UL};
>   uint64_t res = a | arr[0];
>   uint64x2_t val = vld1q_u64 (arr);
>   *rt = vaddq_u64 (val, b);
>   return res;
> }
> 
> The actual behavior is inconsequential however notice that the same constants
> are used in the vector (arr and later val) and in the calculation of res.
> 
> The code we generate for this however is quite sub-optimal:
> 
> test:
>         adrp    x2, .LC0
>         sub     sp, sp, #16
>         ldr     q1, [x2, #:lo12:.LC0]
>         mov     x2, 16502
>         movk    x2, 0x1023, lsl 16
>         movk    x2, 0x4308, lsl 32
>         add     v1.2d, v1.2d, v0.2d
>         movk    x2, 0x942, lsl 48
>         orr     x0, x0, x2
>         str     q1, [x1]
>         add     sp, sp, 16
>         ret
> .LC0:
>         .xword  667169396713799798
>         .xword  667169396713799798
> 
> Essentially we materialize the same constant twice.  The reason for this is
> because the front-end lowers the constant extracted from arr[0] quite early 
> on.
> If you look into the result of fre you'll find
> 
>   <bb 2> :
>   arr[0] = 667169396713799798;
>   arr[1] = 667169396713799798;
>   res_7 = a_6(D) | 667169396713799798;
>   _16 = __builtin_aarch64_ld1v2di (&arr);
>   _17 = VIEW_CONVERT_EXPR<uint64x2_t>(_16);
>   _11 = b_10(D) + _17;
>   *rt_12(D) = _11;
>   arr ={v} {CLOBBER};
>   return res_7;
> 
> Which makes sense for further optimization.  However come expand time if the
> constant isn't representable in the target arch it will be assigned to a
> register again.
> 
> (insn 8 5 9 2 (set (reg:V2DI 99)
>         (const_vector:V2DI [
>                 (const_int 667169396713799798 [0x942430810234076]) repeated x2
>             ])) "cse.c":7:12 -1
>      (nil))
> ...
> (insn 14 13 15 2 (set (reg:DI 103)
>         (const_int 667169396713799798 [0x942430810234076])) "cse.c":8:12 -1
>      (nil))
> (insn 15 14 16 2 (set (reg:DI 102 [ res ])
>         (ior:DI (reg/v:DI 96 [ a ])
>             (reg:DI 103))) "cse.c":8:12 -1
>      (nil))
> 
> And since it's out of the immediate range of the scalar instruction used
> combine won't be able to do anything here.
> 
> This will then trigger the re-materialization of the constant twice.
> 
> So I figured the best place to handle this is in CSE since in some uArch it's
> far cheaper to extract a constant from a vector than to materialize it.
> 
> Particularly doing it pre-RA has the benefit of allowing RA to decide whether 
> it
> needs to move the constant between register files or not as some uArch can
> perform scalar operation both on the SIMD and GENREG side.
> 
> The issue is I don't know that much about CSE.  I have been reading through 
> the
> source and think I have a basic understanding of how it works but this email 
> is
> to see if I'm on the right track or not (to something that is acceptable
> upstream).
> 
> My current patch for CSE is:
> 
> diff --git a/gcc/cse.c b/gcc/cse.c
> index 36bcfc354d8..3cee53bed85 100644
> --- a/gcc/cse.c
> +++ b/gcc/cse.c
> @@ -43,6 +43,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "rtl-iter.h"
>  #include "regs.h"
>  #include "function-abi.h"
> +#include "expr.h"
> 
>  /* The basic idea of common subexpression elimination is to go
>     through the code, keeping a record of expressions that would
> @@ -4306,6 +4307,20 @@ find_sets_in_insn (rtx_insn *insn, struct set **psets)
>          someplace else, so it isn't worth cse'ing.  */
>        else if (GET_CODE (SET_SRC (x)) == CALL)
>         ;
> +      else if (GET_CODE (SET_SRC (x)) == CONST_VECTOR)
> +       {
> +         /* First register the vector itself.  */
> +         sets[n_sets++].rtl = x;
> +         rtx src = SET_SRC (x);
> +         machine_mode elem_mode = GET_MODE_INNER (GET_MODE (src));
> +          /* Go over the constants of the CONST_VECTOR in forward order, to
> +            put them in the same order in the SETS array.  */
> +         for (unsigned i = 0; i < const_vector_encoded_nelts (src) ; i++)
> +           {
> +             rtx y = gen_rtx_SUBREG (elem_mode, SET_DEST (x), i);
> +             sets[n_sets++].rtl = PATTERN (gen_move_insn (y, 
> CONST_VECTOR_ELT (src, i)));
> +           }
> +       }
>        else
>         sets[n_sets++].rtl = x;
>      }
> @@ -4545,7 +4560,14 @@ cse_insn (rtx_insn *insn)
>    struct set *sets = (struct set *) 0;
> 
>    if (GET_CODE (x) == SET)
> -    sets = XALLOCA (struct set);
> +    {
> +      /* For CONST_VECTOR we wants to be able to CSE the vector itself along 
> with
> +        elements inside the vector if the target says it's cheap.  */
> +      if (GET_CODE (SET_SRC (x)) == CONST_VECTOR)
> +       sets = XALLOCAVEC (struct set, const_vector_encoded_nelts (SET_SRC 
> (x)) + 1);
> +      else
> +       sets = XALLOCA (struct set);
> +    }
>    else if (GET_CODE (x) == PARALLEL)
>      sets = XALLOCAVEC (struct set, XVECLEN (x, 0));
> 
> --
> 
> This extends the sets that CSE uses to perform CSE to not only contain the
> CONST_VECTOR but also the individual elements of the vector.
> 
> For each element I generate new RTL which models them as a constant being set
> into a subreg of the original vector at the index of the element in the 
> vector.
> 
> This so that the SRC is the constant we want to CSE and DEST contains the
> SUBREG to extract from the vector.
> 
> It works as expected, the testcase above generates:
> 
> test:
>         adrp    x2, .LC0
>         sub     sp, sp, #16
>         ldr     q1, [x2, #:lo12:.LC0]
>         add     v0.2d, v1.2d, v0.2d
>         fmov    x2, d1
>         str     q0, [x1]
>         orr     x0, x0, x2
>         add     sp, sp, 16
>         ret
> .LC0:
>         .xword  667169396713799798
>         .xword  667169396713799798
> 
> The problem is that this is somewhat accidental.  CSE is single pass, 
> presumably
> because it currently only tracks SETs of constants where any of the duplicates
> can be replaced by any alternative (it does pick the cheapest, but all the
> alternatives are valid.).
> 
> This breaks with vectors because vectors can only be used as a SRC.  The code
> does validate that the resulting CSE is valid, so this does not break.
> 
> but if the INSN are flipped in RTL:
> 
> (insn 14 13 15 2 (set (reg:DI 103)
>         (const_int 667169396713799798 [0x942430810234076])) "cse.c":8:12 -1
>      (nil))
> ...
> (insn 8 5 9 2 (set (reg:V2DI 99)
>         (const_vector:V2DI [
>                 (const_int 667169396713799798 [0x942430810234076]) repeated x2
>             ])) "cse.c":7:12 -1
>      (nil))
> 
> This no longer works, because it sees the constant version in insn 14 before 
> it
> sees insn 8.  When we find insn 8 we can tell that there is an instruction 
> that
> can be replaced by insn 8, but we don't know the original insn and so as a
> consequence we can't update it.
> 
> so questions:
> 
> 1) Does what I'm doing make sense?
> 2) Is there anyway to go from a SET to an insn?
> 3) If not, can I store the insn in table_elt and have cse_insn produce a 
> worklist
>    of additional insn that need to be re-examined?


Without being able to comment on RTL or the CSE implementation the
issue at hand (optimizing constant generation / placement) doesn't
fit CSE well but it's more a global LCM/PRE problem.  There's also
the issue that while on x86 many constants _are_ valid as immediates
CSEing them into a register (if one is available!) is still
profitable but RTL passes generally propagate / duplicate them
back into the instructions where they are valid (so "fixing" things
on GIMPLE generally doesn't work).

Also IIRC targets can delegitmize constants late (during reload/LRA)
which might cause extra complication.

Richard.

> Thanks,
> Tamar
> 
> --- inline copy of patch -- 
> 
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

Re: [RFC] middle-end: Extend CSE to understand vector extracts.

Reply via email to