Re: [RFC] middle-end: Extend CSE to understand vector extracts.

Jeff Law via Gcc-patches Sun, 22 Aug 2021 20:39:17 -0700



On 1/4/2021 6:18 AM, Tamar Christina wrote:

Hi All,

I am trying to get CSE to re-use constants already inside a vector rather than
re-materializing the constant again.

Basically consider the following case:

#include <stdint.h>
#include <arm_neon.h>

uint64_t
test (uint64_t a, uint64x2_t b, uint64x2_t* rt)
{
   uint64_t arr[2] = { 0x0942430810234076UL, 0x0942430810234076UL};
   uint64_t res = a | arr[0];
   uint64x2_t val = vld1q_u64 (arr);
   *rt = vaddq_u64 (val, b);
   return res;
}

The actual behavior is inconsequential however notice that the same constants
are used in the vector (arr and later val) and in the calculation of res.

The code we generate for this however is quite sub-optimal:

test:
         adrp    x2, .LC0
         sub     sp, sp, #16
         ldr     q1, [x2, #:lo12:.LC0]
         mov     x2, 16502
         movk    x2, 0x1023, lsl 16
         movk    x2, 0x4308, lsl 32
         add     v1.2d, v1.2d, v0.2d
         movk    x2, 0x942, lsl 48
         orr     x0, x0, x2
         str     q1, [x1]
         add     sp, sp, 16
         ret
.LC0:
         .xword  667169396713799798
         .xword  667169396713799798

Essentially we materialize the same constant twice.  The reason for this is
because the front-end lowers the constant extracted from arr[0] quite early on.
If you look into the result of fre you'll find

   <bb 2> :
   arr[0] = 667169396713799798;
   arr[1] = 667169396713799798;
   res_7 = a_6(D) | 667169396713799798;
   _16 = __builtin_aarch64_ld1v2di (&arr);
   _17 = VIEW_CONVERT_EXPR<uint64x2_t>(_16);
   _11 = b_10(D) + _17;
   *rt_12(D) = _11;
   arr ={v} {CLOBBER};
   return res_7;

Which makes sense for further optimization.  However come expand time if the
constant isn't representable in the target arch it will be assigned to a
register again.

(insn 8 5 9 2 (set (reg:V2DI 99)
         (const_vector:V2DI [
                 (const_int 667169396713799798 [0x942430810234076]) repeated x2
             ])) "cse.c":7:12 -1
      (nil))
...
(insn 14 13 15 2 (set (reg:DI 103)
         (const_int 667169396713799798 [0x942430810234076])) "cse.c":8:12 -1
      (nil))
(insn 15 14 16 2 (set (reg:DI 102 [ res ])
         (ior:DI (reg/v:DI 96 [ a ])
             (reg:DI 103))) "cse.c":8:12 -1
      (nil))

So I think the key here is to be able to hash the elements of theconst_vector to the same value as the const_int. If you can hash themthe same, they'll be seen as common subexpressions regardless of theorder in which the insns appear.


My current patch for CSE is:

diff --git a/gcc/cse.c b/gcc/cse.c
index 36bcfc354d8..3cee53bed85 100644
--- a/gcc/cse.c
+++ b/gcc/cse.c
@@ -43,6 +43,7 @@ along with GCC; see the file COPYING3.  If not see
  #include "rtl-iter.h"
  #include "regs.h"
  #include "function-abi.h"
+#include "expr.h"

  /* The basic idea of common subexpression elimination is to go
     through the code, keeping a record of expressions that would
@@ -4306,6 +4307,20 @@ find_sets_in_insn (rtx_insn *insn, struct set **psets)
          someplace else, so it isn't worth cse'ing.  */
        else if (GET_CODE (SET_SRC (x)) == CALL)
         ;
+      else if (GET_CODE (SET_SRC (x)) == CONST_VECTOR)
+       {
+         /* First register the vector itself.  */
+         sets[n_sets++].rtl = x;
+         rtx src = SET_SRC (x);
+         machine_mode elem_mode = GET_MODE_INNER (GET_MODE (src));
+          /* Go over the constants of the CONST_VECTOR in forward order, to
+            put them in the same order in the SETS array.  */
+         for (unsigned i = 0; i < const_vector_encoded_nelts (src) ; i++)
+           {
+             rtx y = gen_rtx_SUBREG (elem_mode, SET_DEST (x), i);
+             sets[n_sets++].rtl = PATTERN (gen_move_insn (y, CONST_VECTOR_ELT 
(src, i)));
+           }
+       }
        else
         sets[n_sets++].rtl = x;
      }
@@ -4545,7 +4560,14 @@ cse_insn (rtx_insn *insn)
    struct set *sets = (struct set *) 0;

    if (GET_CODE (x) == SET)
-    sets = XALLOCA (struct set);
+    {
+      /* For CONST_VECTOR we wants to be able to CSE the vector itself along 
with
+        elements inside the vector if the target says it's cheap.  */
+      if (GET_CODE (SET_SRC (x)) == CONST_VECTOR)
+       sets = XALLOCAVEC (struct set, const_vector_encoded_nelts (SET_SRC (x)) 
+ 1);
+      else
+       sets = XALLOCA (struct set);
+    }
    else if (GET_CODE (x) == PARALLEL)
      sets = XALLOCAVEC (struct set, XVECLEN (x, 0));

--

This extends the sets that CSE uses to perform CSE to not only contain the
CONST_VECTOR but also the individual elements of the vector.

Seems conceptually reasonable. You probably want something similar toallow you to replace those elements in the vector as well.


For each element I generate new RTL which models them as a constant being set
into a subreg of the original vector at the index of the element in the vector.

This so that the SRC is the constant we want to CSE and DEST contains the
SUBREG to extract from the vector.

It works as expected, the testcase above generates:

test:
         adrp    x2, .LC0
         sub     sp, sp, #16
         ldr     q1, [x2, #:lo12:.LC0]
         add     v0.2d, v1.2d, v0.2d
         fmov    x2, d1
         str     q0, [x1]
         orr     x0, x0, x2
         add     sp, sp, 16
         ret
.LC0:
         .xword  667169396713799798
         .xword  667169396713799798

The problem is that this is somewhat accidental.  CSE is single pass, presumably
because it currently only tracks SETs of constants where any of the duplicates
can be replaced by any alternative (it does pick the cheapest, but all the
alternatives are valid.).

This breaks with vectors because vectors can only be used as a SRC.  The code
does validate that the resulting CSE is valid, so this does not break.

but if the INSN are flipped in RTL:

(insn 14 13 15 2 (set (reg:DI 103)
         (const_int 667169396713799798 [0x942430810234076])) "cse.c":8:12 -1
      (nil))
...
(insn 8 5 9 2 (set (reg:V2DI 99)
         (const_vector:V2DI [
                 (const_int 667169396713799798 [0x942430810234076]) repeated x2
             ])) "cse.c":7:12 -1
      (nil))

This no longer works, because it sees the constant version in insn 14 before it
sees insn 8.  When we find insn 8 we can tell that there is an instruction that
can be replaced by insn 8, but we don't know the original insn and so as a
consequence we can't update it.

Well, what I think you need to do in this case is look inside theconst_vector at each element and see if you get a hit in the hash table.


so questions:

1) Does what I'm doing make sense?

Yes so far.

2) Is there anyway to go from a SET to an insn?

Not really sure what you're asking. Are you trying to map from a setback to the insn with the set?

3) If not, can I store the insn in table_elt and have cse_insn produce a 
worklist
    of additional insn that need to be re-examined?

I think that's an indication you're going the wrong direction for thereversed case.


jeff

Re: [RFC] middle-end: Extend CSE to understand vector extracts.

Reply via email to