RE: [RFC] middle-end: Extend CSE to understand vector extracts.

Richard Biener Mon, 04 Jan 2021 06:13:22 -0800

On Mon, 4 Jan 2021, Tamar Christina wrote:

> Hi Richi, 
> 
> > -----Original Message-----
> > From: Richard Biener <rguent...@suse.de>
> > Sent: Monday, January 4, 2021 1:33 PM
> > To: Tamar Christina <tamar.christ...@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; i...@airs.com;
> > l...@redhat.com
> > Subject: Re: [RFC] middle-end: Extend CSE to understand vector extracts.
> > 
> > On Mon, 4 Jan 2021, Tamar Christina wrote:
> > 
> > > Hi All,
> > >
> > > I am trying to get CSE to re-use constants already inside a vector
> > > rather than re-materializing the constant again.
> > >
> > > Basically consider the following case:
> > >
> > > #include <stdint.h>
> > > #include <arm_neon.h>
> > >
> > > uint64_t
> > > test (uint64_t a, uint64x2_t b, uint64x2_t* rt) {
> > >   uint64_t arr[2] = { 0x0942430810234076UL, 0x0942430810234076UL};
> > >   uint64_t res = a | arr[0];
> > >   uint64x2_t val = vld1q_u64 (arr);
> > >   *rt = vaddq_u64 (val, b);
> > >   return res;
> > > }
> > >
> > > The actual behavior is inconsequential however notice that the same
> > > constants are used in the vector (arr and later val) and in the 
> > > calculation of
> > res.
> > >
> > > The code we generate for this however is quite sub-optimal:
> > >
> > > test:
> > >         adrp    x2, .LC0
> > >         sub     sp, sp, #16
> > >         ldr     q1, [x2, #:lo12:.LC0]
> > >         mov     x2, 16502
> > >         movk    x2, 0x1023, lsl 16
> > >         movk    x2, 0x4308, lsl 32
> > >         add     v1.2d, v1.2d, v0.2d
> > >         movk    x2, 0x942, lsl 48
> > >         orr     x0, x0, x2
> > >         str     q1, [x1]
> > >         add     sp, sp, 16
> > >         ret
> > > .LC0:
> > >         .xword  667169396713799798
> > >         .xword  667169396713799798
> > >
> > > Essentially we materialize the same constant twice.  The reason for
> > > this is because the front-end lowers the constant extracted from arr[0]
> > quite early on.
> > > If you look into the result of fre you'll find
> > >
> > >   <bb 2> :
> > >   arr[0] = 667169396713799798;
> > >   arr[1] = 667169396713799798;
> > >   res_7 = a_6(D) | 667169396713799798;
> > >   _16 = __builtin_aarch64_ld1v2di (&arr);
> > >   _17 = VIEW_CONVERT_EXPR<uint64x2_t>(_16);
> > >   _11 = b_10(D) + _17;
> > >   *rt_12(D) = _11;
> > >   arr ={v} {CLOBBER};
> > >   return res_7;
> > >
> > > Which makes sense for further optimization.  However come expand time
> > > if the constant isn't representable in the target arch it will be
> > > assigned to a register again.
> > >
> > > (insn 8 5 9 2 (set (reg:V2DI 99)
> > >         (const_vector:V2DI [
> > >                 (const_int 667169396713799798 [0x942430810234076]) 
> > > repeated x2
> > >             ])) "cse.c":7:12 -1
> > >      (nil))
> > > ...
> > > (insn 14 13 15 2 (set (reg:DI 103)
> > >         (const_int 667169396713799798 [0x942430810234076])) "cse.c":8:12 
> > > -1
> > >      (nil))
> > > (insn 15 14 16 2 (set (reg:DI 102 [ res ])
> > >         (ior:DI (reg/v:DI 96 [ a ])
> > >             (reg:DI 103))) "cse.c":8:12 -1
> > >      (nil))
> > >
> > > And since it's out of the immediate range of the scalar instruction
> > > used combine won't be able to do anything here.
> > >
> > > This will then trigger the re-materialization of the constant twice.
> > >
> > > So I figured the best place to handle this is in CSE since in some
> > > uArch it's far cheaper to extract a constant from a vector than to 
> > > materialize
> > it.
> > >
> > > Particularly doing it pre-RA has the benefit of allowing RA to decide
> > > whether it needs to move the constant between register files or not as
> > > some uArch can perform scalar operation both on the SIMD and GENREG
> > side.
> > >
> > > The issue is I don't know that much about CSE.  I have been reading
> > > through the source and think I have a basic understanding of how it
> > > works but this email is to see if I'm on the right track or not (to
> > > something that is acceptable upstream).
> > >
> > > My current patch for CSE is:
> > >
> > > diff --git a/gcc/cse.c b/gcc/cse.c
> > > index 36bcfc354d8..3cee53bed85 100644
> > > --- a/gcc/cse.c
> > > +++ b/gcc/cse.c
> > > @@ -43,6 +43,7 @@ along with GCC; see the file COPYING3.  If not see
> > > #include "rtl-iter.h"
> > >  #include "regs.h"
> > >  #include "function-abi.h"
> > > +#include "expr.h"
> > >
> > >  /* The basic idea of common subexpression elimination is to go
> > >     through the code, keeping a record of expressions that would @@
> > > -4306,6 +4307,20 @@ find_sets_in_insn (rtx_insn *insn, struct set **psets)
> > >          someplace else, so it isn't worth cse'ing.  */
> > >        else if (GET_CODE (SET_SRC (x)) == CALL)
> > >         ;
> > > +      else if (GET_CODE (SET_SRC (x)) == CONST_VECTOR)
> > > +       {
> > > +         /* First register the vector itself.  */
> > > +         sets[n_sets++].rtl = x;
> > > +         rtx src = SET_SRC (x);
> > > +         machine_mode elem_mode = GET_MODE_INNER (GET_MODE (src));
> > > +          /* Go over the constants of the CONST_VECTOR in forward order, 
> > > to
> > > +            put them in the same order in the SETS array.  */
> > > +         for (unsigned i = 0; i < const_vector_encoded_nelts (src) ; i++)
> > > +           {
> > > +             rtx y = gen_rtx_SUBREG (elem_mode, SET_DEST (x), i);
> > > +             sets[n_sets++].rtl = PATTERN (gen_move_insn (y,
> > CONST_VECTOR_ELT (src, i)));
> > > +           }
> > > +       }
> > >        else
> > >         sets[n_sets++].rtl = x;
> > >      }
> > > @@ -4545,7 +4560,14 @@ cse_insn (rtx_insn *insn)
> > >    struct set *sets = (struct set *) 0;
> > >
> > >    if (GET_CODE (x) == SET)
> > > -    sets = XALLOCA (struct set);
> > > +    {
> > > +      /* For CONST_VECTOR we wants to be able to CSE the vector itself
> > along with
> > > +        elements inside the vector if the target says it's cheap.  */
> > > +      if (GET_CODE (SET_SRC (x)) == CONST_VECTOR)
> > > +       sets = XALLOCAVEC (struct set, const_vector_encoded_nelts
> > (SET_SRC (x)) + 1);
> > > +      else
> > > +       sets = XALLOCA (struct set);
> > > +    }
> > >    else if (GET_CODE (x) == PARALLEL)
> > >      sets = XALLOCAVEC (struct set, XVECLEN (x, 0));
> > >
> > > --
> > >
> > > This extends the sets that CSE uses to perform CSE to not only contain
> > > the CONST_VECTOR but also the individual elements of the vector.
> > >
> > > For each element I generate new RTL which models them as a constant
> > > being set into a subreg of the original vector at the index of the 
> > > element in
> > the vector.
> > >
> > > This so that the SRC is the constant we want to CSE and DEST contains
> > > the SUBREG to extract from the vector.
> > >
> > > It works as expected, the testcase above generates:
> > >
> > > test:
> > >         adrp    x2, .LC0
> > >         sub     sp, sp, #16
> > >         ldr     q1, [x2, #:lo12:.LC0]
> > >         add     v0.2d, v1.2d, v0.2d
> > >         fmov    x2, d1
> > >         str     q0, [x1]
> > >         orr     x0, x0, x2
> > >         add     sp, sp, 16
> > >         ret
> > > .LC0:
> > >         .xword  667169396713799798
> > >         .xword  667169396713799798
> > >
> > > The problem is that this is somewhat accidental.  CSE is single pass,
> > > presumably because it currently only tracks SETs of constants where
> > > any of the duplicates can be replaced by any alternative (it does pick
> > > the cheapest, but all the alternatives are valid.).
> > >
> > > This breaks with vectors because vectors can only be used as a SRC.
> > > The code does validate that the resulting CSE is valid, so this does not 
> > > break.
> > >
> > > but if the INSN are flipped in RTL:
> > >
> > > (insn 14 13 15 2 (set (reg:DI 103)
> > >         (const_int 667169396713799798 [0x942430810234076])) "cse.c":8:12 
> > > -1
> > >      (nil))
> > > ...
> > > (insn 8 5 9 2 (set (reg:V2DI 99)
> > >         (const_vector:V2DI [
> > >                 (const_int 667169396713799798 [0x942430810234076]) 
> > > repeated x2
> > >             ])) "cse.c":7:12 -1
> > >      (nil))
> > >
> > > This no longer works, because it sees the constant version in insn 14
> > > before it sees insn 8.  When we find insn 8 we can tell that there is
> > > an instruction that can be replaced by insn 8, but we don't know the
> > > original insn and so as a consequence we can't update it.
> > >
> > > so questions:
> > >
> > > 1) Does what I'm doing make sense?
> > > 2) Is there anyway to go from a SET to an insn?
> > > 3) If not, can I store the insn in table_elt and have cse_insn produce a
> > worklist
> > >    of additional insn that need to be re-examined?
> > 
> > Without being able to comment on RTL or the CSE implementation the issue
> > at hand (optimizing constant generation / placement) doesn't fit CSE well 
> > but
> > it's more a global LCM/PRE problem.
> 
> Hmm that's fair, I can try using PRE.  I initially chose CSE since it already 
> did the majority
> of the work to support PARALLELs already.
> 
> > There's also the issue that while on x86
> > many constants _are_ valid as immediates CSEing them into a register (if one
> > is available!) is still profitable but RTL passes generally propagate / 
> > duplicate
> > them back into the instructions where they are valid (so "fixing" things on
> > GIMPLE generally doesn't work).
> 
> I was going to make this a target hook so the back-end can decide what it 
> wants to do,
> I just didn't do that yet. It would have to be, even for PRE wouldn't it?
> 
> I agree that at GIMPLE it wouldn't work but CSE always runs at RTL no?


Yes.

> > 
> > Also IIRC targets can delegitmize constants late (during reload/LRA) which
> > might cause extra complication.
> 
> True, but doing it post-reloads has the issue that reload has then already 
> chosen a register class,
> which makes it not able to generate the most efficient code anymore.

True - I thought of a pass just before IRA/LRA that splits out constants
from all insns that can bear a register operand in its place placing the
init of the pseudo using LCM dataflow (and doing CSE plus magic for 
dealing with the vector component case).  I think LRA can already
rematerialize a constant in the insn (aka propagate it back) in case
the def of the constant didn't get a hardreg.  One could, after LCM,
do trivial propagation of single-use defs back to the insns as well.

> For this simple case with a vec_dup I can of course fix this by changing the 
> representation at expand time
> from being a vec_dup of a constant to that of a register and shove the 
> constant in the register.
> 
> Which would work for the dups case but not the general case of extracting any 
> element.  I could again change
> The representation to be a concat of a bunch of registers, but at some point 
> they have to go back in.

I think for your case at hand the CSE approach is fine (no comments on the
details) - just that CSE isn't going to be the place to fix all cases
(as you noticed with the second one).

Richard.

> Cheers,
> Tamar
> 
> > 
> > Richard.
> > 
> > > Thanks,
> > > Tamar
> > >
> > > --- inline copy of patch --
> > >
> > >
> > 
> > --
> > Richard Biener <rguent...@suse.de>
> > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> > Nuernberg, Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

RE: [RFC] middle-end: Extend CSE to understand vector extracts.

Reply via email to