On Mon, 4 Jan 2021, Tamar Christina wrote: > Hi Richi, > > > -----Original Message----- > > From: Richard Biener <rguent...@suse.de> > > Sent: Monday, January 4, 2021 1:33 PM > > To: Tamar Christina <tamar.christ...@arm.com> > > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; i...@airs.com; > > l...@redhat.com > > Subject: Re: [RFC] middle-end: Extend CSE to understand vector extracts. > > > > On Mon, 4 Jan 2021, Tamar Christina wrote: > > > > > Hi All, > > > > > > I am trying to get CSE to re-use constants already inside a vector > > > rather than re-materializing the constant again. > > > > > > Basically consider the following case: > > > > > > #include <stdint.h> > > > #include <arm_neon.h> > > > > > > uint64_t > > > test (uint64_t a, uint64x2_t b, uint64x2_t* rt) { > > > uint64_t arr[2] = { 0x0942430810234076UL, 0x0942430810234076UL}; > > > uint64_t res = a | arr[0]; > > > uint64x2_t val = vld1q_u64 (arr); > > > *rt = vaddq_u64 (val, b); > > > return res; > > > } > > > > > > The actual behavior is inconsequential however notice that the same > > > constants are used in the vector (arr and later val) and in the > > > calculation of > > res. > > > > > > The code we generate for this however is quite sub-optimal: > > > > > > test: > > > adrp x2, .LC0 > > > sub sp, sp, #16 > > > ldr q1, [x2, #:lo12:.LC0] > > > mov x2, 16502 > > > movk x2, 0x1023, lsl 16 > > > movk x2, 0x4308, lsl 32 > > > add v1.2d, v1.2d, v0.2d > > > movk x2, 0x942, lsl 48 > > > orr x0, x0, x2 > > > str q1, [x1] > > > add sp, sp, 16 > > > ret > > > .LC0: > > > .xword 667169396713799798 > > > .xword 667169396713799798 > > > > > > Essentially we materialize the same constant twice. The reason for > > > this is because the front-end lowers the constant extracted from arr[0] > > quite early on. > > > If you look into the result of fre you'll find > > > > > > <bb 2> : > > > arr[0] = 667169396713799798; > > > arr[1] = 667169396713799798; > > > res_7 = a_6(D) | 667169396713799798; > > > _16 = __builtin_aarch64_ld1v2di (&arr); > > > _17 = VIEW_CONVERT_EXPR<uint64x2_t>(_16); > > > _11 = b_10(D) + _17; > > > *rt_12(D) = _11; > > > arr ={v} {CLOBBER}; > > > return res_7; > > > > > > Which makes sense for further optimization. However come expand time > > > if the constant isn't representable in the target arch it will be > > > assigned to a register again. > > > > > > (insn 8 5 9 2 (set (reg:V2DI 99) > > > (const_vector:V2DI [ > > > (const_int 667169396713799798 [0x942430810234076]) > > > repeated x2 > > > ])) "cse.c":7:12 -1 > > > (nil)) > > > ... > > > (insn 14 13 15 2 (set (reg:DI 103) > > > (const_int 667169396713799798 [0x942430810234076])) "cse.c":8:12 > > > -1 > > > (nil)) > > > (insn 15 14 16 2 (set (reg:DI 102 [ res ]) > > > (ior:DI (reg/v:DI 96 [ a ]) > > > (reg:DI 103))) "cse.c":8:12 -1 > > > (nil)) > > > > > > And since it's out of the immediate range of the scalar instruction > > > used combine won't be able to do anything here. > > > > > > This will then trigger the re-materialization of the constant twice. > > > > > > So I figured the best place to handle this is in CSE since in some > > > uArch it's far cheaper to extract a constant from a vector than to > > > materialize > > it. > > > > > > Particularly doing it pre-RA has the benefit of allowing RA to decide > > > whether it needs to move the constant between register files or not as > > > some uArch can perform scalar operation both on the SIMD and GENREG > > side. > > > > > > The issue is I don't know that much about CSE. I have been reading > > > through the source and think I have a basic understanding of how it > > > works but this email is to see if I'm on the right track or not (to > > > something that is acceptable upstream). > > > > > > My current patch for CSE is: > > > > > > diff --git a/gcc/cse.c b/gcc/cse.c > > > index 36bcfc354d8..3cee53bed85 100644 > > > --- a/gcc/cse.c > > > +++ b/gcc/cse.c > > > @@ -43,6 +43,7 @@ along with GCC; see the file COPYING3. If not see > > > #include "rtl-iter.h" > > > #include "regs.h" > > > #include "function-abi.h" > > > +#include "expr.h" > > > > > > /* The basic idea of common subexpression elimination is to go > > > through the code, keeping a record of expressions that would @@ > > > -4306,6 +4307,20 @@ find_sets_in_insn (rtx_insn *insn, struct set **psets) > > > someplace else, so it isn't worth cse'ing. */ > > > else if (GET_CODE (SET_SRC (x)) == CALL) > > > ; > > > + else if (GET_CODE (SET_SRC (x)) == CONST_VECTOR) > > > + { > > > + /* First register the vector itself. */ > > > + sets[n_sets++].rtl = x; > > > + rtx src = SET_SRC (x); > > > + machine_mode elem_mode = GET_MODE_INNER (GET_MODE (src)); > > > + /* Go over the constants of the CONST_VECTOR in forward order, > > > to > > > + put them in the same order in the SETS array. */ > > > + for (unsigned i = 0; i < const_vector_encoded_nelts (src) ; i++) > > > + { > > > + rtx y = gen_rtx_SUBREG (elem_mode, SET_DEST (x), i); > > > + sets[n_sets++].rtl = PATTERN (gen_move_insn (y, > > CONST_VECTOR_ELT (src, i))); > > > + } > > > + } > > > else > > > sets[n_sets++].rtl = x; > > > } > > > @@ -4545,7 +4560,14 @@ cse_insn (rtx_insn *insn) > > > struct set *sets = (struct set *) 0; > > > > > > if (GET_CODE (x) == SET) > > > - sets = XALLOCA (struct set); > > > + { > > > + /* For CONST_VECTOR we wants to be able to CSE the vector itself > > along with > > > + elements inside the vector if the target says it's cheap. */ > > > + if (GET_CODE (SET_SRC (x)) == CONST_VECTOR) > > > + sets = XALLOCAVEC (struct set, const_vector_encoded_nelts > > (SET_SRC (x)) + 1); > > > + else > > > + sets = XALLOCA (struct set); > > > + } > > > else if (GET_CODE (x) == PARALLEL) > > > sets = XALLOCAVEC (struct set, XVECLEN (x, 0)); > > > > > > -- > > > > > > This extends the sets that CSE uses to perform CSE to not only contain > > > the CONST_VECTOR but also the individual elements of the vector. > > > > > > For each element I generate new RTL which models them as a constant > > > being set into a subreg of the original vector at the index of the > > > element in > > the vector. > > > > > > This so that the SRC is the constant we want to CSE and DEST contains > > > the SUBREG to extract from the vector. > > > > > > It works as expected, the testcase above generates: > > > > > > test: > > > adrp x2, .LC0 > > > sub sp, sp, #16 > > > ldr q1, [x2, #:lo12:.LC0] > > > add v0.2d, v1.2d, v0.2d > > > fmov x2, d1 > > > str q0, [x1] > > > orr x0, x0, x2 > > > add sp, sp, 16 > > > ret > > > .LC0: > > > .xword 667169396713799798 > > > .xword 667169396713799798 > > > > > > The problem is that this is somewhat accidental. CSE is single pass, > > > presumably because it currently only tracks SETs of constants where > > > any of the duplicates can be replaced by any alternative (it does pick > > > the cheapest, but all the alternatives are valid.). > > > > > > This breaks with vectors because vectors can only be used as a SRC. > > > The code does validate that the resulting CSE is valid, so this does not > > > break. > > > > > > but if the INSN are flipped in RTL: > > > > > > (insn 14 13 15 2 (set (reg:DI 103) > > > (const_int 667169396713799798 [0x942430810234076])) "cse.c":8:12 > > > -1 > > > (nil)) > > > ... > > > (insn 8 5 9 2 (set (reg:V2DI 99) > > > (const_vector:V2DI [ > > > (const_int 667169396713799798 [0x942430810234076]) > > > repeated x2 > > > ])) "cse.c":7:12 -1 > > > (nil)) > > > > > > This no longer works, because it sees the constant version in insn 14 > > > before it sees insn 8. When we find insn 8 we can tell that there is > > > an instruction that can be replaced by insn 8, but we don't know the > > > original insn and so as a consequence we can't update it. > > > > > > so questions: > > > > > > 1) Does what I'm doing make sense? > > > 2) Is there anyway to go from a SET to an insn? > > > 3) If not, can I store the insn in table_elt and have cse_insn produce a > > worklist > > > of additional insn that need to be re-examined? > > > > Without being able to comment on RTL or the CSE implementation the issue > > at hand (optimizing constant generation / placement) doesn't fit CSE well > > but > > it's more a global LCM/PRE problem. > > Hmm that's fair, I can try using PRE. I initially chose CSE since it already > did the majority > of the work to support PARALLELs already. > > > There's also the issue that while on x86 > > many constants _are_ valid as immediates CSEing them into a register (if one > > is available!) is still profitable but RTL passes generally propagate / > > duplicate > > them back into the instructions where they are valid (so "fixing" things on > > GIMPLE generally doesn't work). > > I was going to make this a target hook so the back-end can decide what it > wants to do, > I just didn't do that yet. It would have to be, even for PRE wouldn't it? > > I agree that at GIMPLE it wouldn't work but CSE always runs at RTL no?
Yes. > > > > Also IIRC targets can delegitmize constants late (during reload/LRA) which > > might cause extra complication. > > True, but doing it post-reloads has the issue that reload has then already > chosen a register class, > which makes it not able to generate the most efficient code anymore. True - I thought of a pass just before IRA/LRA that splits out constants from all insns that can bear a register operand in its place placing the init of the pseudo using LCM dataflow (and doing CSE plus magic for dealing with the vector component case). I think LRA can already rematerialize a constant in the insn (aka propagate it back) in case the def of the constant didn't get a hardreg. One could, after LCM, do trivial propagation of single-use defs back to the insns as well. > For this simple case with a vec_dup I can of course fix this by changing the > representation at expand time > from being a vec_dup of a constant to that of a register and shove the > constant in the register. > > Which would work for the dups case but not the general case of extracting any > element. I could again change > The representation to be a concat of a bunch of registers, but at some point > they have to go back in. I think for your case at hand the CSE approach is fine (no comments on the details) - just that CSE isn't going to be the place to fix all cases (as you noticed with the second one). Richard. > Cheers, > Tamar > > > > > Richard. > > > > > Thanks, > > > Tamar > > > > > > --- inline copy of patch -- > > > > > > > > > > -- > > Richard Biener <rguent...@suse.de> > > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 > > Nuernberg, Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg) > -- Richard Biener <rguent...@suse.de> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)