> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Tuesday, August 31, 2021 5:07 PM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
> <richard.earns...@arm.com>; Marcus Shawcroft
> <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> Subject: Re: [PATCH 2/2]AArch64: Add better costing for vector constants
> and operations
>
> Tamar Christina <tamar.christ...@arm.com> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandif...@arm.com>
> >> Sent: Tuesday, August 31, 2021 4:14 PM
> >> To: Tamar Christina <tamar.christ...@arm.com>
> >> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
> >> <richard.earns...@arm.com>; Marcus Shawcroft
> >> <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> >> Subject: Re: [PATCH 2/2]AArch64: Add better costing for vector
> >> constants and operations
> >>
> >> Tamar Christina <tamar.christ...@arm.com> writes:
> >> > @@ -13936,8 +13937,65 @@ cost_plus:
> >> >                       mode, MULT, 1, speed);
> >> >        return true;
> >> >      }
> >> > +      break;
> >> > +    case PARALLEL:
> >> > +      /* Fall through */
> >>
> >> Which code paths lead to getting a PARALLEL here?
> >
> > Hi,
> >
> > Thanks for the review!
> >
> > I added it for completeness because CSE treats a PARALLEL and a
> > CONST_VECTOR as equivalent when each entry in the PARALLEL defines
> > a constant.
>
> Could you test whether it ever triggers in practice though?
> The code would be much simpler without it.
Will check :)

> >> > +    case CONST_VECTOR:
> >> > +      {
> >> > +        rtx gen_insn = aarch64_simd_make_constant (x, true);
> >> > +        /* Not a valid const vector.  */
> >> > +        if (!gen_insn)
> >> > +          break;
> >> >
> >> > -      /* Fall through.  */
> >> > +        switch (GET_CODE (gen_insn))
> >> > +        {
> >> > +          case CONST_VECTOR:
> >> > +            /* Load using MOVI/MVNI.  */
> >> > +            if (aarch64_simd_valid_immediate (x, NULL))
> >> > +              *cost += extra_cost->vect.movi;
> >> > +            else /* Load using constant pool.  */
> >> > +              *cost += extra_cost->ldst.load;
> >> > +            break;
> >> > +          /* Load using a DUP.  */
> >> > +          case VEC_DUPLICATE:
> >> > +            *cost += extra_cost->vect.dup;
> >> > +            break;
> >>
> >> Does this trigger in practice?  The new check==true path (rightly)
> >> stops the duplicated element from being forced into a register, but
> >> then I would have expected:
> >>
> >> rtx
> >> gen_vec_duplicate (machine_mode mode, rtx x)
> >> {
> >>   if (valid_for_const_vector_p (mode, x))
> >>     return gen_const_vec_duplicate (mode, x);
> >>   return gen_rtx_VEC_DUPLICATE (mode, x);
> >> }
> >>
> >> to generate the original CONST_VECTOR again.
> >
> > Yes, but CSE is trying to see whether using a DUP is cheaper than
> > another instruction.  Normal code won't hit this, but CSE is just
> > costing all the different ways one can semantically construct a
> > vector; which RTL actually comes out of it depends on how it's
> > folded, as you say.
>
> But what I mean is, you call:
>
>   rtx gen_insn = aarch64_simd_make_constant (x, true);
>   /* Not a valid const vector.  */
>   if (!gen_insn)
>     break;
>
> where aarch64_simd_make_constant does:
>
>   if (CONST_VECTOR_P (vals))
>     const_vec = vals;
>   else if (GET_CODE (vals) == PARALLEL)
>     {
>       /* A CONST_VECTOR must contain only CONST_INTs and
>          CONST_DOUBLEs, but CONSTANT_P allows more (e.g. SYMBOL_REF).
>          Only store valid constants in a CONST_VECTOR.
>       */
>       int n_elts = XVECLEN (vals, 0);
>       for (i = 0; i < n_elts; ++i)
>         {
>           rtx x = XVECEXP (vals, 0, i);
>           if (CONST_INT_P (x) || CONST_DOUBLE_P (x))
>             n_const++;
>         }
>       if (n_const == n_elts)
>         const_vec = gen_rtx_CONST_VECTOR (mode, XVEC (vals, 0));
>     }
>   else
>     gcc_unreachable ();
>
>   if (const_vec != NULL_RTX
>       && aarch64_simd_valid_immediate (const_vec, NULL))
>     /* Load using MOVI/MVNI.  */
>     return const_vec;
>   else if ((const_dup = aarch64_simd_dup_constant (vals, check)) != NULL_RTX)
>     /* Loaded using DUP.  */
>     return const_dup;
>
> and aarch64_simd_dup_constant does:
>
>   machine_mode mode = GET_MODE (vals);
>   machine_mode inner_mode = GET_MODE_INNER (mode);
>   rtx x;
>
>   if (!const_vec_duplicate_p (vals, &x))
>     return NULL_RTX;
>
>   /* We can load this constant by using DUP and a constant in a
>      single ARM register.  This will be cheaper than a vector
>      load.  */
>   if (!check)
>     x = copy_to_mode_reg (inner_mode, x);
>   return gen_vec_duplicate (mode, x);
>
> For the "check" case, "x" will be a constant, and so gen_vec_duplicate
> will call gen_const_vec_duplicate, which will return a CONST_VECTOR.
> It didn't seem to be possible for gen_insn to be a VEC_DUPLICATE.

Yes, but CSE can ask the cost of a VEC_DUPLICATE directly on a register
without going through gen_const_vec_duplicate.  That is intentional, as
the gen_ functions can have side effects (e.g. creating new pseudos
etc.).

If, say, it sees a constant x and a vector [x x x x], it wants to know
what the cost of keeping x and materializing [x x x x] is, vs doing a
duplicate of x into [x x x x].  In this case, since both the constant
and the vector are needed, you won't get a constant there but a
register, so you'll actually see a vec_dup.  If CSE pushed in the
constant, that would defeat the point :).

Right now it's CSE that's pushing constants of vec_dups into
vec_constants.
My change makes it explicitly ask for the cost of doing this instead of
assuming it's always cheaper, because for a large majority of cases it's
not actually cheaper, and it is highly dependent on the target's ability
to create said constant.

So this hook will see both versions, the dup of the register and the
vec_constant, while CSE is trying to decide which one to keep.

> This would be much simpler if we could call aarch64_simd_valid_immediate
> and aarch64_simd_dup_constant directly from the rtx cost code,

Agreed... I tried to separate them before, but the logic was annoying to
split and I thought it not worth the effort, so instead I just changed
them to have a checking-only mode.

> hence the
> question about whether the PARALLEL stuff was really needed in practice.
>
> >> > +          default:
> >> > +            *cost += extra_cost->ldst.load;
> >> > +            break;
> >> > +        }
> >> > +        return true;
> >> > +      }
> >> > +    case VEC_CONCAT:
> >> > +      /* depending on the operation, either DUP or INS.
> >> > +         For now, keep default costing.  */
> >> > +      break;
> >> > +    case VEC_DUPLICATE:
> >> > +      *cost += extra_cost->vect.dup;
> >> > +      return true;
> >> > +    case VEC_SELECT:
> >> > +      {
> >> > +        /* cost subreg of 0 as free, otherwise as DUP */
> >> > +        rtx op1 = XEXP (x, 1);
> >> > +        int nelts;
> >> > +        if ((op1 == const0_rtx && !BYTES_BIG_ENDIAN)
> >> > +            || (BYTES_BIG_ENDIAN
> >> > +                && GET_MODE_NUNITS (mode).is_constant (&nelts)
> >> > +                && INTVAL (op1) == nelts - 1))
> >> > +          ;
> >> > +        else if (vec_series_lowpart_p (mode, GET_MODE (op1), op1))
> >> > +          ;
> >> > +        else if (vec_series_highpart_p (mode, GET_MODE (op1), op1))
> >> > +          /* Selecting the high part is not technically free, but we lack
> >> > +             enough information to decide that here.  For instance
> >> > +             selecting the high-part of a vec_dup *is* free, as is
> >> > +             feeding it into any _high instruction.  Both of which we
> >> > +             can't really tell.
> >> > +             That said, we have a better chance to optimize a dup
> >> > +             vs multiple constants.  */
> >> > +          ;
> >>
> >> Not sure about this.  We already try to detect the latter case (_high
> >> instructions) via aarch64_strip_extend_vec_half.  We might be missing
> >> some cases, but that still feels like the right way to go IMO.
> >
> > That's a different problem from what I understand.  What this is
> > trying to say is that if you have a vector [x y a b] and you need the
> > vector [x y], you can use the top part of the original vector for
> > this.
> >
> > This is an approximation, because something that can be created with a
> > movi is probably cheaper to keep distinct if it's not going to be
> > paired with a _high operation (since you will have a dup then).
> >
> > The problem is that the front end has already split the two vectors
> > into [x y a b] and [x y].  There's nothing else that tries to
> > consolidate them back up if both survive.
> >
> > As a consequence of this, the testcase test0 is not handled optimally.
> > It would instead create 2 vectors, both of movi 0x3, just one being
> > 64-bits and one being 128-bits.
> >
> > So if the cost of selecting it is cheaper than the movi, cse will not
> > consolidate the vectors, and because movi's are so cheap, the only
> > cost that worked was 0.  But increasing the cost of movi's would
> > require the costs of everything to be increased (including loads).
> >
> > I preferred to zero out the cost, because the worst that can happen is
> > a dup instead of a movi, and at best a dup instead of a load from a
> > constant pool (if the constant is complicated).
>
> Hmm, will need to look at this more tomorrow.
>
> >> Selecting the high part of a vec_dup should get folded into another
> >> vec_dup.
> >>
> >> The lowpart bits look OK, but which paths call this function without
> >> first simplifying the select to a subreg?  The subreg is now the
> >> canonical form (thanks to r12-2288).
> >
> > The simplification will happen during folding in cse or in combine.
> > This costing happens before the folding, when CSE is trying to decide
> > whether to undo the front end's lowering of constants.
> >
> > To do so it models the constants and the semantic operation required
> > to extract them.  E.g. to get 2 out of [0 2 4 5] it would need a
> > VEC_SELECT of element 1.  And I don't treat the first element/bottom
> > part specially here.  Costing-wise they would be the same.
>
> But which code path creates the VEC_SELECT?  We don't need any context
> to know that the VEC_SELECT is non-canonical.  It's obvious from the
> operands of the VEC_SELECT in isolation.

The non-canonical RTL is never generated.  I assume we're talking about
the 0 case here, since subregs can't select arbitrary elements (as I
asked before).

For the 0 case it's only temporarily modelled as such, to keep the CSE
alternative costing simple.  Currently it's just a for loop for
i = 0 to vec_elems.  When it comes time to generate the actual insn,
fold_rtx is called, which will fold the VEC_SELECT into a subreg.  So
it's never emitted into the instruction stream in its non-canonical
form.

> I'd just rather tackle this at source than try to get the cost code to
> handle non-canonical rtl.

If that's what is preferred I can change the CSE patch to generate a
subreg for the 0 case.  I'm not sure I agree with it, as CSE is just
trying to ask "what is the cost of selecting element 0 in this vector",
and as I mentioned before, it never emits the instruction unfolded.
This seems to me a more logical representation for costing.

It's however unfortunate that there's only one costing callback; as far
as CSE is concerned the representation/form doesn't matter, it's just
looking at the high-level operation.

Or is the concern here that most targets will have costing for subreg 0
but not VEC_SELECT?
In which case, without actually handling the costs of the other
operations, the CSE changes won't do anything for those targets anyway.
And it would be odd for a target to cost VEC_SELECT of elements 1 to <N>
without also costing element 0.

Regards,
Tamar

> Thanks,
> Richard