> -----Original Message----- > From: Richard Sandiford <richard.sandif...@arm.com> > Sent: Tuesday, August 31, 2021 4:14 PM > To: Tamar Christina <tamar.christ...@arm.com> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw > <richard.earns...@arm.com>; Marcus Shawcroft > <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com> > Subject: Re: [PATCH 2/2]AArch64: Add better costing for vector constants > and operations > > Tamar Christina <tamar.christ...@arm.com> writes: > > @@ -13936,8 +13937,65 @@ cost_plus: > > mode, MULT, 1, speed); > > return true; > > } > > + break; > > + case PARALLEL: > > + /* Fall through */ > > Which code paths lead to getting a PARALLEL here?
Hi, Thanks for the review! I added it for completeness because CSE treats a parallel and CONST_VECTOR as equivalent when they each entry in the parallel defines a constant. > > > + case CONST_VECTOR: > > + { > > + rtx gen_insn = aarch64_simd_make_constant (x, true); > > + /* Not a valid const vector. */ > > + if (!gen_insn) > > + break; > > > > - /* Fall through. */ > > + switch (GET_CODE (gen_insn)) > > + { > > + case CONST_VECTOR: > > + /* Load using MOVI/MVNI. */ > > + if (aarch64_simd_valid_immediate (x, NULL)) > > + *cost += extra_cost->vect.movi; > > + else /* Load using constant pool. */ > > + *cost += extra_cost->ldst.load; > > + break; > > + /* Load using a DUP. */ > > + case VEC_DUPLICATE: > > + *cost += extra_cost->vect.dup; > > + break; > > Does this trigger in practice? The new check==true path (rightly) stops the > duplicated element from being forced into a register, but then I would have > expected: > > rtx > gen_vec_duplicate (machine_mode mode, rtx x) { > if (valid_for_const_vector_p (mode, x)) > return gen_const_vec_duplicate (mode, x); > return gen_rtx_VEC_DUPLICATE (mode, x); } > > to generate the original CONST_VECTOR again. Yes, but CSE is trying to see whether using a DUP is cheaper than another instruction. Normal code won't hit this but CSE is just costing all the different ways one can semantically construct a vector, which RTL actually comes out of it depends on how it's folded as you say. > > > + default: > > + *cost += extra_cost->ldst.load; > > + break; > > + } > > + return true; > > + } > > + case VEC_CONCAT: > > + /* depending on the operation, either DUP or INS. > > + For now, keep default costing. */ > > + break; > > + case VEC_DUPLICATE: > > + *cost += extra_cost->vect.dup; > > + return true; > > + case VEC_SELECT: > > + { > > + /* cost subreg of 0 as free, otherwise as DUP */ > > + rtx op1 = XEXP (x, 1); > > + int nelts; > > + if ((op1 == const0_rtx && !BYTES_BIG_ENDIAN) > > + || (BYTES_BIG_ENDIAN > > + && GET_MODE_NUNITS (mode).is_constant(&nelts) > > + && INTVAL (op1) == nelts - 1)) > > + ; > > + else if (vec_series_lowpart_p (mode, GET_MODE (op1), op1)) > > + ; > > + else if (vec_series_highpart_p (mode, GET_MODE (op1), op1)) > > + /* Selecting the high part is not technically free, but we lack > > + enough information to decide that here. For instance selecting > > + the high-part of a vec_dup *is* free or to feed into any _high > > + instruction. Both of which we can't really tell. That said > > + have a better chance to optimize an dup vs multiple constants. */ > > + ; > > Not sure about this. We already try to detect the latter case (_high > instructions) via aarch64_strip_extend_vec_half. We might be missing some > cases, but that still feels like the right way to go IMO. That's a different problem from what I understand. What this is trying to say is that If you have a vector [x y a b] and you need vector [x y] that you can use the top part of the original vector for this. This is an approximation, because something that can be created with a movi is probably Cheaper to keep distinct if it's not going to be paired with a _high operation (since you will have a dup then). The problem is that the front end has already spit the two Vectors into [x y a b] and [x y]. There's nothing else that tries to consolidate them back up if both survive. As a consequence of this, the testcase test0 is not handled optimally. It would instead create 2 vectors, both of movi 0x3, just one being 64-bits and one being 128-bits. So if the cost of selecting it is cheaper than the movi, cse will not consolidate the vectors, and because movi's are so cheap, the only cost that worked was 0. But increasing the costs of movi's requires the costs of everything to be increased (including loads). I preferred to 0 out the cost, because the worst that can happen is an dup instead of a movi, And at best a dup instead of a load from a pool (if the constant is complicated). > > Selecting the high part of a vec_dup should get folded into another vec_dup. > > The lowpart bits look OK, but which paths call this function without first > simplifying the select to a subreg? The subreg is now the canonical form > (thanks to r12-2288). The simplification will happen during folding in cse or in combine. This costing happens before the folding, When CSE is trying to decide whether to undo the front end's lowering of constants. To do so it models the constants and the semantic operation required to extract them. E.g. to get 2 out of [0 2 4 5] it would need a VEC_SELECT of 1. And I don't treat the first element/bottom part special Here. Costing wise they would be the same. Regards, Tamar > > Thanks, > Richard