Richard Sandiford <richard.sandif...@arm.com> writes:
> Tamar Christina <tamar.christ...@arm.com> writes:
>>> -----Original Message-----
>>> From: Richard Sandiford <richard.sandif...@arm.com>
>>> Sent: Thursday, July 4, 2024 12:46 PM
>>> To: Tamar Christina <tamar.christ...@arm.com>
>>> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
>>> <richard.earns...@arm.com>; Marcus Shawcroft
>>> <marcus.shawcr...@arm.com>; ktkac...@gcc.gnu.org
>>> Subject: Re: [PATCH 1/2]AArch64: make aarch64_simd_vec_unpack<su>_lo_/_hi_
>>> consistent.
>>>
>>> Tamar Christina <tamar.christ...@arm.com> writes:
>>> > Hi All,
>>> >
>>> > The fix for PR18127 reworked the uxtl to zip optimization.
>>> > In doing so it undid the changes in aarch64_simd_vec_unpack<su>_lo_ and
>>> > this now no longer matches aarch64_simd_vec_unpack<su>_hi_.  It still
>>> > works because the RTL generated by aarch64_simd_vec_unpack<su>_lo_
>>> > overlaps with the general zero extend RTL and so, because that one is
>>> > listed before the lo pattern, recog picks it instead.
>>> >
>>> > This just makes aarch64_simd_vec_unpack<su>_lo_ mirror
>>> > aarch64_simd_vec_unpack<su>_hi_ for consistency, so that we're not
>>> > relying on the order of patterns.
>>> >
>>> > Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
>>> >
>>> > Ok for master?
>>> >
>>> > Thanks,
>>> > Tamar
>>> >
>>> > gcc/ChangeLog:
>>> >
>>> > 	* config/aarch64/aarch64-simd.md
>>> > 	(aarch64_simd_vec_unpack<su>_lo_<mode>): Add same split as
>>> > 	aarch64_simd_vec_unpack<su>_hi_.
>>> > 	* config/aarch64/aarch64.cc (aarch64_gen_shareable_zero): Update
>>> > 	comment.
>>> >
>>> > ---
>>> >
>>> > diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
>>> > index 01b084d8ccb53fc0f1c7b0dd8f23546c331f020a..d4026cbf0b66995104e8e40ca294faaf8d5ba847 100644
>>> > --- a/gcc/config/aarch64/aarch64-simd.md
>>> > +++ b/gcc/config/aarch64/aarch64-simd.md
>>> > @@ -1904,7 +1904,7 @@ (define_insn "*aarch64_<srn_op>topbits_shuffle<mode>_be"
>>> >
>>> >  ;; Widening operations.
>>> >
>>> > -(define_insn "aarch64_simd_vec_unpack<su>_lo_<mode>"
>>> > +(define_insn_and_split "aarch64_simd_vec_unpack<su>_lo_<mode>"
>>> >    [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
>>> >          (ANY_EXTEND:<VWIDE> (vec_select:<VHALF>
>>> >                                (match_operand:VQW 1 "register_operand" "w")
>>> > @@ -1912,6 +1912,19 @@ (define_insn "aarch64_simd_vec_unpack<su>_lo_<mode>"
>>> >        )))]
>>> >    "TARGET_SIMD"
>>> >    "<su>xtl\t%0.<Vwtype>, %1.<Vhalftype>"
>>> > +  "&& <CODE> == ZERO_EXTEND
>>> > +   && aarch64_split_simd_shift_p (insn)"
>>> > +  [(const_int 0)]
>>> > +  {
>>> > +    /* On many cores, it is cheaper to implement UXTL using a ZIP1 with zero,
>>> > +       provided that the cost of the zero can be amortized over several
>>> > +       operations.  We'll later recombine the zero and zip if there are
>>> > +       not sufficient uses of the zero to make the split worthwhile.  */
>>> > +    rtx res = simplify_gen_subreg (<MODE>mode, operands[0], <VWIDE>mode, 0);
>>> > +    rtx zero = aarch64_gen_shareable_zero (<MODE>mode);
>>> > +    emit_insn (gen_aarch64_zip1<mode> (res, operands[1], zero));
>>> > +    DONE;
>>> > +  }
>>> >    [(set_attr "type" "neon_shift_imm_long")]
>>> > )
>>>
>>> I think we should just remove the pattern instead, and update users
>>> to generate an extension of the 64-bit lowpart.  That form is now the
>>> canonical one, so the current aarch64_simd_vec_unpack<su>_lo_<mode>
>>> pattern hard-codes unsimplified rtl.
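
(For concreteness, one way for a caller to produce that canonical form
directly, rather than going through the lo pattern, is sketched below.
This is purely illustrative and not part of the patch: `dst' and `src'
are hypothetical V8HI and V16QI registers, and the mode choices are just
an example of what a real user would do with the modes it has to hand.

    /* Zero-extend the low 64 bits of SRC (V16QI) into DST (V8HI),
       little-endian case: take the 64-bit lowpart as a subreg and
       extend that, rather than a vec_select of the low lanes.  */
    rtx lo = lowpart_subreg (V8QImode, src, V16QImode);
    emit_insn (gen_rtx_SET (dst, gen_rtx_ZERO_EXTEND (V8HImode, lo)));

The point is that the extension is written on the 64-bit lowpart subreg,
which is the simplified form discussed below.)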
>>>
>>
>> Ok, but then it's very weird to have _hi still.  The reason for these
>> indirections seems to be to model the hi/lo split for RTL
>> simplifications.
>>
>> The aarch64_simd_vec_unpack<su>_lo models a vec_select as well.  This RTL
>> is different from the one it gets rewritten to (dropping the vec_select).
>>
>> So dropping it means we lose this, and I guess maybe fold-rtx does it now?
>> It's not immediately clear to me that dropping the additional RTL for
>> recog is beneficial.
>
> The principle is that, say:
>
>   (vec_select:V2SI (reg:V2DI R) (parallel [(const_int 0) (const_int 1)]))
>
> is (for little-endian) equivalent to:
>
>   (subreg:V2SI (reg:V2DI R) 0)
Sigh, of course I meant V4SI rather than V2DI in the above :)

> and similarly for the equivalent big-endian pair.  Simplification rules
> are now supposed to ensure that only the second (simpler) form is
> generated by target-independent code.  We should fix any cases where
> that doesn't happen, since it would be a missed optimisation for any
> instructions that take (in this case) V2SI inputs.
>
> There's no equivalent simplification for _hi because it isn't possible
> to refer directly to the upper 64 bits of a 128-bit register using
> subregs.
>
> Thanks,
> Richard
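
(Spelling out that last point with the corrected V4SI example, using
little-endian lane numbering; this is an added illustration rather than
a quote from the thread.  The low half:

    (vec_select:V2SI (reg:V4SI R) (parallel [(const_int 0) (const_int 1)]))

simplifies to:

    (subreg:V2SI (reg:V4SI R) 0)

but the high half:

    (vec_select:V2SI (reg:V4SI R) (parallel [(const_int 2) (const_int 3)]))

stays as it is, because there is no way to refer to those upper 64 bits
of the 128-bit register with a subreg.  Hence _hi keeps the explicit
vec_select form, while _lo no longer needs a dedicated pattern.)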