Richard Sandiford <richard.sandif...@arm.com> writes:
> Tamar Christina <tamar.christ...@arm.com> writes:
>>> -----Original Message-----
>>> From: Richard Sandiford <richard.sandif...@arm.com>
>>> Sent: Thursday, July 4, 2024 12:46 PM
>>> To: Tamar Christina <tamar.christ...@arm.com>
>>> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
>>> <richard.earns...@arm.com>; Marcus Shawcroft
>>> <marcus.shawcr...@arm.com>; ktkac...@gcc.gnu.org
>>> Subject: Re: [PATCH 1/2]AArch64: make aarch64_simd_vec_unpack<su>_lo_/_hi_
>>> consistent.
>>> 
>>> Tamar Christina <tamar.christ...@arm.com> writes:
>>> > Hi All,
>>> >
>>> > The fix for PR18127 reworked the uxtl to zip optimization.
>>> > In doing so it undid the changes in aarch64_simd_vec_unpack<su>_lo_, and
>>> > this now no longer matches aarch64_simd_vec_unpack<su>_hi_.  It still
>>> > works because the RTL generated by aarch64_simd_vec_unpack<su>_lo_
>>> > overlaps with the general zero-extend RTL and so, because that one is
>>> > listed before the lo pattern, recog picks it instead.
>>> >
>>> > This just makes aarch64_simd_vec_unpack<su>_lo_ mirror
>>> > aarch64_simd_vec_unpack<su>_hi_ for consistency, so that we're not
>>> > relying on the order of patterns.
>>> >
>>> > Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
>>> >
>>> > Ok for master?
>>> >
>>> > Thanks,
>>> > Tamar
>>> >
>>> > gcc/ChangeLog:
>>> >
>>> >   * config/aarch64/aarch64-simd.md
>>> >   (aarch64_simd_vec_unpack<su>_lo_<mode>): Add same split as
>>> >   aarch64_simd_vec_unpack<su>_hi_.
>>> >   * config/aarch64/aarch64.cc (aarch64_gen_shareable_zero): Update
>>> >   comment.
>>> >
>>> > ---
>>> >
>>> > diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
>>> > index 01b084d8ccb53fc0f1c7b0dd8f23546c331f020a..d4026cbf0b66995104e8e40ca294faaf8d5ba847 100644
>>> > --- a/gcc/config/aarch64/aarch64-simd.md
>>> > +++ b/gcc/config/aarch64/aarch64-simd.md
>>> > @@ -1904,7 +1904,7 @@ (define_insn "*aarch64_<srn_op>topbits_shuffle<mode>_be"
>>> >
>>> >  ;; Widening operations.
>>> >
>>> > -(define_insn "aarch64_simd_vec_unpack<su>_lo_<mode>"
>>> > +(define_insn_and_split "aarch64_simd_vec_unpack<su>_lo_<mode>"
>>> >    [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
>>> >          (ANY_EXTEND:<VWIDE> (vec_select:<VHALF>
>>> >                          (match_operand:VQW 1 "register_operand" "w")
>>> > @@ -1912,6 +1912,19 @@ (define_insn "aarch64_simd_vec_unpack<su>_lo_<mode>"
>>> >                       )))]
>>> >    "TARGET_SIMD"
>>> >    "<su>xtl\t%0.<Vwtype>, %1.<Vhalftype>"
>>> > +  "&& <CODE> == ZERO_EXTEND
>>> > +   && aarch64_split_simd_shift_p (insn)"
>>> > +  [(const_int 0)]
>>> > +  {
>>> > +    /* On many cores, it is cheaper to implement UXTL using a ZIP1 with zero,
>>> > +       provided that the cost of the zero can be amortized over several
>>> > +       operations.  We'll later recombine the zero and zip if there are
>>> > +       not sufficient uses of the zero to make the split worthwhile.  */
>>> > +    rtx res = simplify_gen_subreg (<MODE>mode, operands[0], <VWIDE>mode, 0);
>>> > +    rtx zero = aarch64_gen_shareable_zero (<MODE>mode);
>>> > +    emit_insn (gen_aarch64_zip1<mode> (res, operands[1], zero));
>>> > +    DONE;
>>> > +  }
>>> >    [(set_attr "type" "neon_shift_imm_long")]
>>> >  )
>>> 
>>> I think we should just remove the pattern instead, and update users
>>> to generate an extension of the 64-bit lowpart.  That form is now the
>>> canonical one, so the current aarch64_simd_vec_unpack<su>_lo_<mode>
>>> pattern hard-codes unsimplified rtl.
>>> 
>>
>> Ok, but then it's very weird to still have _hi.  The reason for these
>> indirections seems to be to model the hi/lo split for RTL simplifications.
>>
>> aarch64_simd_vec_unpack<su>_lo_ models a vec_select as well.  This RTL is
>> different from the one it gets rewritten to (which drops the vec_select).
>>
>> So dropping it means we lose this, and I guess maybe fold-rtx does it now?
>> It's not immediately clear to me that dropping the additional RTL for recog
>> is beneficial.
>
> The principle is that, say:
>
>   (vec_select:V2SI (reg:V2DI R) (parallel [(const_int 0) (const_int 1)]))
>
> is (for little-endian) equivalent to:
>
>   (subreg:V2SI (reg:V2DI R) 0)

Sigh, of course I meant V4SI rather than V2DI in the above :)

> and similarly for the equivalent big-endian pair.  Simplification rules
> are now supposed to ensure that only the second (simpler) form is generated
> by target-independent code.  We should fix any cases where that doesn't
> happen, since it would be a missed optimisation for any instructions
> that take (in this case) V2SI inputs.
>
> There's no equivalent simplification for _hi because it isn't possible
> to refer directly to the upper 64 bits of a 128-bit register using subregs.
>
> Thanks,
> Richard
