Christophe Lyon <christophe.l...@linaro.org> writes:
> On Tue, 8 Jun 2021 at 14:10, Richard Sandiford
> <richard.sandif...@arm.com> wrote:
>>
>> Christophe Lyon <christophe.l...@linaro.org> writes:
>> > This patch adds vec_unpack<US>_hi_<mode>, vec_unpack<US>_lo_<mode>,
>> > vec_pack_trunc_<mode> patterns for MVE.
>> >
>> > It does so by moving the unpack patterns from neon.md to
>> > vec-common.md, while adding them support for MVE. The pack expander is
>> > derived from the Neon one (which in turn is renamed into
>> > neon_quad_vec_pack_trunc_<mode>).
>> >
>> > The patch introduces mve_vec_pack_trunc_<mode> to avoid the need for a
>> > zero-initialized temporary, which is needed if the
>> > vec_pack_trunc_<mode> expander calls @mve_vmovn[bt]q_<supf><mode>
>> > instead.
>> >
>> > With this patch, we can now vectorize the 16 and 8-bit versions of
>> > vclz and vshl, although the generated code could still be improved.
>> > For test_clz_s16, we now generate
>> >         vldrh.16        q3, [r1]
>> >         vmovlb.s16   q2, q3
>> >         vmovlt.s16   q3, q3
>> >         vclz.i32  q2, q2
>> >         vclz.i32  q3, q3
>> >         vmovnb.i32      q1, q2
>> >         vmovnt.i32      q1, q3
>> >         vstrh.16        q1, [r0]
>> > which could be improved to
>> >         vldrh.16        q3, [r1]
>> >       vclz.i16        q1, q3
>> >         vstrh.16        q1, [r0]
>> > if we could avoid the need for unpack/pack steps.
>>
>> Yeah, there was a PR about fixing this for popcount.  I guess the same
>> approach would apply here too.
>>
>> > For reference, clang-12 generates:
>> >       vldrh.s32       q0, [r1]
>> >       vldrh.s32       q1, [r1, #8]
>> >       vclz.i32        q0, q0
>> >       vstrh.32        q0, [r0]
>> >       vclz.i32        q0, q1
>> >       vstrh.32        q0, [r0, #8]
>> >
>> > 2021-06-03  Christophe Lyon  <christophe.l...@linaro.org>
>> >
>> >       gcc/
>> >       * config/arm/mve.md (mve_vmovltq_<supf><mode>): Prefix with '@'.
>> >       (mve_vmovlbq_<supf><mode>): Likewise.
>> >       (mve_vmovnbq_<supf><mode>): Likewise.
>> >       (mve_vmovntq_<supf><mode>): Likewise.
>> >       (@mve_vec_pack_trunc_<mode>): New pattern.
>> >       * config/arm/neon.md (vec_unpack<US>_hi_<mode>): Move to
>> >       vec-common.md.
>> >       (vec_unpack<US>_lo_<mode>): Likewise.
>> >       (vec_pack_trunc_<mode>): Rename to
>> >       neon_quad_vec_pack_trunc_<mode>.
>> >       * config/arm/vec-common.md (vec_unpack<US>_hi_<mode>): New
>> >       pattern.
>> >       (vec_unpack<US>_lo_<mode>): New.
>> >       (vec_pack_trunc_<mode>): New.
>> >
>> >       gcc/testsuite/
>> >       * gcc.target/arm/simd/mve-vclz.c: Update expected results.
>> >       * gcc.target/arm/simd/mve-vshl.c: Likewise.
>> > ---
>> >  gcc/config/arm/mve.md                        | 20 ++++-
>> >  gcc/config/arm/neon.md                       | 39 +--------
>> >  gcc/config/arm/vec-common.md                 | 89 ++++++++++++++++++++
>> >  gcc/testsuite/gcc.target/arm/simd/mve-vclz.c |  7 +-
>> >  gcc/testsuite/gcc.target/arm/simd/mve-vshl.c |  5 +-
>> >  5 files changed, 114 insertions(+), 46 deletions(-)
>> >
>> > diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
>> > index 99e46d0bc69..b18292c07d3 100644
>> > --- a/gcc/config/arm/mve.md
>> > +++ b/gcc/config/arm/mve.md
>> > @@ -510,7 +510,7 @@ (define_insn "mve_vrev32q_<supf><mode>"
>> >  ;;
>> >  ;; [vmovltq_u, vmovltq_s])
>> >  ;;
>> > -(define_insn "mve_vmovltq_<supf><mode>"
>> > +(define_insn "@mve_vmovltq_<supf><mode>"
>> >    [
>> >     (set (match_operand:<V_double_width> 0 "s_register_operand" "=w")
>> >       (unspec:<V_double_width> [(match_operand:MVE_3 1 
>> > "s_register_operand" "w")]
>> > @@ -524,7 +524,7 @@ (define_insn "mve_vmovltq_<supf><mode>"
>> >  ;;
>> >  ;; [vmovlbq_s, vmovlbq_u])
>> >  ;;
>> > -(define_insn "mve_vmovlbq_<supf><mode>"
>> > +(define_insn "@mve_vmovlbq_<supf><mode>"
>> >    [
>> >     (set (match_operand:<V_double_width> 0 "s_register_operand" "=w")
>> >       (unspec:<V_double_width> [(match_operand:MVE_3 1 
>> > "s_register_operand" "w")]
>> > @@ -2187,7 +2187,7 @@ (define_insn "mve_vmlsldavxq_s<mode>"
>> >  ;;
>> >  ;; [vmovnbq_u, vmovnbq_s])
>> >  ;;
>> > -(define_insn "mve_vmovnbq_<supf><mode>"
>> > +(define_insn "@mve_vmovnbq_<supf><mode>"
>> >    [
>> >     (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w")
>> >       (unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 
>> > "s_register_operand" "0")
>> > @@ -2202,7 +2202,7 @@ (define_insn "mve_vmovnbq_<supf><mode>"
>> >  ;;
>> >  ;; [vmovntq_s, vmovntq_u])
>> >  ;;
>> > -(define_insn "mve_vmovntq_<supf><mode>"
>> > +(define_insn "@mve_vmovntq_<supf><mode>"
>> >    [
>> >     (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w")
>> >       (unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 
>> > "s_register_operand" "0")
>> > @@ -2214,6 +2214,18 @@ (define_insn "mve_vmovntq_<supf><mode>"
>> >    [(set_attr "type" "mve_move")
>> >  ])
>> >
>> > +(define_insn "@mve_vec_pack_trunc_<mode>"
>> > + [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w")
>> > +       (vec_concat:<V_narrow_pack>
>> > +             (truncate:<V_narrow>
>> > +                     (match_operand:MVE_5 1 "register_operand" "w"))
>> > +             (truncate:<V_narrow>
>> > +                     (match_operand:MVE_5 2 "register_operand" "w"))))]
>> > + "TARGET_HAVE_MVE"
>> > + "vmovnb.i<V_sz_elem>        %q0, %q1\;vmovnt.i<V_sz_elem>   %q0, %q2"
>> > +  [(set_attr "type" "mve_move")]
>> > +)
>> > +
>>
>> I realise this is (like you say) based on the neon.md pattern,
>> but we should use separate vmovnb and vmovnt instructions instead
>> of putting two instructions into a single pattern.
>>
>> One specific advantage to using separate patterns is that it would
>> avoid the imprecision of the earlyclobber: the output only conflicts
>> with operand 1 and can be tied to operand 2.
>>
> I'm not sure to follow: with the current patch, for test_clz_s16 we generate:
>         vldrh.16        q3, [r1]
>         vmovlb.s16   q2, q3
>         vmovlt.s16   q3, q3
>         vclz.i32  q2, q2
>         vclz.i32  q3, q3
>         vmovnb.i32      q1, q2
>         vmovnt.i32      q1, q3
>         vstrh.16        q1, [r0]
>
> If I remove the define_insn "@mve_vec_pack_trunc_<mode>" pattern
> and update define_expand "vec_pack_trunc_<mode>" to call:
> emit_insn (gen_mve_vmovnbq (VMOVNBQ_S, <MODE>mode, operands[0],
> operands[0], operands[1]));
> emit_insn (gen_mve_vmovntq (VMOVNTQ_S, <MODE>mode, operands[0],
> operands[0], operands[2]));
> instead, we now generate:
>         vldrh.16        q3, [r1]
>         vmovlb.s16   q1, q3
>         vldr.64 d4, .L5
>         vldr.64 d5, .L5+8
>         vmovlt.s16   q3, q3
>         vclz.i32  q1, q1
>         vclz.i32  q3, q3
>         vmovnb.i32      q2, q1
>         vmovnt.i32      q2, q3
>         vstrh.16        q2, [r0]
>         bx      lr
> .L6:
>         .align  3
> .L5:
>         .short  0
>         .short  0
>         .short  0
>         .short  0
>         .short  0
>         .short  0
>         .short  0
>         .short  0
>
> That's what I tried to explain in the commit message.

Sorry, I missed that part of the message.  If we want the first
instruction to have an in/out operand whose initial value is undefined
then I think the best way would be to have a second pattern for vmovnb
(only) that acts like a unary operation.

FWIW, in the above, we shouldn't assign to operands[0] twice.
The result of the first operation needs to be a new temporary register.

Thanks,
Richard

Reply via email to