Christophe Lyon <christophe.l...@linaro.org> writes: > On Tue, 8 Jun 2021 at 14:10, Richard Sandiford > <richard.sandif...@arm.com> wrote: >> >> Christophe Lyon <christophe.l...@linaro.org> writes: >> > This patch adds vec_unpack<US>_hi_<mode>, vec_unpack<US>_lo_<mode>, >> > vec_pack_trunc_<mode> patterns for MVE. >> > >> > It does so by moving the unpack patterns from neon.md to >> > vec-common.md, while adding them support for MVE. The pack expander is >> > derived from the Neon one (which in turn is renamed into >> > neon_quad_vec_pack_trunc_<mode>). >> > >> > The patch introduces mve_vec_pack_trunc_<mode> to avoid the need for a >> > zero-initialized temporary, which is needed if the >> > vec_pack_trunc_<mode> expander calls @mve_vmovn[bt]q_<supf><mode> >> > instead. >> > >> > With this patch, we can now vectorize the 16 and 8-bit versions of >> > vclz and vshl, although the generated code could still be improved. >> > For test_clz_s16, we now generate >> > vldrh.16 q3, [r1] >> > vmovlb.s16 q2, q3 >> > vmovlt.s16 q3, q3 >> > vclz.i32 q2, q2 >> > vclz.i32 q3, q3 >> > vmovnb.i32 q1, q2 >> > vmovnt.i32 q1, q3 >> > vstrh.16 q1, [r0] >> > which could be improved to >> > vldrh.16 q3, [r1] >> > vclz.i16 q1, q3 >> > vstrh.16 q1, [r0] >> > if we could avoid the need for unpack/pack steps. >> >> Yeah, there was a PR about fixing this for popcount. I guess the same >> approach would apply here too. >> >> > For reference, clang-12 generates: >> > vldrh.s32 q0, [r1] >> > vldrh.s32 q1, [r1, #8] >> > vclz.i32 q0, q0 >> > vstrh.32 q0, [r0] >> > vclz.i32 q0, q1 >> > vstrh.32 q0, [r0, #8] >> > >> > 2021-06-03 Christophe Lyon <christophe.l...@linaro.org> >> > >> > gcc/ >> > * config/arm/mve.md (mve_vmovltq_<supf><mode>): Prefix with '@'. >> > (mve_vmovlbq_<supf><mode>): Likewise. >> > (mve_vmovnbq_<supf><mode>): Likewise. >> > (mve_vmovntq_<supf><mode>): Likewise. >> > (@mve_vec_pack_trunc_<mode>): New pattern. >> > * config/arm/neon.md (vec_unpack<US>_hi_<mode>): Move to >> > vec-common.md. >> > (vec_unpack<US>_lo_<mode>): Likewise. >> > (vec_pack_trunc_<mode>): Rename to >> > neon_quad_vec_pack_trunc_<mode>. >> > * config/arm/vec-common.md (vec_unpack<US>_hi_<mode>): New >> > pattern. >> > (vec_unpack<US>_lo_<mode>): New. >> > (vec_pack_trunc_<mode>): New. >> > >> > gcc/testsuite/ >> > * gcc.target/arm/simd/mve-vclz.c: Update expected results. >> > * gcc.target/arm/simd/mve-vshl.c: Likewise. >> > --- >> > gcc/config/arm/mve.md | 20 ++++- >> > gcc/config/arm/neon.md | 39 +-------- >> > gcc/config/arm/vec-common.md | 89 ++++++++++++++++++++ >> > gcc/testsuite/gcc.target/arm/simd/mve-vclz.c | 7 +- >> > gcc/testsuite/gcc.target/arm/simd/mve-vshl.c | 5 +- >> > 5 files changed, 114 insertions(+), 46 deletions(-) >> > >> > diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md >> > index 99e46d0bc69..b18292c07d3 100644 >> > --- a/gcc/config/arm/mve.md >> > +++ b/gcc/config/arm/mve.md >> > @@ -510,7 +510,7 @@ (define_insn "mve_vrev32q_<supf><mode>" >> > ;; >> > ;; [vmovltq_u, vmovltq_s]) >> > ;; >> > -(define_insn "mve_vmovltq_<supf><mode>" >> > +(define_insn "@mve_vmovltq_<supf><mode>" >> > [ >> > (set (match_operand:<V_double_width> 0 "s_register_operand" "=w") >> > (unspec:<V_double_width> [(match_operand:MVE_3 1 >> > "s_register_operand" "w")] >> > @@ -524,7 +524,7 @@ (define_insn "mve_vmovltq_<supf><mode>" >> > ;; >> > ;; [vmovlbq_s, vmovlbq_u]) >> > ;; >> > -(define_insn "mve_vmovlbq_<supf><mode>" >> > +(define_insn "@mve_vmovlbq_<supf><mode>" >> > [ >> > (set (match_operand:<V_double_width> 0 "s_register_operand" "=w") >> > (unspec:<V_double_width> [(match_operand:MVE_3 1 >> > "s_register_operand" "w")] >> > @@ -2187,7 +2187,7 @@ (define_insn "mve_vmlsldavxq_s<mode>" >> > ;; >> > ;; [vmovnbq_u, vmovnbq_s]) >> > ;; >> > -(define_insn "mve_vmovnbq_<supf><mode>" >> > +(define_insn "@mve_vmovnbq_<supf><mode>" >> > [ >> > (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w") >> > (unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 >> > "s_register_operand" "0") >> > @@ -2202,7 +2202,7 @@ (define_insn "mve_vmovnbq_<supf><mode>" >> > ;; >> > ;; [vmovntq_s, vmovntq_u]) >> > ;; >> > -(define_insn "mve_vmovntq_<supf><mode>" >> > +(define_insn "@mve_vmovntq_<supf><mode>" >> > [ >> > (set (match_operand:<V_narrow_pack> 0 "s_register_operand" "=w") >> > (unspec:<V_narrow_pack> [(match_operand:<V_narrow_pack> 1 >> > "s_register_operand" "0") >> > @@ -2214,6 +2214,18 @@ (define_insn "mve_vmovntq_<supf><mode>" >> > [(set_attr "type" "mve_move") >> > ]) >> > >> > +(define_insn "@mve_vec_pack_trunc_<mode>" >> > + [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w") >> > + (vec_concat:<V_narrow_pack> >> > + (truncate:<V_narrow> >> > + (match_operand:MVE_5 1 "register_operand" "w")) >> > + (truncate:<V_narrow> >> > + (match_operand:MVE_5 2 "register_operand" "w"))))] >> > + "TARGET_HAVE_MVE" >> > + "vmovnb.i<V_sz_elem> %q0, %q1\;vmovnt.i<V_sz_elem> %q0, %q2" >> > + [(set_attr "type" "mve_move")] >> > +) >> > + >> >> I realise this is (like you say) based on the neon.md pattern, >> but we should use separate vmovnb and vmovnt instructions instead >> of putting two instructions into a single pattern. >> >> One specific advantage to using separate patterns is that it would >> avoid the imprecision of the earlyclobber: the output only conflicts >> with operand 1 and can be tied to operand 2. >> > I'm not sure to follow: with the current patch, for test_clz_s16 we generate: > vldrh.16 q3, [r1] > vmovlb.s16 q2, q3 > vmovlt.s16 q3, q3 > vclz.i32 q2, q2 > vclz.i32 q3, q3 > vmovnb.i32 q1, q2 > vmovnt.i32 q1, q3 > vstrh.16 q1, [r0] > > If I remove the define_insn "@mve_vec_pack_trunc_<mode>" pattern > and update define_expand "vec_pack_trunc_<mode>" to call: > emit_insn (gen_mve_vmovnbq (VMOVNBQ_S, <MODE>mode, operands[0], > operands[0], operands[1])); > emit_insn (gen_mve_vmovntq (VMOVNTQ_S, <MODE>mode, operands[0], > operands[0], operands[2])); > instead, we now generate: > vldrh.16 q3, [r1] > vmovlb.s16 q1, q3 > vldr.64 d4, .L5 > vldr.64 d5, .L5+8 > vmovlt.s16 q3, q3 > vclz.i32 q1, q1 > vclz.i32 q3, q3 > vmovnb.i32 q2, q1 > vmovnt.i32 q2, q3 > vstrh.16 q2, [r0] > bx lr > .L6: > .align 3 > .L5: > .short 0 > .short 0 > .short 0 > .short 0 > .short 0 > .short 0 > .short 0 > .short 0 > > That's what I tried to explain in the commit message.
Sorry, I missed that part of the message. If we want the first instruction to have an in/out operand whose initial value is undefined then I think the best way would be to have a second pattern for vmovnb (only) that acts like a unary operation. FWIW, in the above, we shouldn't assign to operands[0] twice. The result of the first operation needs to be a new temporary register. Thanks, Richard