Re: [PATCH] aarch64: Optimize SVE extract last to Neon lane extract for 128-bit VLS.

Jennifer Schmitz Thu, 01 May 2025 09:23:42 -0700


> On 28 Apr 2025, at 16:29, Richard Sandiford <richard.sandif...@arm.com> wrote:
> 
> External email: Use caution opening links or attachments
> 
> 
> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>> On 27 Apr 2025, at 08:42, Tamar Christina <tamar.christ...@arm.com> wrote:
>>> 
>>> External email: Use caution opening links or attachments
>>> 
>>> 
>>>> -----Original Message-----
>>>> From: Richard Sandiford <richard.sandif...@arm.com>
>>>> Sent: Friday, April 25, 2025 4:45 PM
>>>> To: Jennifer Schmitz <jschm...@nvidia.com>
>>>> Cc: gcc-patches@gcc.gnu.org
>>>> Subject: Re: [PATCH] aarch64: Optimize SVE extract last to Neon lane 
>>>> extract for
>>>> 128-bit VLS.
>>>> 
>>>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>>>> For the test case
>>>>> int32_t foo (svint32_t x)
>>>>> {
>>>>> svbool_t pg = svpfalse ();
>>>>> return svlastb_s32 (pg, x);
>>>>> }
>>>>> compiled with -O3 -mcpu=grace -msve-vector-bits=128, GCC produced:
>>>>> foo:
>>>>>   pfalse  p3.b
>>>>>   lastb   w0, p3, z0.s
>>>>>   ret
>>>>> when it could use a Neon lane extract instead:
>>>>> foo:
>>>>>   umov    w0, v0.s[3]
>>>>>   ret
>>>>> 
>>>>> We implemented this optimization by guarding the emission of
>>>>> pfalse+lastb in the pattern vec_extract<mode><Vel> by
>>>>> known_gt (BYTES_PER_SVE_VECTOR, 16). Thus, for a last-extract operation
>>>>> in 128-bit VLS, the pattern *vec_extract<mode><Vel>_v128 is used instead.
>>>>> 
>>>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no regression.
>>>>> OK for mainline?
>>>>> 
>>>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>>>>> 
>>>>> gcc/
>>>>>   * config/aarch64/aarch64-sve.md (vec_extract<mode><Vel>):
>>>>>   Prevent the emission of pfalse+lastb for 128-bit VLS.
>>>>> 
>>>>> gcc/testsuite/
>>>>>   * gcc.target/aarch64/sve/extract_last_128.c: New test.
>>>> 
>>>> OK, thanks.
>>>> 
>>> 
>>> Hi Both,
>>> 
>>> If I may, how about viewing this transformation differently.
>>> 
>>> How about instead make pfalse + lastb become rev + umov, which is
>>> faster on all Arm micro-architectures for both VLS and VLA and then
>>> the optimization becomes lowering rev + umov for VL128 into umov[3]?
>>> 
>>> This would speed up this case for everything.
>>> 
>>> It would also be good to have a test that checks that
>>> 
>>> #include <arm_sve.h>
>>> 
>>> int foo (svint32_t a)
>>> {
>>>   return svrev (a)[0];
>>> }
>>> 
>>> Works. At the moment, with -msve-vector-bits=128 GCC transforms
>>> this back into lastb, which is a de-optimization. Your current patch does 
>>> fix it
>>> so would be good to test this too.
>> Hi Richard and Tamar,
>> thanks for the feedback.
>> Tamar’s suggestion sounds like a more general solution.
>> I think we could either change define_expand "vec_extract<mode><Vel>”
>> to not emit pfalse+lastb in case the last element is extracted, but emit 
>> rev+umov instead,
>> or we could add a define_insn_and_split to change pfalse+lastb to rev+umov 
>> (and just umov for VL128).
> 
> I think we should do it in the expand.  But ISTM that there are
> two separate things here:
> 
> (1) We shouldn't use the VLA expansion for VLS cases if the VLS case
>    can do something simpler.
> 
> (2) We should improve the expansion based on what uarches have actually
>    implemented.  (The current code predates h/w.)
> 
> I think we want (1) even if (2) involves using REV+UMOV, since for
> (1) we still want a single UMOV for the top of the range.  And we'd
> still want the tests in the patch.
> 
> So IMO, from a staging perspective, it makes sense to do (1) first
> and treat (2) as a follow-on.
> 
> The REV approach isn't restricted to the last element.  It should be able
> to handle:
> 
>  int8_t foo(svint8_t x) { return x[svcntb()-2]; }
> 
> as well.  In other words, if:
> 
>   auto idx = GET_MODE_NUNITS (mode) - val - 1;
> 
> is a constant in the range [0, 255], we can do a reverse and then
> fall through to the normal CONST_INT expansion with idx as the index.
> 
> But I suppose that does mean that, rather than:
> 
>        && known_gt (BYTES_PER_SVE_VECTOR, 16))
> 
> we should instead check:
> 
>        && !val.is_constant ()
> 
> For 256-bit SVE we should use DUP instead of PLAST+LASTB or REV+UMOV.
I adjusted the patch to use && !val.is_constant () as check, such that the 
extraction
of the last element is matched by other patterns for VLS of all vector widths.
256-bit and 512-bit VLS are now matched by *vec_extract<mode><Vel>_dup
and 1024-bit and 2048-bit by *vec_extract<mode><Vel>_ext.
There were existing tests for 256-bit, 512-bit, 1024-bit and 2048-bit that 
could be adjusted.
128-bit VLS is covered by the new test in the patch. Let me know if I shall 
change it to be
more analogous to the other extract_*.c tests.


About implementing point (2): Do we only want the REV approach you outlined for 
VLA,
now that VLS is redirected to other patterns?

Thanks,
Jennifer

For the test case
int32_t foo (svint32_t x)
{
  svbool_t pg = svpfalse ();
  return svlastb_s32 (pg, x);
}
compiled with -O3 -mcpu=grace -msve-vector-bits=128, GCC produced:
foo:
        pfalse  p3.b
        lastb   w0, p3, z0.s
        ret
when it could use a Neon lane extract instead:
foo:
        umov    w0, v0.s[3]
        ret

Similar optimizations can be made for VLS with other vector widths.

We implemented this optimization by guarding the emission of
pfalse+lastb in the pattern vec_extract<mode><Vel> by
!val.is_constant ().
Thus, for last-extract operations with VLS, the patterns
*vec_extract<mode><Vel>_v128, *vec_extract<mode><Vel>_dup, or
*vec_extract<mode><Vel>_ext are used instead.
We added tests for 128-bit VLS and adjusted the tests for the other vector
widths.

The patch was bootstrapped and tested on aarch64-linux-gnu, no regression.
OK for mainline?

Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>

gcc/
        * config/aarch64/aarch64-sve.md (vec_extract<mode><Vel>):
        Prevent the emission of pfalse+lastb for VLS.

gcc/testsuite/
        * gcc.target/aarch64/sve/extract_last_128.c: New test.
        * gcc.target/aarch64/sve/extract_1.c: Adjust expected outcome.
        * gcc.target/aarch64/sve/extract_2.c: Likewise.
        * gcc.target/aarch64/sve/extract_3.c: Likewise.
        * gcc.target/aarch64/sve/extract_4.c: Likewise.
---
 gcc/config/aarch64/aarch64-sve.md             |  7 ++--
 .../gcc.target/aarch64/sve/extract_1.c        | 22 +++++--------
 .../gcc.target/aarch64/sve/extract_2.c        | 23 ++++++-------
 .../gcc.target/aarch64/sve/extract_3.c        | 23 ++++++-------
 .../gcc.target/aarch64/sve/extract_4.c        | 23 ++++++-------
 .../gcc.target/aarch64/sve/extract_last_128.c | 33 +++++++++++++++++++
 6 files changed, 76 insertions(+), 55 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/extract_last_128.c

diff --git a/gcc/config/aarch64/aarch64-sve.md 
b/gcc/config/aarch64/aarch64-sve.md
index d4af3706294..7bf12ff25cc 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -2969,10 +2969,11 @@
   {
     poly_int64 val;
     if (poly_int_rtx_p (operands[2], &val)
-       && known_eq (val, GET_MODE_NUNITS (<MODE>mode) - 1))
+       && known_eq (val, GET_MODE_NUNITS (<MODE>mode) - 1)
+       && !val.is_constant ())
       {
-       /* The last element can be extracted with a LASTB and a false
-          predicate.  */
+       /* For VLA, extract the last element with a LASTB and a false
+          predicate. */
        rtx sel = aarch64_pfalse_reg (<VPRED>mode);
        emit_insn (gen_extract_last_<mode> (operands[0], sel, operands[1]));
        DONE;
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/extract_1.c 
b/gcc/testsuite/gcc.target/aarch64/sve/extract_1.c
index 5d5edf26c19..92ae81cc4e3 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/extract_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/extract_1.c
@@ -56,40 +56,36 @@ typedef _Float16 vnx8hf __attribute__((vector_size (32)));
 
 TEST_ALL (EXTRACT)
 
-/* { dg-final { scan-assembler-times {\tfmov\tx[0-9]+, d[0-9]+\n} 2 { target 
aarch64_little_endian } } } */
-/* { dg-final { scan-assembler-times {\tumov\tx[0-9]+, v[0-9]+\.d\[0\]\n} 1 { 
target aarch64_big_endian } } } */
+/* { dg-final { scan-assembler-times {\tfmov\tx[0-9]+, d[0-9]+\n} 3 { target 
aarch64_little_endian } } } */
+/* { dg-final { scan-assembler-times {\tumov\tx[0-9]+, v[0-9]+\.d\[0\]\n} 2 { 
target aarch64_big_endian } } } */
 /* { dg-final { scan-assembler-times {\tumov\tx[0-9]+, v[0-9]+\.d\[1\]\n} 1 } 
} */
 /* { dg-final { scan-assembler-not {\tdup\td[0-9]+, v[0-9]+\.d\[0\]\n} } } */
 /* { dg-final { scan-assembler-times {\tdup\td[0-9]+, v[0-9]+\.d\[1\]\n} 1 } } 
*/
 /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.d, z[0-9]+\.d\[2\]\n} 2 
} } */
-/* { dg-final { scan-assembler-times {\tlastb\tx[0-9]+, p[0-7], z[0-9]+\.d\n} 
1 } } */
-/* { dg-final { scan-assembler-times {\tlastb\td[0-9]+, p[0-7], z[0-9]+\.d\n} 
1 } } */
+/* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.d, z[0-9]+\.d\[3\]\n} 2 
} } */
 
-/* { dg-final { scan-assembler-times {\tfmov\tw[0-9]+, s[0-9]+\n} 2 { target 
aarch64_little_endian } } } */
-/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[0\]\n} 1 { 
target aarch64_big_endian } } } */
+/* { dg-final { scan-assembler-times {\tfmov\tw[0-9]+, s[0-9]+\n} 3 { target 
aarch64_little_endian } } } */
+/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[0\]\n} 2 { 
target aarch64_big_endian } } } */
 /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[1\]\n} 1 } 
} */
 /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[3\]\n} 1 } 
} */
 /* { dg-final { scan-assembler-not {\tdup\ts[0-9]+, v[0-9]+\.s\[0\]\n} } } */
 /* { dg-final { scan-assembler-times {\tdup\ts[0-9]+, v[0-9]+\.s\[1\]\n} 1 } } 
*/
 /* { dg-final { scan-assembler-times {\tdup\ts[0-9]+, v[0-9]+\.s\[3\]\n} 1 } } 
*/
 /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.s, z[0-9]+\.s\[4\]\n} 2 
} } */
-/* { dg-final { scan-assembler-times {\tlastb\tw[0-9]+, p[0-7], z[0-9]+\.s\n} 
1 } } */
-/* { dg-final { scan-assembler-times {\tlastb\ts[0-9]+, p[0-7], z[0-9]+\.s\n} 
1 } } */
+/* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.s, z[0-9]+\.s\[7\]\n} 2 
} } */
 
 /* Also used to move the result of a non-Advanced SIMD extract.  */
-/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[0\]\n} 2 } 
} */
+/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[0\]\n} 3 } 
} */
 /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[1\]\n} 1 } 
} */
 /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[7\]\n} 1 } 
} */
 /* { dg-final { scan-assembler-not {\tdup\th[0-9]+, v[0-9]+\.h\[0\]\n} } } */
 /* { dg-final { scan-assembler-times {\tdup\th[0-9]+, v[0-9]+\.h\[1\]\n} 1 } } 
*/
 /* { dg-final { scan-assembler-times {\tdup\th[0-9]+, v[0-9]+\.h\[7\]\n} 1 } } 
*/
 /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.h, z[0-9]+\.h\[8\]\n} 2 
} } */
-/* { dg-final { scan-assembler-times {\tlastb\tw[0-9]+, p[0-7], z[0-9]+\.h\n} 
1 } } */
-/* { dg-final { scan-assembler-times {\tlastb\th[0-9]+, p[0-7], z[0-9]+\.h\n} 
1 } } */
+/* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.h, z[0-9]+\.h\[15\]\n} 2 
} } */
 
 /* Also used to move the result of a non-Advanced SIMD extract.  */
-/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[0\]\n} 2 } 
} */
+/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[0\]\n} 3 } 
} */
 /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[1\]\n} 1 } 
} */
 /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[15\]\n} 1 } 
} */
 /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.b, z[0-9]+\.b\[16\]\n} 1 
} } */
-/* { dg-final { scan-assembler-times {\tlastb\tw[0-9]+, p[0-7], z[0-9]+\.b\n} 
1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/extract_2.c 
b/gcc/testsuite/gcc.target/aarch64/sve/extract_2.c
index 0e6ec836228..a3886b2de22 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/extract_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/extract_2.c
@@ -56,40 +56,37 @@ typedef _Float16 vnx16hf __attribute__((vector_size (64)));
 
 TEST_ALL (EXTRACT)
 
-/* { dg-final { scan-assembler-times {\tfmov\tx[0-9]+, d[0-9]+\n} 2 { target 
aarch64_little_endian } } } */
-/* { dg-final { scan-assembler-times {\tumov\tx[0-9]+, v[0-9]+\.d\[0\]\n} 1 { 
target aarch64_big_endian } } } */
+/* { dg-final { scan-assembler-times {\tfmov\tx[0-9]+, d[0-9]+\n} 3 { target 
aarch64_little_endian } } } */
+/* { dg-final { scan-assembler-times {\tumov\tx[0-9]+, v[0-9]+\.d\[0\]\n} 2 { 
target aarch64_big_endian } } } */
 /* { dg-final { scan-assembler-times {\tumov\tx[0-9]+, v[0-9]+\.d\[1\]\n} 1 } 
} */
 /* { dg-final { scan-assembler-not {\tdup\td[0-9]+, v[0-9]+\.d\[0\]\n} } } */
 /* { dg-final { scan-assembler-times {\tdup\td[0-9]+, v[0-9]+\.d\[1\]\n} 1 } } 
*/
 /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.d, z[0-9]+\.d\[2\]\n} 2 
} } */
-/* { dg-final { scan-assembler-times {\tlastb\tx[0-9]+, p[0-7], z[0-9]+\.d\n} 
1 } } */
-/* { dg-final { scan-assembler-times {\tlastb\td[0-9]+, p[0-7], z[0-9]+\.d\n} 
1 } } */
+/* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.d, z[0-9]+\.d\[7\]\n} 2 
} } */
 
-/* { dg-final { scan-assembler-times {\tfmov\tw[0-9]+, s[0-9]+\n} 2 { target 
aarch64_little_endian } } } */
-/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[0\]\n} 1 { 
target aarch64_big_endian } } } */
+/* { dg-final { scan-assembler-times {\tfmov\tw[0-9]+, s[0-9]+\n} 3 { target 
aarch64_little_endian } } } */
+/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[0\]\n} 2 { 
target aarch64_big_endian } } } */
 /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[1\]\n} 1 } 
} */
 /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[3\]\n} 1 } 
} */
 /* { dg-final { scan-assembler-not {\tdup\ts[0-9]+, v[0-9]+\.s\[0\]\n} } } */
 /* { dg-final { scan-assembler-times {\tdup\ts[0-9]+, v[0-9]+\.s\[1\]\n} 1 } } 
*/
 /* { dg-final { scan-assembler-times {\tdup\ts[0-9]+, v[0-9]+\.s\[3\]\n} 1 } } 
*/
 /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.s, z[0-9]+\.s\[4\]\n} 2 
} } */
-/* { dg-final { scan-assembler-times {\tlastb\tw[0-9]+, p[0-7], z[0-9]+\.s\n} 
1 } } */
-/* { dg-final { scan-assembler-times {\tlastb\ts[0-9]+, p[0-7], z[0-9]+\.s\n} 
1 } } */
+/* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.s, z[0-9]+\.s\[15\]\n} 2 
} } */
 
 /* Also used to move the result of a non-Advanced SIMD extract.  */
-/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[0\]\n} 2 } 
} */
+/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[0\]\n} 3 } 
} */
 /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[1\]\n} 1 } 
} */
 /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[7\]\n} 1 } 
} */
 /* { dg-final { scan-assembler-not {\tdup\th[0-9]+, v[0-9]+\.h\[0\]\n} } } */
 /* { dg-final { scan-assembler-times {\tdup\th[0-9]+, v[0-9]+\.h\[1\]\n} 1 } } 
*/
 /* { dg-final { scan-assembler-times {\tdup\th[0-9]+, v[0-9]+\.h\[7\]\n} 1 } } 
*/
 /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.h, z[0-9]+\.h\[8\]\n} 2 
} } */
-/* { dg-final { scan-assembler-times {\tlastb\tw[0-9]+, p[0-7], z[0-9]+\.h\n} 
1 } } */
-/* { dg-final { scan-assembler-times {\tlastb\th[0-9]+, p[0-7], z[0-9]+\.h\n} 
1 } } */
+/* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.h, z[0-9]+\.h\[31\]\n} 2 
} } */
 
 /* Also used to move the result of a non-Advanced SIMD extract.  */
-/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[0\]\n} 2 } 
} */
+/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[0\]\n} 3 } 
} */
 /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[1\]\n} 1 } 
} */
 /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[15\]\n} 1 } 
} */
 /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.b, z[0-9]+\.b\[16\]\n} 1 
} } */
-/* { dg-final { scan-assembler-times {\tlastb\tw[0-9]+, p[0-7], z[0-9]+\.b\n} 
1 } } */
+/* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.b, z[0-9]+\.b\[63\]\n} 1 
} } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/extract_3.c 
b/gcc/testsuite/gcc.target/aarch64/sve/extract_3.c
index 0d7a2fa2527..c22b8a952de 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/extract_3.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/extract_3.c
@@ -77,18 +77,16 @@ typedef _Float16 vnx32hf __attribute__((vector_size (128)));
 
 TEST_ALL (EXTRACT)
 
-/* { dg-final { scan-assembler-times {\tfmov\tx[0-9]+, d[0-9]+\n} 5 { target 
aarch64_little_endian } } } */
-/* { dg-final { scan-assembler-times {\tumov\tx[0-9]+, v[0-9]+\.d\[0\]\n} 1 { 
target aarch64_big_endian } } } */
+/* { dg-final { scan-assembler-times {\tfmov\tx[0-9]+, d[0-9]+\n} 6 { target 
aarch64_little_endian } } } */
+/* { dg-final { scan-assembler-times {\tumov\tx[0-9]+, v[0-9]+\.d\[0\]\n} 2 { 
target aarch64_big_endian } } } */
 /* { dg-final { scan-assembler-times {\tumov\tx[0-9]+, v[0-9]+\.d\[1\]\n} 1 } 
} */
 /* { dg-final { scan-assembler-not {\tdup\td[0-9]+, v[0-9]+\.d\[0\]\n} } } */
 /* { dg-final { scan-assembler-times {\tdup\td[0-9]+, v[0-9]+\.d\[1\]\n} 1 } } 
*/
 /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.d, z[0-9]+\.d\[2\]\n} 2 
} } */
 /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.d, z[0-9]+\.d\[7\]\n} 2 
} } */
-/* { dg-final { scan-assembler-times {\tlastb\tx[0-9]+, p[0-7], z[0-9]+\.d\n} 
1 } } */
-/* { dg-final { scan-assembler-times {\tlastb\td[0-9]+, p[0-7], z[0-9]+\.d\n} 
1 } } */
 
-/* { dg-final { scan-assembler-times {\tfmov\tw[0-9]+, s[0-9]+\n} 5 { target 
aarch64_little_endian } } } */
-/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[0\]\n} 1 { 
target aarch64_big_endian } } } */
+/* { dg-final { scan-assembler-times {\tfmov\tw[0-9]+, s[0-9]+\n} 6 { target 
aarch64_little_endian } } } */
+/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[0\]\n} 2 { 
target aarch64_big_endian } } } */
 /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[1\]\n} 1 } 
} */
 /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[3\]\n} 1 } 
} */
 /* { dg-final { scan-assembler-not {\tdup\ts[0-9]+, v[0-9]+\.s\[0\]\n} } } */
@@ -96,11 +94,9 @@ TEST_ALL (EXTRACT)
 /* { dg-final { scan-assembler-times {\tdup\ts[0-9]+, v[0-9]+\.s\[3\]\n} 1 } } 
*/
 /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.s, z[0-9]+\.s\[4\]\n} 2 
} } */
 /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.s, z[0-9]+\.s\[15\]\n} 2 
} } */
-/* { dg-final { scan-assembler-times {\tlastb\tw[0-9]+, p[0-7], z[0-9]+\.s\n} 
1 } } */
-/* { dg-final { scan-assembler-times {\tlastb\ts[0-9]+, p[0-7], z[0-9]+\.s\n} 
1 } } */
 
 /* Also used to move the result of a non-Advanced SIMD extract.  */
-/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[0\]\n} 5 } 
} */
+/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[0\]\n} 6 } 
} */
 /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[1\]\n} 1 } 
} */
 /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[7\]\n} 1 } 
} */
 /* { dg-final { scan-assembler-not {\tdup\th[0-9]+, v[0-9]+\.h\[0\]\n} } } */
@@ -108,19 +104,20 @@ TEST_ALL (EXTRACT)
 /* { dg-final { scan-assembler-times {\tdup\th[0-9]+, v[0-9]+\.h\[7\]\n} 1 } } 
*/
 /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.h, z[0-9]+\.h\[8\]\n} 2 
} } */
 /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.h, z[0-9]+\.h\[31\]\n} 2 
} } */
-/* { dg-final { scan-assembler-times {\tlastb\tw[0-9]+, p[0-7], z[0-9]+\.h\n} 
1 } } */
-/* { dg-final { scan-assembler-times {\tlastb\th[0-9]+, p[0-7], z[0-9]+\.h\n} 
1 } } */
 
 /* Also used to move the result of a non-Advanced SIMD extract.  */
-/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[0\]\n} 5 } 
} */
+/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[0\]\n} 6 } 
} */
 /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[1\]\n} 1 } 
} */
 /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[15\]\n} 1 } 
} */
 /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.b, z[0-9]+\.b\[16\]\n} 1 
} } */
 /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.b, z[0-9]+\.b\[63\]\n} 1 
} } */
-/* { dg-final { scan-assembler-times {\tlastb\tw[0-9]+, p[0-7], z[0-9]+\.b\n} 
1 } } */
 
 /* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, 
z[0-9]+\.b, #64\n} 7 } } */
 /* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, 
z[0-9]+\.b, #72\n} 2 } } */
 /* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, 
z[0-9]+\.b, #84\n} 2 } } */
 /* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, 
z[0-9]+\.b, #94\n} 2 } } */
 /* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, 
z[0-9]+\.b, #100\n} 1 } } */
+/* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, 
z[0-9]+\.b, #120\n} 2 } } */
+/* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, 
z[0-9]+\.b, #124\n} 2 } } */
+/* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, 
z[0-9]+\.b, #126\n} 2 } } */
+/* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, 
z[0-9]+\.b, #127\n} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/extract_4.c 
b/gcc/testsuite/gcc.target/aarch64/sve/extract_4.c
index a706291023f..0fa91757dfd 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/extract_4.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/extract_4.c
@@ -84,18 +84,16 @@ typedef _Float16 v128hf __attribute__((vector_size (256)));
 
 TEST_ALL (EXTRACT)
 
-/* { dg-final { scan-assembler-times {\tfmov\tx[0-9]+, d[0-9]+\n} 6 { target 
aarch64_little_endian } } } */
-/* { dg-final { scan-assembler-times {\tumov\tx[0-9]+, v[0-9]+\.d\[0\]\n} 1 { 
target aarch64_big_endian } } } */
+/* { dg-final { scan-assembler-times {\tfmov\tx[0-9]+, d[0-9]+\n} 7 { target 
aarch64_little_endian } } } */
+/* { dg-final { scan-assembler-times {\tumov\tx[0-9]+, v[0-9]+\.d\[0\]\n} 2 { 
target aarch64_big_endian } } } */
 /* { dg-final { scan-assembler-times {\tumov\tx[0-9]+, v[0-9]+\.d\[1\]\n} 1 } 
} */
 /* { dg-final { scan-assembler-not {\tdup\td[0-9]+, v[0-9]+\.d\[0\]\n} } } */
 /* { dg-final { scan-assembler-times {\tdup\td[0-9]+, v[0-9]+\.d\[1\]\n} 1 } } 
*/
 /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.d, z[0-9]+\.d\[2\]\n} 2 
} } */
 /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.d, z[0-9]+\.d\[7\]\n} 2 
} } */
-/* { dg-final { scan-assembler-times {\tlastb\tx[0-9]+, p[0-7], z[0-9]+\.d\n} 
1 } } */
-/* { dg-final { scan-assembler-times {\tlastb\td[0-9]+, p[0-7], z[0-9]+\.d\n} 
1 } } */
 
-/* { dg-final { scan-assembler-times {\tfmov\tw[0-9]+, s[0-9]+\n} 6 { target 
aarch64_little_endian } } } */
-/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[0\]\n} 1 { 
target aarch64_big_endian } } } */
+/* { dg-final { scan-assembler-times {\tfmov\tw[0-9]+, s[0-9]+\n} 7 { target 
aarch64_little_endian } } } */
+/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[0\]\n} 2 { 
target aarch64_big_endian } } } */
 /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[1\]\n} 1 } 
} */
 /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.s\[3\]\n} 1 } 
} */
 /* { dg-final { scan-assembler-not {\tdup\ts[0-9]+, v[0-9]+\.s\[0\]\n} } } */
@@ -103,11 +101,9 @@ TEST_ALL (EXTRACT)
 /* { dg-final { scan-assembler-times {\tdup\ts[0-9]+, v[0-9]+\.s\[3\]\n} 1 } } 
*/
 /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.s, z[0-9]+\.s\[4\]\n} 2 
} } */
 /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.s, z[0-9]+\.s\[15\]\n} 2 
} } */
-/* { dg-final { scan-assembler-times {\tlastb\tw[0-9]+, p[0-7], z[0-9]+\.s\n} 
1 } } */
-/* { dg-final { scan-assembler-times {\tlastb\ts[0-9]+, p[0-7], z[0-9]+\.s\n} 
1 } } */
 
 /* Also used to move the result of a non-Advanced SIMD extract.  */
-/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[0\]\n} 6 } 
} */
+/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[0\]\n} 7 } 
} */
 /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[1\]\n} 1 } 
} */
 /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h\[7\]\n} 1 } 
} */
 /* { dg-final { scan-assembler-not {\tdup\th[0-9]+, v[0-9]+\.h\[0\]\n} } } */
@@ -115,16 +111,13 @@ TEST_ALL (EXTRACT)
 /* { dg-final { scan-assembler-times {\tdup\th[0-9]+, v[0-9]+\.h\[7\]\n} 1 } } 
*/
 /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.h, z[0-9]+\.h\[8\]\n} 2 
} } */
 /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.h, z[0-9]+\.h\[31\]\n} 2 
} } */
-/* { dg-final { scan-assembler-times {\tlastb\tw[0-9]+, p[0-7], z[0-9]+\.h\n} 
1 } } */
-/* { dg-final { scan-assembler-times {\tlastb\th[0-9]+, p[0-7], z[0-9]+\.h\n} 
1 } } */
 
 /* Also used to move the result of a non-Advanced SIMD extract.  */
-/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[0\]\n} 6 } 
} */
+/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[0\]\n} 7 } 
} */
 /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[1\]\n} 1 } 
} */
 /* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.b\[15\]\n} 1 } 
} */
 /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.b, z[0-9]+\.b\[16\]\n} 1 
} } */
 /* { dg-final { scan-assembler-times {\tdup\tz[0-9]+\.b, z[0-9]+\.b\[63\]\n} 1 
} } */
-/* { dg-final { scan-assembler-times {\tlastb\tw[0-9]+, p[0-7], z[0-9]+\.b\n} 
1 } } */
 
 /* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, 
z[0-9]+\.b, #64\n} 7 } } */
 /* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, 
z[0-9]+\.b, #72\n} 2 } } */
@@ -135,3 +128,7 @@ TEST_ALL (EXTRACT)
 /* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, 
z[0-9]+\.b, #124\n} 2 } } */
 /* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, 
z[0-9]+\.b, #126\n} 2 } } */
 /* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, 
z[0-9]+\.b, #127\n} 1 } } */
+/* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, 
z[0-9]+\.b, #248\n} 2 } } */
+/* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, 
z[0-9]+\.b, #252\n} 2 } } */
+/* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, 
z[0-9]+\.b, #254\n} 2 } } */
+/* { dg-final { scan-assembler-times {\text\tz[0-9]+\.b, z[0-9]+\.b, 
z[0-9]+\.b, #255\n} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/extract_last_128.c 
b/gcc/testsuite/gcc.target/aarch64/sve/extract_last_128.c
new file mode 100644
index 00000000000..2684fb69756
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/extract_last_128.c
@@ -0,0 +1,33 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -msve-vector-bits=128" } */
+
+#include <arm_sve.h>
+
+#define TEST(TYPE, TY)                 \
+  TYPE extract_last_##TY (sv##TYPE x)  \
+  {                                    \
+    svbool_t pg = svpfalse ();         \
+    return svlastb_##TY (pg, x);       \
+  }
+
+TEST(bfloat16_t, bf16)
+TEST(float16_t, f16)
+TEST(float32_t, f32)
+TEST(float64_t, f64)
+TEST(int8_t, s8)
+TEST(int16_t, s16)
+TEST(int32_t, s32)
+TEST(int64_t, s64)
+TEST(uint8_t, u8)
+TEST(uint16_t, u16)
+TEST(uint32_t, u32)
+TEST(uint64_t, u64)
+
+/* { dg-final { scan-assembler-times {\tdup\th0, v0\.h\[7\]} 2 } } */
+/* { dg-final { scan-assembler-times {\tdup\ts0, v0\.s\[3\]} 1 } } */
+/* { dg-final { scan-assembler-times {\tdup\td0, v0\.d\[1\]} 1 } } */
+/* { dg-final { scan-assembler-times {\tumov\tw0, v0\.h\[7\]} 2 } } */
+/* { dg-final { scan-assembler-times {\tumov\tw0, v0\.b\[15\]} 2 } } */
+/* { dg-final { scan-assembler-times {\tumov\tw0, v0\.s\[3\]} 2 } } */
+/* { dg-final { scan-assembler-times {\tumov\tx0, v0\.d\[1\]} 2 } } */
+/* { dg-final { scan-assembler-not "lastb" } } */
\ No newline at end of file
-- 
2.34.1
> 
> Thanks,
> Richard

smime.p7s
Description: S/MIME cryptographic signature

Re: [PATCH] aarch64: Optimize SVE extract last to Neon lane extract for 128-bit VLS.

Reply via email to