On Fri, Jan 26, 2018 at 01:54:42PM +0000, Richard Sandiford wrote:
> This patch deals with cases in which a CONST_VECTOR contains a
> repeating bit pattern that is wider than one element but narrower
> than 128 bits.  The current code:
> 
> * treats the repeating pattern as a single element
> * uses the associated LD1R to load and replicate it (such as LD1RD
>   for 64-bit patterns)
> * uses a subreg to cast the result back to the original vector type
> 
> The problem is that for big-endian targets, the final cast is
> effectively a form of element reverse.  E.g. say we're using LD1RD to load
> 16-bit elements, with h being the high parts and l being the low parts:
> 
>                                +-----+-----+-----+-----+-----+----
>                          lanes |  0  |  1  |  2  |  3  |  4  | ...
>                                +-----+-----+-----+-----+-----+----
>      memory              bytes |h0 l0 h1 l1 h2 l2 h3 l3 h0 l0 ....
>                                +----------------------------------
>                                  V  V  V  V  V  V  V  V
>                      ----------+-----------------------+
>     register         ....      |           0           |
>      after           ----------+-----------------------+  lsb
>      LD1RD           .... h3 l3 h0 l0 h1 l1 h2 l2 h3 l3|
>                      ----------------------------------+
> 
>                      ----+-----+-----+-----+-----+-----+
>     expected         ... |  4  |  3  |  2  |  1  |  0  |
>     register         ----+-----+-----+-----+-----+-----+  lsb
>     contents         .... h0 l0 h3 l3 h2 l2 h1 l1 h0 l0|
>                      ----------------------------------+
> 
> A later patch fixes the handling of general subregs to account
> for this, but it means that we need to do a REV instruction
> after the load.  It seems better to use LD1RQ[BHW] on a 128-bit
> pattern instead, since that gets the endianness right without
> a separate fixup instruction.
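> 
> For illustration, a kernel along these lines (a sketch in the spirit of
> the slp_* tests, not the exact testsuite source) gives the vectoriser a
> repeating four-element 16-bit constant, i.e. a 64-bit pattern that the
> current code loads with LD1RD, which is the case shown in the diagram
> above:
> 
>   #include <stdint.h>
> 
>   void
>   vec_slp_int16 (int16_t *restrict a, int n)
>   {
>     for (int i = 0; i < n; ++i)
>       {
>         /* Four 16-bit lanes get four different constants, so the
>            SLP constant vector is the repeating 64-bit pattern
>            { 1, 2, 3, 4, 1, 2, 3, 4, ... }.  */
>         a[i * 4] += 1;
>         a[i * 4 + 1] += 2;
>         a[i * 4 + 2] += 3;
>         a[i * 4 + 3] += 4;
>       }
>   }
> 
> With this patch, big-endian targets load such a constant with LD1RQ on
> a 128-bit pattern instead, as the testsuite changes below expect.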
> 
> This is another step towards fixing sve/slp_* for aarch64_be.
> 
> Tested on aarch64_be-elf and aarch64-linux-gnu.  OK to install?

OK.

Thanks,
James

> 
> 2018-01-26  Richard Sandiford  <richard.sandif...@linaro.org>
> 
> gcc/
>       * config/aarch64/aarch64.c (aarch64_expand_sve_const_vector): Prefer
>       the TImode handling for big-endian targets.
> 
> gcc/testsuite/
>       * gcc.target/aarch64/sve/slp_2.c: Expect LD1RQ to be used instead
>       of LD1R[HWD] for multi-element constants on big-endian targets.
>       * gcc.target/aarch64/sve/slp_3.c: Likewise.
>       * gcc.target/aarch64/sve/slp_4.c: Likewise.
> 
> Index: gcc/config/aarch64/aarch64.c
> ===================================================================
> --- gcc/config/aarch64/aarch64.c      2018-01-26 13:49:00.071630173 +0000
> +++ gcc/config/aarch64/aarch64.c      2018-01-26 13:51:19.665760175 +0000
> @@ -2824,10 +2824,18 @@ aarch64_expand_sve_const_vector (rtx des
>        /* The constant is a repeating sequence of at least two elements,
>        where the repeating elements occupy no more than 128 bits.
>        Get an integer representation of the replicated value.  */
> -      unsigned int int_bits = GET_MODE_UNIT_BITSIZE (mode) * npatterns;
> -      gcc_assert (int_bits <= 128);
> -
> -      scalar_int_mode int_mode = int_mode_for_size (int_bits, 0).require ();
> +      scalar_int_mode int_mode;
> +      if (BYTES_BIG_ENDIAN)
> +     /* For now, always use LD1RQ to load the value on big-endian
> +        targets, since the handling of smaller integers includes a
> +        subreg that is semantically an element reverse.  */
> +     int_mode = TImode;
> +      else
> +     {
> +       unsigned int int_bits = GET_MODE_UNIT_BITSIZE (mode) * npatterns;
> +       gcc_assert (int_bits <= 128);
> +       int_mode = int_mode_for_size (int_bits, 0).require ();
> +     }
>        rtx int_value = simplify_gen_subreg (int_mode, src, mode, 0);
>        if (int_value
>         && aarch64_expand_sve_widened_duplicate (dest, int_mode, int_value))
> Index: gcc/testsuite/gcc.target/aarch64/sve/slp_2.c
> ===================================================================
> --- gcc/testsuite/gcc.target/aarch64/sve/slp_2.c      2018-01-26 13:49:00.071630173 +0000
> +++ gcc/testsuite/gcc.target/aarch64/sve/slp_2.c      2018-01-26 13:51:19.665760175 +0000
> @@ -29,9 +29,12 @@ #define TEST_ALL(T)                                \
>  
>  TEST_ALL (VEC_PERM)
>  
> -/* { dg-final { scan-assembler-times {\tld1rh\tz[0-9]+\.h, } 2 } } */
> -/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 3 } } */
> -/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 3 } } */
> +/* { dg-final { scan-assembler-times {\tld1rh\tz[0-9]+\.h, } 2 { target aarch64_little_endian } } } */
> +/* { dg-final { scan-assembler-times {\tld1rqb\tz[0-9]+\.b, } 2 { target aarch64_big_endian } } } */
> +/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 3 { target aarch64_little_endian } } } */
> +/* { dg-final { scan-assembler-times {\tld1rqh\tz[0-9]+\.h, } 3 { target aarch64_big_endian } } } */
> +/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 3 { target aarch64_little_endian } } } */
> +/* { dg-final { scan-assembler-times {\tld1rqw\tz[0-9]+\.s, } 3 { target aarch64_big_endian } } } */
>  /* { dg-final { scan-assembler-times {\tld1rqd\tz[0-9]+\.d, } 3 } } */
>  /* { dg-final { scan-assembler-not {\tzip1\t} } } */
>  /* { dg-final { scan-assembler-not {\tzip2\t} } } */
> Index: gcc/testsuite/gcc.target/aarch64/sve/slp_3.c
> ===================================================================
> --- gcc/testsuite/gcc.target/aarch64/sve/slp_3.c      2018-01-26 13:49:00.071630173 +0000
> +++ gcc/testsuite/gcc.target/aarch64/sve/slp_3.c      2018-01-26 13:51:19.665760175 +0000
> @@ -32,9 +32,12 @@ #define TEST_ALL(T)                                \
>  TEST_ALL (VEC_PERM)
>  
>  /* 1 for each 8-bit type.  */
> -/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 2 } } */
> +/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 2 { target aarch64_little_endian } } } */
> +/* { dg-final { scan-assembler-times {\tld1rqb\tz[0-9]+\.b, } 2 { target aarch64_big_endian } } } */
>  /* 1 for each 16-bit type and 4 for double.  */
> -/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 7 } } */
> +/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 7 { target aarch64_little_endian } } } */
> +/* { dg-final { scan-assembler-times {\tld1rqh\tz[0-9]+\.h, } 3 { target aarch64_big_endian } } } */
> +/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 4 { target aarch64_big_endian } } } */
>  /* 1 for each 32-bit type.  */
>  /* { dg-final { scan-assembler-times {\tld1rqw\tz[0-9]+\.s, } 3 } } */
>  /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #41\n} 2 } } */
> Index: gcc/testsuite/gcc.target/aarch64/sve/slp_4.c
> ===================================================================
> --- gcc/testsuite/gcc.target/aarch64/sve/slp_4.c      2018-01-26 13:49:00.072630137 +0000
> +++ gcc/testsuite/gcc.target/aarch64/sve/slp_4.c      2018-01-26 13:51:19.665760175 +0000
> @@ -36,7 +36,9 @@ #define TEST_ALL(T)                         \
>  TEST_ALL (VEC_PERM)
>  
>  /* 1 for each 8-bit type, 4 for each 32-bit type and 8 for double.  */
> -/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 22 } } */
> +/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 22 { target aarch64_little_endian } } } */
> +/* { dg-final { scan-assembler-times {\tld1rqb\tz[0-9]+\.b, } 2 { target aarch64_big_endian } } } */
> +/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 20 { target aarch64_big_endian } } } */
>  /* 1 for each 16-bit type.  */
>  /* { dg-final { scan-assembler-times {\tld1rqh\tz[0-9]\.h, } 3 } } */
>  /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #99\n} 2 } } */
