On Thu, Feb 25, 2016 at 09:25:45AM +0000, Kyrill Tkachov wrote:
> Hi all,
> 
> In this wrong-code PR we get bad code when synthesising a TImode right shift
> by variable amount using DImode shifts during expand.
> 
> The expand_doubleword_shift function expands two paths: one where the
> variable amount is greater than or equal to GET_MODE_BITSIZE (DImode)
> (word_mode for aarch64) at runtime and one where it's less than that,
> and performs a conditional select between the two at the end.
> 
> The second path is the one that goes wrong here.
> The algorithm it uses for calculating a TImode shift when the variable shift 
> amount is <64 is:
> 
> 
>     ------DImode-----   ------DImode-----        ------DImode-----   ------DImode-----
>   { |__ target-hi ___| |___ target-lo ___| } = { |___ source-hi ___| |___ source-lo ___| } >> var_amnt
> 
> 1) carry = source-hi << 1
> 2) tmp = ~var_amnt  // either that, or BITS_PER_WORD - 1 - var_amnt, depending on shift_truncation_mask
> 3) carry = carry << tmp  // net effect: carry is source-hi shifted left by BITS_PER_WORD - var_amnt
> 4) target-lo = source-lo >>u var_amnt  // unsigned shift
> 5) target-lo = target-lo | carry
> 6) target-hi = source-hi >> var_amnt
> 
> where the two DImode words source-hi and source-lo are the two words of the
> source TImode value and var_amnt is the register holding the variable shift
> amount. This is performed by the expand_subword_shift function in optabs.c.
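> 
> For illustration, here is a minimal C sketch of that sequence (my own
> sketch, not the optabs.c code; the helper name is hypothetical, and it
> assumes 64-bit words and the BITS_PER_WORD - 1 - var_amnt variant of
> step 2):
> 
>   #include <stdint.h>
> 
>   /* 128-bit logical right shift by a variable amount 0 <= amt < 64,
>      following steps 1)-6) above.  */
>   static void
>   dword_lshiftrt (uint64_t src_hi, uint64_t src_lo, unsigned amt,
>                   uint64_t *tgt_hi, uint64_t *tgt_lo)
>   {
>     uint64_t carry = src_hi << 1;   /* step 1 */
>     unsigned tmp = 64 - 1 - amt;    /* step 2 */
>     carry <<= tmp;                  /* step 3: src_hi << (64 - amt) */
>     *tgt_lo = src_lo >> amt;        /* step 4: unsigned shift */
>     *tgt_lo |= carry;               /* step 5 */
>     *tgt_hi = src_hi >> amt;        /* step 6 */
>   }
> 
> Splitting the carry computation across steps 1 and 3 keeps each
> individual shift strictly below 64, which is what makes the sequence
> well-defined even for amt == 0, where a single shift by 64 - amt would
> be out of range.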
> 
> Step 2) is valid only if the target truncates the shift amount modulo the
> width of the type it's shifting, that is, SHIFT_COUNT_TRUNCATED is true
> and TARGET_SHIFT_TRUNCATION_MASK is 63 in this case.
> 
> Step 3) is the one that goes wrong.  On aarch64 a left shift is usually
> implemented using the LSL instruction, but we also have alternatives that
> can use the SIMD registers and the USHL instruction.  In this case we end
> up using the USHL instruction.  However, although the USHL instruction
> does a DImode shift, it doesn't truncate the shift amount modulo 64, but
> rather masks it with 255.  Furthermore, the shift amount is interpreted
> as a two's complement signed integer and, if it's negative, the
> instruction performs a right shift.  This is in contrast to the normal
> LSL instruction, which just performs an unsigned shift by an amount
> truncated modulo 64.
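> 
> A simplified C model of the difference (my own sketch of the two
> behaviours, not the architecture pseudocode):
> 
>   #include <stdint.h>
> 
>   /* LSL Xd, Xn, Xm: the shift amount is truncated modulo 64.  */
>   static uint64_t
>   lsl_model (uint64_t x, uint64_t amt)
>   {
>     return x << (amt & 63);
>   }
> 
>   /* USHL Dd, Dn, Dm: the low byte of the amount is read as a signed
>      integer; negative amounts shift right instead of left.  */
>   static uint64_t
>   ushl_model (uint64_t x, uint64_t amt)
>   {
>     int shift = (int8_t) (amt & 255);
>     if (shift >= 0)
>       return shift >= 64 ? 0 : x << shift;
>     return -shift >= 64 ? 0 : x >> -shift;
>   }
> 
> So with var_amnt == 3, step 2) computes ~3: LSL sees ~3 & 63 == 60 and
> shifts left as intended, while USHL sees the low byte 0xfc == -4 and
> shifts right instead.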
> 
> Now, on aarch64 SHIFT_COUNT_TRUNCATED is defined as !TARGET_SIMD, so we don't
> assume shift amounts are truncated unless we know we can only ever use the
> LSL instructions for variable shifts.
> 
> However, we still return 63 as the TARGET_SHIFT_TRUNCATION_MASK for DImode,
> so expand_subword_shift assumes that since the mask is non-zero it's a valid
> shift truncation mask.  The documentation for TARGET_SHIFT_TRUNCATION_MASK
> says:
> " The default implementation of this function returns
>   @code{GET_MODE_BITSIZE (@var{mode}) - 1} if @code{SHIFT_COUNT_TRUNCATED}
>   and 0 otherwise."
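> 
> For reference, the default hook in targhooks.c implements exactly that
> (quoting from memory, so treat it as approximate):
> 
>   unsigned HOST_WIDE_INT
>   default_shift_truncation_mask (machine_mode mode)
>   {
>     return SHIFT_COUNT_TRUNCATED ? GET_MODE_BITSIZE (mode) - 1 : 0;
>   }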
> 
> So since for TARGET_SIMD we cannot guarantee that all shifts truncate their
> amount, we should be returning 0 in TARGET_SHIFT_TRUNCATION_MASK when
> SHIFT_COUNT_TRUNCATED is false.  This is what the patch does, and with it
> step 2) becomes:
> 2) tmp = BITS_PER_WORD - 1 - var_amnt
> which behaves correctly on aarch64.
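> 
> The two variants of step 2) agree whenever the shift amount really is
> truncated modulo 64, because ~n and 63 - n have the same low six bits.
> A quick self-contained check of both halves of the argument (my sketch):
> 
>   #include <assert.h>
>   #include <stdint.h>
> 
>   int
>   main (void)
>   {
>     for (uint64_t n = 0; n < 64; n++)
>       {
>         /* Under a 6-bit truncation mask the two variants coincide...  */
>         assert ((~n & 63) == 63 - n);
>         /* ...but USHL reads the low byte of ~n as a negative amount.  */
>         assert ((int8_t) (~n & 255) < 0);
>       }
>     return 0;
>   }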
> 
> Unfortunately, the testcase from the PR has very recently gone latent on
> trunk because it depends on register allocation picking a particular
> alternative from the *aarch64_ashl_sisd_or_int_<mode>3 pattern in aarch64.md.
> I tried some inline asm tricks to force the problematic alternative, but
> was unsuccessful.  Nevertheless, I've included the testcase in the patch;
> I suppose it's better than nothing.

Thanks for the great write-up. This level of detail is very helpful for
review.

OK for trunk.

Thanks,
James

> 
> Bootstrapped and tested on aarch64.
> 
> Ok for trunk?
> 
> Thanks,
> Kyrill
> 
> 
> 2016-02-25  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
> 
>     PR target/69613
>     * config/aarch64/aarch64.c (aarch64_shift_truncation_mask):
>     Return 0 if !SHIFT_COUNT_TRUNCATED.
> 
> 2016-02-25  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
> 
>     PR target/69613
>     * gcc.dg/torture/pr69613.c: New test.

> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index d0d15b4feee70a5ca6af8dd16c7667cbcd844dbf..2e69e345545e591d1de76c2d175aac476e6e1107 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -11171,7 +11171,8 @@ static unsigned HOST_WIDE_INT
>  aarch64_shift_truncation_mask (machine_mode mode)
>  {
>    return
> -    (aarch64_vector_mode_supported_p (mode)
> +    (!SHIFT_COUNT_TRUNCATED
> +     || aarch64_vector_mode_supported_p (mode)
>       || aarch64_vect_struct_mode_p (mode)) ? 0 : (GET_MODE_BITSIZE (mode) - 1);
>  }
>  
> diff --git a/gcc/testsuite/gcc.dg/torture/pr69613.c b/gcc/testsuite/gcc.dg/torture/pr69613.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..44f2b0cc91ac4b12c4d255b608f95bc8bf016923
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/torture/pr69613.c
> @@ -0,0 +1,40 @@
> +/* PR target/69613.  */
> +/* { dg-do run { target int128 } } */
> +/* { dg-additional-options "-mavx" { target { i?86-*-* x86_64-*-* } } } */
> +
> +typedef unsigned short u16;
> +typedef unsigned short v32u16 __attribute__ ((vector_size (32)));
> +typedef unsigned int u32;
> +typedef unsigned int v32u32 __attribute__ ((vector_size (32)));
> +typedef unsigned long long u64;
> +typedef unsigned long long v32u64 __attribute__ ((vector_size (32)));
> +typedef unsigned __int128 u128;
> +typedef unsigned __int128 v32u128 __attribute__ ((vector_size (32)));
> +
> +u128 __attribute__ ((noinline, noclone))
> +foo (u32 u32_0, u64 u64_1, u128 u128_1, v32u16 v32u16_0, v32u128 v32u128_0,
> +     v32u16 v32u16_1, v32u32 v32u32_1, v32u64 v32u64_1, v32u128 v32u128_1)
> +{
> +  u128 temp = (v32u128_1[0] << ((-u32_0) & 127));
> +  u32 t2 = (u32_0 & 127);
> +  v32u128_1[0] = (v32u128_1[0] >> t2);
> +
> +  v32u128_1[0] ^= temp;
> +  v32u128_1 |= (v32u128){ v32u128_0[1] };
> +
> +  return u64_1 + u128_1 + v32u16_0[0] + v32u16_0[1] + v32u16_1[11]
> +      + v32u16_1[12] + v32u16_1[13] + v32u32_1[0] + v32u32_1[1]
> +      + v32u32_1[2] + v32u64_1[1] + v32u64_1[2] + v32u64_1[3] + v32u128_1[0]
> +      + v32u128_1[1];
> +}
> +
> +int
> +main ()
> +{
> +  u128 x
> +      = foo (1, 1, 1, (v32u16){ 1, 1, 1 }, (v32u128){ 1 }, (v32u16){ 1, 1, 1 },
> +          (v32u32){ 1 }, (v32u64){ 1, 1, 1 }, (v32u128){ -1 });
> +  if (x != 6)
> +    __builtin_abort ();
> +  return 0;
> +}
