luoxhu <luo...@linux.ibm.com> writes:
> Hi,
>
> On 2020/7/21 23:30, Richard Sandiford wrote:
>> Xiong Hu Luo <luo...@linux.ibm.com> writes:
>>> @@ -1872,9 +1872,27 @@ get_stored_val (store_info *store_info, machine_mode read_mode,
>>>      {
>>>        poly_int64 shift = gap * BITS_PER_UNIT;
>>>        poly_int64 access_size = GET_MODE_SIZE (read_mode) + gap;
>>> -      read_reg = find_shift_sequence (access_size, store_info, read_mode,
>>> -                                      shift, optimize_bb_for_speed_p (bb),
>>> -                                      require_cst);
>>> +      rtx rhs_subreg = NULL;
>>> +
>>> +      if (known_eq (GET_MODE_BITSIZE (store_mode), shift * 2))
>>> +        {
>>> +          scalar_int_mode inner_mode = smallest_int_mode_for_size (shift);
>>> +          poly_uint64 sub_off
>>> +            = ((!BYTES_BIG_ENDIAN)
>>> +               ? GET_MODE_SIZE (store_mode) - GET_MODE_SIZE (inner_mode)
>>> +               : 0);
>>> +
>>> +          rhs_subreg = simplify_gen_subreg (inner_mode, store_info->rhs,
>>> +                                            store_mode, sub_off);
>>> +          if (rhs_subreg)
>>> +            read_reg
>>> +              = extract_low_bits (read_mode, inner_mode, copy_rtx (rhs_subreg));
>>> +        }
>>> +
>>> +      if (read_reg == NULL)
>>> +        read_reg
>>> +          = find_shift_sequence (access_size, store_info, read_mode, shift,
>>> +                                 optimize_bb_for_speed_p (bb), require_cst);
>>
>> Did you consider doing this in find_shift_sequence instead?
>> ISTM that this is really using subregs to optimise:
>>
>>   /* In theory we could also check for an ashr.  Ian Taylor knows
>>      of one dsp where the cost of these two was not the same.  But
>>      this really is a rare case anyway.  */
>>   target = expand_binop (new_mode, lshr_optab, new_reg,
>>                          gen_int_shift_amount (new_mode, shift),
>>                          new_reg, 1, OPTAB_DIRECT);
>>
>> I think everything up to:
>>
>>   /* Also try a wider mode if the necessary punning is either not
>>      desirable or not possible.  */
>>   if (!CONSTANT_P (store_info->rhs)
>>       && !targetm.modes_tieable_p (new_mode, store_mode))
>>     continue;
>>
>> is either neutral or helpful for the subreg case too, so maybe
>> we could just add the optimisation after that.  (It probably isn't
>> worth reusing any of the later loop body code though, since the
>> subreg case is much simpler.)
>>
>> I don't think we need to restrict this case to modes of size
>> shift * 2.  We can just check whether the shift is a multiple of
>> the new_mode calculated by find_shift_sequence (using multiple_p).
>>
>> An easier way of converting the shift to a subreg byte offset
>> is to use subreg_offset_from_lsb, which also handles
>> BYTES_BIG_ENDIAN != WORDS_BIG_ENDIAN.
>
> Thanks, I've updated the patch by moving it into find_shift_sequence.
> Not sure whether it meets your comments precisely, though it still
> works. :)  There is a comment there that says:
>
>   /* Some machines like the x86 have shift insns for each size of
>      operand.  Other machines like the ppc or the ia-64 may only have
>      shift insns that shift values within 32 or 64 bit registers.
>      This loop tries to find the smallest shift insn that will right
>      justify the value we want to read but is available in one insn on
>      the machine.  */
>
> so without some additional check the loop breaks out early, because
> new_mode is TImode here:
>
>   if (GET_MODE_BITSIZE (new_mode) > BITS_PER_WORD)
>     break;
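(To see that early break concretely: a plain-C sketch of the loop's
width walk, not GCC's machmode code, where the constants are
assumptions for the PR71309 case on a 64-bit target.

  #include <stdio.h>

  #define BITS_PER_WORD 64   /* assumption: 64-bit target */

  int
  main (void)
  {
    /* access_size for PR71309: an 8-byte read at an 8-byte gap,
       so the smallest covering width is already 128 bits (TImode).  */
    int access_bits = 128;

    /* Walk candidate widths upwards from the smallest that covers
       the access, the way the real loop walks the integer modes.  */
    for (int new_bits = access_bits; new_bits <= 128; new_bits *= 2)
      {
        if (new_bits > BITS_PER_WORD)
          {
            /* Fires on the very first candidate, so neither the shift
               nor any subreg code is ever reached.  */
            printf ("give up at %d bits: wider than a word\n", new_bits);
            break;
          }
        printf ("try a %d-bit shift\n", new_bits);
      }
    return 0;
  }

The loop never gets past the first candidate, which is why the v2
patch below adds an extra condition to the break.)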
>
> [PATCH v2] dse: Remove partial load after full store for high part
> access [PR71309]
>
> This patch could optimize (works for char/short/int/void*):
>
>   6: r119:TI=[r118:DI+0x10]
>   7: [r118:DI]=r119:TI
>   8: r121:DI=[r118:DI+0x8]
>
> =>
>
>   6: r119:TI=[r118:DI+0x10]
>   18: r122:TI=r119:TI
>   16: r123:TI#0=r122:TI#8 0>>0
>   17: r123:TI#8=0
>   19: r124:DI=r123:TI#0
>   7: [r118:DI]=r119:TI
>   8: r121:DI=r124:DI
>
> The final asm then has no partial load after the full store
> (stxv+ld):
>
>   mr 9,3
>   ld 3,24(3)
>   ld 10,16(3)
>   std 3,8(9)
>   std 10,0(9)
>   blr
>
> It could achieve ~25% performance improvement for typical cases on
> Power9.  Bootstrapped and regression tested on Power9-LE.
>
> For AArch64, one ldr is replaced by a mov:
>
>   ldp x2, x3, [x0, 16]
>   stp x2, x3, [x0]
>   ldr x0, [x0, 8]
>
> =>
>
>   mov x1, x0
>   ldp x2, x0, [x0, 16]
>   stp x2, x0, [x1]
>
> gcc/ChangeLog:
>
> 2020-07-22  Xionghu Luo  <luo...@linux.ibm.com>
>
>         PR rtl-optimization/71309
>         * dse.c (find_shift_sequence): Use a subreg of the high part of
>         the stored register to avoid loading from the address.
>
> gcc/testsuite/ChangeLog:
>
> 2020-07-22  Xionghu Luo  <luo...@linux.ibm.com>
>
>         PR rtl-optimization/71309
>         * gcc.target/powerpc/pr71309.c: New test.
> ---
>  gcc/dse.c                                  | 15 +++++++++-
>  gcc/testsuite/gcc.target/powerpc/pr71309.c | 33 ++++++++++++++++++++++
>  2 files changed, 47 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr71309.c
>
> diff --git a/gcc/dse.c b/gcc/dse.c
> index bbe792e48e8..e06a9fbb0cd 100644
> --- a/gcc/dse.c
> +++ b/gcc/dse.c
> @@ -1736,7 +1736,8 @@ find_shift_sequence (poly_int64 access_size,
>        int cost;
>
>        new_mode = new_mode_iter.require ();
> -      if (GET_MODE_BITSIZE (new_mode) > BITS_PER_WORD)
> +      if (GET_MODE_BITSIZE (new_mode) > BITS_PER_WORD
> +          && !multiple_p (GET_MODE_BITSIZE (new_mode), shift))
>          break;
>
>        /* If a constant was stored into memory, try to simplify it here,
> @@ -1793,6 +1794,18 @@ find_shift_sequence (poly_int64 access_size,
>        shift_seq = get_insns ();
>        end_sequence ();
>
> +      /* Use subreg shifting from the high part to avoid a full store
> +         followed by a partial load.  See PR71309.  */
> +      if (multiple_p (GET_MODE_BITSIZE (new_mode), shift) && shift_seq)
> +        {
> +          new_lhs = extract_low_bits (new_mode, store_mode,
> +                                      copy_rtx (store_info->rhs));
> +          emit_move_insn (new_reg, new_lhs);
> +          emit_insn (shift_seq);
> +          read_reg = extract_low_bits (read_mode, new_mode, target);
> +          break;
> +        }
> +
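(The quoted diff trims off here and omits the body of the new
pr71309.c testcase, so the following is a purely hypothetical C
reproducer of the pattern behind the RTL above; all type and function
names are invented and the real testcase may differ.

  /* Hypothetical reproducer -- the quoted diff omits pr71309.c, so the
     names and layout here are guesses.  */
  typedef struct
  {
    long lo;   /* bytes 0..7  */
    long hi;   /* bytes 8..15, the "high part" on little-endian  */
  } pair_t;

  typedef struct
  {
    pair_t a;  /* bytes 0..15, the fully stored destination  */
    pair_t b;  /* bytes 16..31, the TImode source  */
  } node_t;

  long
  read_high (node_t *p)
  {
    p->a = p->b;      /* full 16-byte store:   7: [r118:DI]=r119:TI  */
    return p->a.hi;   /* partial 8-byte load:  8: r121:DI=[r118:DI+0x8]  */
  }

The struct copy is the full TImode store and the field read is the
partial DImode load that DSE wants to forward from the stored value.)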
This wasn't really what I meant.  Using subregs is fine, but I was
thinking of:

  /* Also try a wider mode if the necessary punning is either not
     desirable or not possible.  */
  if (!CONSTANT_P (store_info->rhs)
      && !targetm.modes_tieable_p (new_mode, store_mode))
    continue;

  if (multiple_p (shift, GET_MODE_BITSIZE (new_mode)))
    {
      /* Try to implement the shift using a subreg.  */
      poly_int64 offset = subreg_offset_from_lsb (new_mode, store_mode,
                                                  shift);
      rhs_subreg = simplify_gen_subreg (new_mode, store_info->rhs,
                                        store_mode, offset);
      if (rhs_subreg)
        {
          ...
          break;
        }
    }

where the rhs_subreg is from your original patch.

The multiple_p should be that way round: the shift needs to be a
multiple of the width of new_mode for the subreg to be valid.

I think this should also avoid the BITS_PER_WORD problem.  On the
other hand, I agree BITS_PER_WORD isn't a very sensible limit if
we're using subregs, so maybe moving that break check to after the
multiple_p if block would still make sense.

Thanks,
Richard
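(For reference: subreg_offset_from_lsb converts a shift counted from
the value's least significant bit into a subreg byte offset.  A
standalone model of that calculation, not GCC's implementation, and
only for the simple case where byte and word endianness agree:

  #include <assert.h>

  static int
  offset_from_lsb (int store_bytes, int read_bytes, int shift_bits,
                   int big_endian)
  {
    int lsb_byte = shift_bits / 8;  /* distance from the lsb, in bytes */
    return big_endian
           ? store_bytes - read_bytes - lsb_byte  /* lsb stored last  */
           : lsb_byte;                            /* lsb stored first */
  }

  int
  main (void)
  {
    /* TImode store (16 bytes), DImode read (8 bytes) of the high half,
       i.e. shift == 64: byte offset 8 on little-endian, 0 on big-endian.  */
    assert (offset_from_lsb (16, 8, 64, 0) == 8);
    assert (offset_from_lsb (16, 8, 64, 1) == 0);
    return 0;
  }

This matches the two BYTES_BIG_ENDIAN arms of the original
get_stored_val patch quoted at the top of the thread.)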