On Fri, Nov 29, 2013 at 12:03 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Fri, Nov 29, 2013 at 6:44 PM, Richard Biener
> <richard.guent...@gmail.com> wrote:
>> On Fri, Nov 29, 2013 at 8:52 AM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>> On Thu, Nov 28, 2013 at 8:06 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>>> On Thu, Nov 28, 2013 at 6:48 PM, Richard Earnshaw <rearn...@arm.com> wrote:
>>>>> On 18/09/13 10:15, bin.cheng wrote:
>>>>>>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-
>>>>>>> ow...@gcc.gnu.org] On Behalf Of bin.cheng
>>>>>>> Sent: Monday, September 02, 2013 3:09 PM
>>>>>>> To: Richard Earnshaw
>>>>>>> Cc: gcc-patches@gcc.gnu.org
>>>>>>> Subject: RE: [PATCH ARM]Refine scaled address expression on ARM
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Richard Earnshaw
>>>>>>>> Sent: Thursday, August 29, 2013 9:06 PM
>>>>>>>> To: Bin Cheng
>>>>>>>> Cc: gcc-patches@gcc.gnu.org
>>>>>>>> Subject: Re: [PATCH ARM]Refine scaled address expression on ARM
>>>>>>>>
>>>>>>>> On 28/08/13 08:00, bin.cheng wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> This patch refines scaled address expressions on ARM.  It supports
>>>>>>>>> "base + index * scale" in arm_legitimate_address_outer_p.  It also
>>>>>>>>> tries to legitimize "base + index * scale + offset" as "reg <- base
>>>>>>>>> + offset; reg + index * scale" by introducing
>>>>>>>>> thumb2_legitimize_address.  For now, thumb2_legitimize_address is a
>>>>>>>>> placeholder that just performs the transformation above by calling
>>>>>>>>> try_multiplier_address; hopefully we can improve it in the future.
>>>>>>>>>
>>>>>>>>> With this patch:
>>>>>>>>> 1) "base + index * scale" is recognized.
>>>>>>>>
>>>>>>>> That's because (PLUS (REG) (MULT (REG) (CONST))) is not canonical
>>>>>>>> form, so this shouldn't be necessary.  Can you identify where this
>>>>>>>> non-canonical form is being generated?
>>>>>>>>
>>>>>>>
>>>>>>> Oh, for now ivopts constructs "index * scale" to test whether the
>>>>>>> backend supports scaled addressing modes; that form is not valid on
>>>>>>> ARM, so I was going to construct "base + index * scale" instead.
>>>>>>> Since "base + index * scale" is not canonical form, I will construct
>>>>>>> the canonical form and drop this part of the patch.
>>>>>>>
>>>>>>> Is the rest of this patch OK?
>>>>>>>
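>>>>>>> As a rough sketch of what I plan instead (illustration only, not
>>>>>>> part of the patch; index_reg, base_reg and scale are made-up names
>>>>>>> here), the canonical form puts the MULT first in the PLUS:
>>>>>>>
>>>>>>>   /* Canonical: (plus (mult (reg) (const_int)) (reg)), rather than
>>>>>>>      (plus (reg) (mult (reg) (const_int))).  */
>>>>>>>   addr = gen_rtx_PLUS (Pmode,
>>>>>>>                        gen_rtx_MULT (Pmode, index_reg,
>>>>>>>                                      GEN_INT (scale)),
>>>>>>>                        base_reg);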
>>>>>> Hi Richard, I removed the part you were concerned about and created
>>>>>> this updated patch.
>>>>>>
>>>>>> Is it OK?
>>>>>>
>>>>>> Thanks.
>>>>>> bin
>>>>>>
>>>>>> 2013-09-18  Bin Cheng  <bin.ch...@arm.com>
>>>>>>
>>>>>> 	* config/arm/arm.c (try_multiplier_address): New function.
>>>>>> 	(thumb2_legitimize_address): New function.
>>>>>> 	(arm_legitimize_address): Call try_multiplier_address and
>>>>>> 	thumb2_legitimize_address.
>>>>>>
>>>>>>
>>>>>> 6-arm-scaled_address-20130918.txt
>>>>>>
>>>>>>
>>>>>> Index: gcc/config/arm/arm.c
>>>>>> ===================================================================
>>>>>> --- gcc/config/arm/arm.c	(revision 200774)
>>>>>> +++ gcc/config/arm/arm.c	(working copy)
>>>>>> @@ -6652,6 +6654,106 @@ legitimize_tls_address (rtx x, rtx reg)
>>>>>>      }
>>>>>>  }
>>>>>>
>>>>>> +/* Try to find address expression like base + index * scale + offset
>>>>>> +   in X.  If we find one, force base + offset into register and
>>>>>> +   construct new expression reg + index * scale; return the new
>>>>>> +   address expression if it's valid.  Otherwise return X.  */
>>>>>> +static rtx
>>>>>> +try_multiplier_address (rtx x, enum machine_mode mode ATTRIBUTE_UNUSED)
>>>>>> +{
>>>>>> +  rtx tmp, base_reg, new_rtx;
>>>>>> +  rtx base = NULL_RTX, index = NULL_RTX, scale = NULL_RTX, offset = NULL_RTX;
>>>>>> +
>>>>>> +  gcc_assert (GET_CODE (x) == PLUS);
>>>>>> +
>>>>>> +  /* Try to find and record base/index/scale/offset in X.  */
>>>>>> +  if (GET_CODE (XEXP (x, 1)) == MULT)
>>>>>> +    {
>>>>>> +      tmp = XEXP (x, 0);
>>>>>> +      index = XEXP (XEXP (x, 1), 0);
>>>>>> +      scale = XEXP (XEXP (x, 1), 1);
>>>>>> +      if (GET_CODE (tmp) != PLUS)
>>>>>> +	return x;
>>>>>> +
>>>>>> +      base = XEXP (tmp, 0);
>>>>>> +      offset = XEXP (tmp, 1);
>>>>>> +    }
>>>>>> +  else
>>>>>> +    {
>>>>>> +      tmp = XEXP (x, 0);
>>>>>> +      offset = XEXP (x, 1);
>>>>>> +      if (GET_CODE (tmp) != PLUS)
>>>>>> +	return x;
>>>>>> +
>>>>>> +      base = XEXP (tmp, 0);
>>>>>> +      scale = XEXP (tmp, 1);
>>>>>> +      if (GET_CODE (base) == MULT)
>>>>>> +	{
>>>>>> +	  tmp = base;
>>>>>> +	  base = scale;
>>>>>> +	  scale = tmp;
>>>>>> +	}
>>>>>> +      if (GET_CODE (scale) != MULT)
>>>>>> +	return x;
>>>>>> +
>>>>>> +      index = XEXP (scale, 0);
>>>>>> +      scale = XEXP (scale, 1);
>>>>>> +    }
>>>>>> +
>>>>>> +  if (CONST_INT_P (base))
>>>>>> +    {
>>>>>> +      tmp = base;
>>>>>> +      base = offset;
>>>>>> +      offset = tmp;
>>>>>> +    }
>>>>>> +
>>>>>> +  if (CONST_INT_P (index))
>>>>>> +    {
>>>>>> +      tmp = index;
>>>>>> +      index = scale;
>>>>>> +      scale = tmp;
>>>>>> +    }
>>>>>> +
>>>>>> +  /* ARM only supports constant scale in address.  */
>>>>>> +  if (!CONST_INT_P (scale))
>>>>>> +    return x;
>>>>>> +
>>>>>> +  if (GET_MODE (base) != SImode || GET_MODE (index) != SImode)
>>>>>> +    return x;
>>>>>> +
>>>>>> +  /* Only register/constant are allowed in each part.  */
>>>>>> +  if (!symbol_mentioned_p (base)
>>>>>> +      && !symbol_mentioned_p (offset)
>>>>>> +      && !symbol_mentioned_p (index)
>>>>>> +      && !symbol_mentioned_p (scale))
>>>>>> +    {
>>>>>
>>>>> It would be easier to do this at the top of the function --
>>>>>   if (symbol_mentioned_p (x))
>>>>>     return x;
>>>>>
>>>>>
>>>>>> +      /* Force "base+offset" into register and construct
>>>>>> +	 "register+index*scale".  Return the new expression
>>>>>> +	 only if it's valid.  */
>>>>>> +      tmp = gen_rtx_PLUS (SImode, base, offset);
>>>>>> +      base_reg = force_reg (SImode, tmp);
>>>>>> +      tmp = gen_rtx_fmt_ee (MULT, SImode, index, scale);
>>>>>> +      new_rtx = gen_rtx_PLUS (SImode, base_reg, tmp);
>>>>>> +      return new_rtx;
>>>>>
>>>>> I can't help thinking that this is backwards.  That is, you want to
>>>>> split out the mult expression and use offset addressing in the
>>>>> addresses themselves.  That's likely to lead to either better CSE, or
>>>>> more induction
>>>> Thanks for your review.
>>>>
>>>> Actually, base+offset is more likely to be loop invariant than the
>>>> scaled expression, just as reported in PR57540.  The bug is observed
>>>> in spec2k bzip2/gzip and hurts ARM in hot loops.  The loop induction
>>>> variable doesn't matter that much compared to the invariant because
>>>> we are in RTL now.
>>>>
>>>>> vars.  Furthermore, scaled addressing modes are likely to be more
>>>>> expensive than simple offset addressing modes on at least some cores.
>>>> The purpose is to CSE (as you pointed out) or hoist the loop
>>>> invariant.  As far as the addressing mode is concerned, though we
>>>> could guide the transformation once we have a reliable address cost
>>>> model, the backend has no information about whether base+offset is
>>>> invariant, so it's difficult to handle there, right?
>>>>
>>>> What do you think about this?
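>>>> To make this concrete, here is a made-up reduction of the PR57540
>>>> situation (my own illustration, not the testcase from the PR):
>>>>
>>>>   /* "p + offset" is loop invariant; only i changes.  Expanding the
>>>>      address as (p + i * 4) + offset recomputes the full sum in every
>>>>      iteration, while (p + offset) + i * 4 exposes the invariant part
>>>>      to CSE and hoisting.  */
>>>>   int
>>>>   foo (char *p, int offset, int n)
>>>>   {
>>>>     int i, s = 0;
>>>>     for (i = 0; i < n; i++)
>>>>       s += *(int *) (p + i * 4 + offset);
>>>>     return s;
>>>>   }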
>>>>
>>>
>>> Additionally, there is no induction variable issue here.  The memory
>>> references we face during expand were not lowered by gimple IVOPTs,
>>> which means either they are outside any loop, or their addresses do
>>> not involve induction variables.
>>>
>>> Generally, there are three passes which lower memory references:
>>> gimple strength reduction (SLSR) picks up opportunities outside
>>> loops; gimple IVOPTs handles references whose addresses involve
>>> induction variables; and references handled by neither SLSR nor
>>> IVOPTs are left to RTL expand, which makes bad choices when
>>> re-associating addresses.  I think Yufeng also noticed the problem
>>> and is trying patches like:
>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg02878.html
>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03339.html
>>>
>>> After thinking twice, I rather think we should not re-associate
>>> addresses during expand, because of the lack of context information.
>>> Take base + scaled_index + offset from PR57540 as an example: we just
>>> can't tell whether "base+offset" is loop invariant from either the
>>> backend or the RTL expander.
>>>
>>> Any comments?
>>
>> The issue is that re-association and address lowering are not
>> integrated with optimizers such as CSE and invariant motion.  This
>> means it's hard to arrive at optimal results, and all passes are just
>> "guessing", applying heuristics that the programmer thought were
>> appropriate for his machine.
>>
>> I have no easy way out, but I think we lower addresses too late (at
>> least the full lowering).  Loop optimizers would like to see the
>> complex memory reference forms, but starting with IVOPTs, lowering
>> all addresses would likely be a win in the end (there are CSE, LIM
>> and re-association passes around after/at this point as well).
>>
>> Back in the 4.3 days, when I first attempted to introduce MEM_REF, I
>> forcefully lowered all memory accesses and addresses very early
>> (during gimplification, basically).  That didn't work out well.
>>
>> So my suggestion to anyone who wants to try something is to apply an
>> additional lowering to the whole of GIMPLE, lowering things like
>> handled-component references and addresses (and also things like
>> bitfield accesses).
> Yes, once we are in the expander there is not enough information.
> Since we don't need to handle references with induction variable
> addresses after IVOPTs, it's not difficult to take invariants into
> consideration when re-associating.  Yet I have no clue how CSE should
> be considered.

You'd need to detect that (a + b) * c can be computed differently if
both a * c and b * c are available, for example.  There is even a PR
about this somewhere.
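A contrived sketch of the kind of opportunity I mean (my example, not
the one from that PR):

  /* If a * c and b * c have already been computed, (a + b) * c is
     cheaper as a_c + b_c than as a fresh add-and-multiply; a
     re-association pass that knows which expressions are available
     could pick the second form.  */
  int a_c = a * c;
  int b_c = b * c;
  /* ... both still available here ...  */
  int x = (a + b) * c;   /* could instead be a_c + b_c */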
> Nevertheless, forcing "base + scaled*index + offset" into a register
> and using register-indirect addressing at expand time, as in PR57540,
> is not good.

Sure ... this is why we have IVOPTs and SLSR.  But both would work OK
with lowered input, I think, and would simply build up a "leveled" IL
suitable for the target.  IVOPTs also does some simple CSE
considerations, AFAIK.  In the end, what matters for speed is loops,
after all ;)

Richard.

> Thanks,
> bin
>
> --
> Best Regards.