https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70048
--- Comment #16 from Jiong Wang <jiwang at gcc dot gnu.org> ---
(In reply to Richard Henderson from comment #13)
> Created attachment 37911 [details]
> aggressive patch

Cool! Thanks very much for experimenting with this thoughtful new aggressive
direction. But there is a performance issue, as described at
https://gcc.gnu.org/ml/gcc-patches/2016-02/msg00281.html:

  "this patch forces register scaling expression out of memory ref, so that
  RTL CSE pass can handle common register scaling expressions"

This is particularly performance critical when a group of instructions inside
a hot loop use the same "scaled register": CSE can then remove the redundant
calculations. r233136 fixed this by forcing the scale part into a register,
turning something like:

  ldr dst1, [reg_base1, reg_index, #lsl 1]
  ldr dst2, [reg_base2, reg_index, #lsl 1]
  ldr dst3, [reg_base3, reg_index, #lsl 1]

into:

  reg_index = reg_index << 1;
  ldr dst1, [reg_base1, reg_index]
  ldr dst2, [reg_base2, reg_index]
  ldr dst3, [reg_base3, reg_index]

For the aggressive patch at #c13
===
In TARGET_LEGITIMATE_ADDRESS_P, aarch64_classify_index allows a
register-scaled offset through those "MUL" checks. That is correct in itself,
but an address like [reg, reg, #lsl 1] is then judged valid and stays hidden
behind the MEM, invisible to RTL passes until quite a late stage.

For the second patch at #c10
===
I found that op1 is *not* forced into a register when op0 is a virtual,
eliminable register. Given the performance issue mentioned above, I think we
should always force op1 into a register to expose the scaled calculation to
GCC.

I can confirm the following:

* Both the patches at #c13 and #c10 cause a performance regression and reopen
  the issue r233136 fixed.
* For the aggressive patch at #c13, if we explicitly don't allow
  (plus reg, (mul reg, imm)) in aarch64_classify_index, then there is no
  performance regression.
* For the second patch at #c10, if we always do
  "op1 = force_operand (op1, NULL_RTX);" no matter whether op0 is virtual and
  eliminable or not, then there is no performance regression either. Even
  better, there is a significant performance improvement over current trunk
  on one benchmark.

If the regression is fixed, I feel the second patch at #c10 looks better,
should the decision be to fix this for GCC 6, as I feel the earlier RTL
details are exposed to RTL passes, the better.