https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70048
--- Comment #16 from Jiong Wang <jiwang at gcc dot gnu.org> ---
(In reply to Richard Henderson from comment #13)
> Created attachment 37911 [details]
> aggressive patch

Cool! Thanks very much for experimenting with this thoughtful new aggressive
direction. But there is a performance issue, as described at
https://gcc.gnu.org/ml/gcc-patches/2016-02/msg00281.html:

  "this patch forces register scaling expression out of memory ref, so that
  RTL CSE pass can handle common register scaling expressions"

This is particularly performance critical when a group of instructions inside
a hot loop use the same "scaled register": CSE can then remove the redundant
calculations. r233136 fixed this by forcing the scale part into a register,
turning something like:

  ldr dst1, [reg_base1, reg_index, #lsl 1]
  ldr dst2, [reg_base2, reg_index, #lsl 1]
  ldr dst3, [reg_base3, reg_index, #lsl 1]

into:

  reg_index = reg_index << 1;
  ldr dst1, [reg_base1, reg_index]
  ldr dst2, [reg_base2, reg_index]
  ldr dst3, [reg_base3, reg_index]

For the aggressive patch at #c13
===
In TARGET_LEGITIMATE_ADDRESS_P, aarch64_classify_index allows a
register-scaled offset through those "MUL" checks. That is correct in itself,
but an address like [reg, reg, #lsl 1] is then judged valid and stays hidden
behind the MEM, invisible to RTL passes until quite a late stage.

For the second patch at #c10
===
I found that op1 is *not* forced into a register when op0 is a virtual,
eliminable register. Given the performance issue mentioned above, I think we
should always force op1 into a register to expose the scaled calculation to
GCC.

I can confirm the following:

* Both the patches at #c13 and #c10 cause a performance regression and reopen
  the issue r233136 fixed.
* For the aggressive patch at #c13, if we explicitly don't allow
  (plus reg, (mul reg, imm)) in aarch64_classify_index, then there is no
  performance regression.
* For the second patch at #c10, if we always do
  "op1 = force_operand (op1, NULL_RTX);" no matter whether op0 is virtual and
  eliminable or not, then there is no performance regression either. Even
  better, there is a significant performance improvement over current trunk
  on one benchmark.

If the regression is fixed, I feel the second patch at #c10 looks better,
should the decision be to fix this for GCC 6, as I feel the earlier RTL
details are exposed to RTL passes, the better.