Hi, On 2020/10/27 05:10, Segher Boessenkool wrote: > On Wed, Oct 21, 2020 at 03:25:29AM -0500, Xionghu Luo wrote: >> Don't split code from add<mode>3 for SDI to allow a later pass to split. > > This is very problematic. > >> This allows later logic to hoist out constant load in add instructions. > > Later logic should be able to do that any way (I do not say that works > perfectly, mind; it no doubt could be improved). > >> In loop, lis+ori could be hoisted out to improve performance compared with >> previous addis+addi (About 15% on typical case), weak point is >> one more register is used and one more instruction is generated. i.e.: > > Yes, better performance on one testcase, and worse code always :-( > >> addis 3,3,0x6765 >> addi 3,3,0x4321 >> >> => >> >> lis 9,0x6765 >> ori 9,9,0x4321 >> add 3,3,9 > > This is the typical kind of clumsy code you get if you generate RTL that > matches actual machine instructions too late ("split too late"). > > So, please make it possible to hoist 2-insn-immediate sequences out of > loops, *without* changing them to fake 1-insn things. >
As we discussed offline, addis+addi is not quite possible to be hoisted out of loops as not invariant, update the patch as below, thanks: [PATCH v2] rs6000: Split constant operator add in split1 instead of expander Currently, ADD with positive 32bit constant is split to addis+addi in expander, which seems too early to optimize the constant load out of loop compared with other targets. This patch use a temp register to load the constant and do two register addition in expander same as negative 32bit constant add. This allows loop invariant pass to hoist out constant load before add instructions, then split1 pass will split the load to lis+ori after combine. Performance could be improved by 15% on typical case compared with previous addis+addi in loop. (1) 0x67654321 addis 3,3,0x6765 addi 3,3,0x4321 => lis 9,0x6765 ori 9,9,0x4321 add 3,3,9 (2) 0x00008fff addis 9,9,0x1 addi 3,9,-28673 => li 10,0 ori 10,10,0x8fff add 3,3,10 Regression and bootstrap tested pass on P8LE. gcc/ChangeLog: 2020-10-21 Xiong Hu Luo <luo...@linux.ibm.com> * config/rs6000/rs6000.md (add<mode>3 for SDI): Don't split before reload, move constant to temp register for add. (define_split): Split const from split1. gcc/testsuite/ChangeLog: 2020-10-21 Xiong Hu Luo <luo...@linux.ibm.com> * gcc.target/powerpc/add-const.c: New test. --- gcc/config/rs6000/rs6000.md | 38 ++++++++++++-------- gcc/testsuite/gcc.target/powerpc/add-const.c | 18 ++++++++++ 2 files changed, 41 insertions(+), 15 deletions(-) create mode 100644 gcc/testsuite/gcc.target/powerpc/add-const.c diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md index 5e5ad9f7c3d..b52e9555962 100644 --- a/gcc/config/rs6000/rs6000.md +++ b/gcc/config/rs6000/rs6000.md @@ -1750,18 +1750,26 @@ (define_expand "add<mode>3" if (CONST_INT_P (operands[2]) && !add_operand (operands[2], <MODE>mode)) { - rtx tmp = ((!can_create_pseudo_p () - || rtx_equal_p (operands[0], operands[1])) - ? operands[0] : gen_reg_rtx (<MODE>mode)); - - /* Adding a constant to r0 is not a valid insn, so use a different - strategy in that case. */ - if (reg_or_subregno (operands[1]) == 0 || reg_or_subregno (tmp) == 0) + bool reg0 = reg_or_subregno (operands[0]) == 0; + if (can_create_pseudo_p () || reg0) { - if (operands[0] == operands[1]) - FAIL; - rs6000_emit_move (operands[0], operands[2], <MODE>mode); - emit_insn (gen_add<mode>3 (operands[0], operands[1], operands[0])); + + rtx tmp = (!can_create_pseudo_p () + || rtx_equal_p (operands[0], operands[1])) + ? operands[0] : gen_reg_rtx (<MODE>mode); + + /* Adding a constant to r0 is not a valid insn, so use a different + strategy in that case. See stack-limit.c, need generate + "24: %0:DI=0x20fa0; 25: %0:DI=%14:DI+%0:DI" in pro_and_epilogue + when can_create_pseudo_p is false. */ + if (reg0 == 0 || reg_or_subregno (tmp) == 0) + { + if (operands[0] == operands[1]) + FAIL; + } + + rs6000_emit_move (tmp, operands[2], <MODE>mode); + emit_insn (gen_add<mode>3 (operands[0], operands[1], tmp)); DONE; } @@ -1775,8 +1783,8 @@ (define_expand "add<mode>3" /* The ordering here is important for the prolog expander. When space is allocated from the stack, adding 'low' first may produce a temporary deallocation (which would be bad). */ - emit_insn (gen_add<mode>3 (tmp, operands[1], GEN_INT (rest))); - emit_insn (gen_add<mode>3 (operands[0], tmp, GEN_INT (low))); + emit_insn (gen_add<mode>3 (operands[0], operands[1], GEN_INT (rest))); + emit_insn (gen_add<mode>3 (operands[0], operands[0], GEN_INT (low))); DONE; } }) @@ -9118,7 +9126,7 @@ (define_split ;; When non-easy constants can go in the TOC, this should use ;; easy_fp_constant predicate. (define_split - [(set (match_operand:DI 0 "int_reg_operand_not_pseudo") + [(set (match_operand:DI 0 "int_reg_operand") (match_operand:DI 1 "const_int_operand"))] "TARGET_POWERPC64 && num_insns_constant (operands[1], DImode) > 1" [(set (match_dup 0) (match_dup 2)) @@ -9131,7 +9139,7 @@ (define_split }) (define_split - [(set (match_operand:DI 0 "int_reg_operand_not_pseudo") + [(set (match_operand:DI 0 "int_reg_operand") (match_operand:DI 1 "const_scalar_int_operand"))] "TARGET_POWERPC64 && num_insns_constant (operands[1], DImode) > 1" [(set (match_dup 0) (match_dup 2)) diff --git a/gcc/testsuite/gcc.target/powerpc/add-const.c b/gcc/testsuite/gcc.target/powerpc/add-const.c new file mode 100644 index 00000000000..3ffdbcfab89 --- /dev/null +++ b/gcc/testsuite/gcc.target/powerpc/add-const.c @@ -0,0 +1,18 @@ +/* { dg-do compile { target lp64 } } */ +/* { dg-options "-O3 -fno-unroll-loops" } */ + +/* Ensure the lis,ori are generated, which indicates they have + been hoisted outside of the loop. */ + +typedef unsigned long ulong; +ulong +foo (ulong n, ulong h) +{ + int i; + for (i = 0; i < n; i++) + h = ((h + 8) | h) + 0x67654321; + return h; +} + +/* { dg-final { scan-assembler-times {\mlis\M} 1 } } */ +/* { dg-final { scan-assembler-times {\mori\M} 1 } } */ -- 2.25.1