This patch is aimed at fixing test failures introduced by my 2013-12-07 change to bswapdi2_32bit: FAIL: gcc.target/powerpc/pr53199.c scan-assembler-times lwbrx 6 FAIL: gcc.target/powerpc/pr53199.c scan-assembler-times stwbrx 6
The 2013-12-07 change was necessary to make -m32 -mlra generate good code for pr53199.c:reg_reverse. Too many '?'s on the r->r alternative result in lra never choosing that option. Instead we get Z->r, ie. storing the input reg to a stack temp then using lwbrx from there. That means we have a load-hit-store flush with a severe slowdown. (See http://gcc.gnu.org/ml/gcc-patches/2013-12/msg00002.html for the corresponding -m64 result, a 4x slowdown.) A similar problem occurs with -m64 -mcpu=power7 -mlra due to bswapdi2_ldbrx having two '?'s on r->r. To fix this I ran into a whole lot of pain. reload and lra are quite different in their selection of insn alternatives. I could not find any combination of '?'s that generated the best code for both reload an lra on pr53199.c. To see why, it's necessary to look (from a great height) at how reload chooses amongst alternatives. A particular alternative gets a "loser" score, with the lowest "loser" being chosen. "loser" is incremented a) when an operand does not match its constraint. b) when an alternative has a '?' in the constraint. c) when a scratch register is required. d) when an early clobber output clashes with one of the inputs. a) is fairly obvious. For example, if we have a MEM when the operand alternative needs a reg, then we'll require a reload. b) is also quite obvious. Multiple '?'s accumulate. c) is a little more subtle. It bites you when alternatives require differing numbers of scratch registers. Take for example bswapdi2_64bit, which before this patch has three alternatives Z->r with 3 scratch regs, (Z is a subset of m) r->Z with 2 scratch regs, r->r with 3 scratch regs. All other things being equal, with reload you could correct this disparity by adding a '?' to the r->Z alternative. We might want to do that so that Z->r and r->r are the same "distance" apart as r->Z is from r->r. With lra it seems that scratch regs are weighted differently.. d) is also tricky, and a trap for anyone optimizing insn selection for functions like some in pr53199.c that have just one rtl insn with early clobbers. PowerPC generally returns function results in the same register as the first argument, so these hit the early clobber. Code elsewhere in larger functions probably won't. lra penalizes early clobbers differently to reload (a lot less). So, putting this all together.. Point (d) test implication is covered by the additional functions in pr53199.c. Avoiding early clobbers where possible is good since it reduces differences between reload and lra insn alternative costing. We also generate better code. I managed to do this for all the Z->r bswapdi patterns, but stopped short of changing r->r as well since at that point everything looked OK! Avoiding extra scratch registers for one alternative is also good. The op4 r->r scratch wasn't even used, and op4 for Z->r fairly easy to do without. Renaming word_high/low to word1/2 was to make it a little easier to track lifetime of addr1/2. Bootstrapped and regression tested powerpc64-linux. Output of pr53199.c inspected for sanity with -mcpu=power{6,7} -m{32,64} and {-mlra,}. OK to apply? gcc/ * config/rs6000/rs6000.md (bswapdi2): Remove one scratch reg. Modify Z->r bswapdi splitter to use dest in place of scratch. In r->Z and Z->r bswapdi splitter rename word_high, word_low to word1, word2 and rearrange logic to suit. (bswapdi2_64bit): Remove early clobber on Z->r alternative. (bswapdi2_ldbrx): Likewise. Remove '??' on r->r. (bswapdi2_32bit): Remove early clobber on Z->r alternative. Add one '?' on r->r. Modify Z->r splitter to avoid need for early clobber. gcc/testsuite/ * gcc.target/powerpc/pr53199.c: Add extra functions. Index: gcc/config/rs6000/rs6000.md =================================================================== --- gcc/config/rs6000/rs6000.md (revision 206009) +++ gcc/config/rs6000/rs6000.md (working copy) @@ -2344,8 +2344,7 @@ (bswap:DI (match_operand:DI 1 "reg_or_mem_operand" ""))) (clobber (match_scratch:DI 2 "")) - (clobber (match_scratch:DI 3 "")) - (clobber (match_scratch:DI 4 ""))])] + (clobber (match_scratch:DI 3 ""))])] "" { if (!REG_P (operands[0]) && !REG_P (operands[1])) @@ -2363,11 +2362,10 @@ ;; Power7/cell has ldbrx/stdbrx, so use it directly (define_insn "*bswapdi2_ldbrx" - [(set (match_operand:DI 0 "reg_or_mem_operand" "=&r,Z,??&r") + [(set (match_operand:DI 0 "reg_or_mem_operand" "=r,Z,&r") (bswap:DI (match_operand:DI 1 "reg_or_mem_operand" "Z,r,r"))) (clobber (match_scratch:DI 2 "=X,X,&r")) - (clobber (match_scratch:DI 3 "=X,X,&r")) - (clobber (match_scratch:DI 4 "=X,X,&r"))] + (clobber (match_scratch:DI 3 "=X,X,&r"))] "TARGET_POWERPC64 && TARGET_LDBRX && (REG_P (operands[0]) || REG_P (operands[1]))" "@ @@ -2379,11 +2377,10 @@ ;; Non-power7/cell, fall back to use lwbrx/stwbrx (define_insn "*bswapdi2_64bit" - [(set (match_operand:DI 0 "reg_or_mem_operand" "=&r,Z,&r") + [(set (match_operand:DI 0 "reg_or_mem_operand" "=r,Z,&r") (bswap:DI (match_operand:DI 1 "reg_or_mem_operand" "Z,r,r"))) (clobber (match_scratch:DI 2 "=&b,&b,&r")) - (clobber (match_scratch:DI 3 "=&r,&r,&r")) - (clobber (match_scratch:DI 4 "=&r,X,&r"))] + (clobber (match_scratch:DI 3 "=&r,&r,&r"))] "TARGET_POWERPC64 && !TARGET_LDBRX && (REG_P (operands[0]) || REG_P (operands[1])) && !(MEM_P (operands[0]) && MEM_VOLATILE_P (operands[0])) @@ -2395,8 +2392,7 @@ [(set (match_operand:DI 0 "gpc_reg_operand" "") (bswap:DI (match_operand:DI 1 "indexed_or_indirect_operand" ""))) (clobber (match_operand:DI 2 "gpc_reg_operand" "")) - (clobber (match_operand:DI 3 "gpc_reg_operand" "")) - (clobber (match_operand:DI 4 "gpc_reg_operand" ""))] + (clobber (match_operand:DI 3 "gpc_reg_operand" ""))] "TARGET_POWERPC64 && !TARGET_LDBRX && reload_completed" [(const_int 0)] " @@ -2405,15 +2401,14 @@ rtx src = operands[1]; rtx op2 = operands[2]; rtx op3 = operands[3]; - rtx op4 = operands[4]; rtx op3_32 = simplify_gen_subreg (SImode, op3, DImode, BYTES_BIG_ENDIAN ? 4 : 0); - rtx op4_32 = simplify_gen_subreg (SImode, op4, DImode, - BYTES_BIG_ENDIAN ? 4 : 0); + rtx dest_32 = simplify_gen_subreg (SImode, dest, DImode, + BYTES_BIG_ENDIAN ? 4 : 0); rtx addr1; rtx addr2; - rtx word_high; - rtx word_low; + rtx word1; + rtx word2; addr1 = XEXP (src, 0); if (GET_CODE (addr1) == PLUS) @@ -2438,29 +2433,29 @@ addr2 = gen_rtx_PLUS (Pmode, op2, addr1); } + word1 = change_address (src, SImode, addr1); + word2 = change_address (src, SImode, addr2); + if (BYTES_BIG_ENDIAN) { - word_high = change_address (src, SImode, addr1); - word_low = change_address (src, SImode, addr2); + emit_insn (gen_bswapsi2 (op3_32, word2)); + emit_insn (gen_bswapsi2 (dest_32, word1)); } else { - word_high = change_address (src, SImode, addr2); - word_low = change_address (src, SImode, addr1); + emit_insn (gen_bswapsi2 (op3_32, word1)); + emit_insn (gen_bswapsi2 (dest_32, word2)); } - emit_insn (gen_bswapsi2 (op3_32, word_low)); - emit_insn (gen_bswapsi2 (op4_32, word_high)); - emit_insn (gen_ashldi3 (dest, op3, GEN_INT (32))); - emit_insn (gen_iordi3 (dest, dest, op4)); + emit_insn (gen_ashldi3 (op3, op3, GEN_INT (32))); + emit_insn (gen_iordi3 (dest, dest, op3)); }") (define_split [(set (match_operand:DI 0 "indexed_or_indirect_operand" "") (bswap:DI (match_operand:DI 1 "gpc_reg_operand" ""))) (clobber (match_operand:DI 2 "gpc_reg_operand" "")) - (clobber (match_operand:DI 3 "gpc_reg_operand" "")) - (clobber (match_operand:DI 4 "" ""))] + (clobber (match_operand:DI 3 "gpc_reg_operand" ""))] "TARGET_POWERPC64 && !TARGET_LDBRX && reload_completed" [(const_int 0)] " @@ -2475,8 +2470,8 @@ BYTES_BIG_ENDIAN ? 4 : 0); rtx addr1; rtx addr2; - rtx word_high; - rtx word_low; + rtx word1; + rtx word2; addr1 = XEXP (dest, 0); if (GET_CODE (addr1) == PLUS) @@ -2501,27 +2496,28 @@ addr2 = gen_rtx_PLUS (Pmode, op2, addr1); } + word1 = change_address (dest, SImode, addr1); + word2 = change_address (dest, SImode, addr2); + emit_insn (gen_lshrdi3 (op3, src, GEN_INT (32))); + if (BYTES_BIG_ENDIAN) { - word_high = change_address (dest, SImode, addr1); - word_low = change_address (dest, SImode, addr2); + emit_insn (gen_bswapsi2 (word1, src_si)); + emit_insn (gen_bswapsi2 (word2, op3_si)); } else { - word_high = change_address (dest, SImode, addr2); - word_low = change_address (dest, SImode, addr1); + emit_insn (gen_bswapsi2 (word2, src_si)); + emit_insn (gen_bswapsi2 (word1, op3_si)); } - emit_insn (gen_bswapsi2 (word_high, src_si)); - emit_insn (gen_bswapsi2 (word_low, op3_si)); }") (define_split [(set (match_operand:DI 0 "gpc_reg_operand" "") (bswap:DI (match_operand:DI 1 "gpc_reg_operand" ""))) (clobber (match_operand:DI 2 "gpc_reg_operand" "")) - (clobber (match_operand:DI 3 "gpc_reg_operand" "")) - (clobber (match_operand:DI 4 "" ""))] + (clobber (match_operand:DI 3 "gpc_reg_operand" ""))] "TARGET_POWERPC64 && reload_completed" [(const_int 0)] " @@ -2544,7 +2540,7 @@ }") (define_insn "bswapdi2_32bit" - [(set (match_operand:DI 0 "reg_or_mem_operand" "=&r,Z,&r") + [(set (match_operand:DI 0 "reg_or_mem_operand" "=r,Z,?&r") (bswap:DI (match_operand:DI 1 "reg_or_mem_operand" "Z,r,r"))) (clobber (match_scratch:SI 2 "=&b,&b,X"))] "!TARGET_POWERPC64 && (REG_P (operands[0]) || REG_P (operands[1]))" @@ -2573,7 +2569,8 @@ if (GET_CODE (addr1) == PLUS) { emit_insn (gen_add3_insn (op2, XEXP (addr1, 0), GEN_INT (4))); - if (TARGET_AVOID_XFORM) + if (TARGET_AVOID_XFORM + || REGNO (XEXP (addr1, 1)) == REGNO (dest2)) { emit_insn (gen_add3_insn (op2, XEXP (addr1, 1), op2)); addr2 = op2; @@ -2581,7 +2578,8 @@ else addr2 = gen_rtx_PLUS (SImode, op2, XEXP (addr1, 1)); } - else if (TARGET_AVOID_XFORM) + else if (TARGET_AVOID_XFORM + || REGNO (addr1) == REGNO (dest2)) { emit_insn (gen_add3_insn (op2, addr1, GEN_INT (4))); addr2 = op2; @@ -2596,6 +2594,8 @@ word2 = change_address (src, SImode, addr2); emit_insn (gen_bswapsi2 (dest2, word1)); + /* The REGNO (dest2) tests above ensure that addr2 has not been trashed, + thus allowing us to omit an early clobber on the output. */ emit_insn (gen_bswapsi2 (dest1, word2)); }") Index: gcc/testsuite/gcc.target/powerpc/pr53199.c =================================================================== --- gcc/testsuite/gcc.target/powerpc/pr53199.c (revision 206009) +++ gcc/testsuite/gcc.target/powerpc/pr53199.c (working copy) @@ -1,7 +1,7 @@ /* { dg-do compile { target { powerpc*-*-* } } } */ /* { dg-skip-if "" { powerpc*-*-darwin* } { "*" } { "" } } */ /* { dg-options "-O2 -mcpu=power6 -mavoid-indexed-addresses" } */ -/* { dg-final { scan-assembler-times "lwbrx" 6 } } */ +/* { dg-final { scan-assembler-times "lwbrx" 12 } } */ /* { dg-final { scan-assembler-times "stwbrx" 6 } } */ /* PR 51399: bswap gets an error if -mavoid-indexed-addresses was used in @@ -25,6 +25,24 @@ return __builtin_bswap64 (p[i]); } +long long +load64_reverse_4 (long long dummy __attribute__ ((unused)), long long *p) +{ + return __builtin_bswap64 (*p); +} + +long long +load64_reverse_5 (long long dummy __attribute__ ((unused)), long long *p) +{ + return __builtin_bswap64 (p[1]); +} + +long long +load64_reverse_6 (long long dummy __attribute__ ((unused)), long long *p, int i) +{ + return __builtin_bswap64 (p[i]); +} + void store64_reverse_1 (long long *p, long long x) { @@ -44,7 +62,13 @@ } long long -reg_reverse (long long x) +reg_reverse_1 (long long x) { return __builtin_bswap64 (x); } + +long long +reg_reverse_2 (long long dummy __attribute__ ((unused)), long long x) +{ + return __builtin_bswap64 (x); +} -- Alan Modra Australia Development Lab, IBM