FAIL: gcc.target/powerpc/ti_math1.c scan-assembler-times adde 1

is seen on powerpc64le-linux since somewhere between revision 218587
and 218616.  See
https://gcc.gnu.org/ml/gcc-testresults/2014-12/msg01287.html and
https://gcc.gnu.org/ml/gcc-testresults/2014-12/msg01325.html
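For reference, the testcase boils down to a 128-bit add through a
pointer; a minimal sketch of what add_128 looks like, reconstructed
from the generated code below, so the exact testsuite source may
differ:

/* Reconstructed sketch, not the exact ti_math1.c source.  */
typedef __int128 TItype;

TItype
add_128 (TItype *ptr, TItype val)
{
  return *ptr + val;
}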
A regression hunt fingers one of Segher's 2014-12-10 patches to the
rs6000 backend, git commit 0f1bedb4 or svn rev 218595.  Segher might
argue that the generated code is better after this commit, and I'd
agree that his change is a good one in general, but even so it would
be nice to generate the ideal code.  Curiously, the ideal code is
generated at -O1, but we regress at -O2.

before            after             ideal (-O1)
add_128:          add_128:          add_128:
  ld 10,0(3)        ld 9,0(3)         ld 9,0(3)
  ld 11,8(3)        ld 10,8(3)        ld 10,8(3)
  addc 8,4,10       addc 3,4,9        addc 3,4,9
  adde 9,5,11       addze 5,5         adde 4,5,10
  mr 3,8            add 4,5,10        blr
  mr 4,9            blr
  blr

I went looking into where the addze appeared, and found combine.

Trying 18, 9 -> 24:
Failed to match this instruction:
(set (reg:DI 4 4 [+8 ])
    (plus:DI (plus:DI (reg:DI 5 5 [ val+8 ])
            (reg:DI 76 ca))
        (reg:DI 169 [+8 ])))
Successfully matched this instruction:
(set (reg:DI 167 [ D.2366+8 ])
    (plus:DI (reg:DI 5 5 [ val+8 ])
        (reg:DI 76 ca)))
Successfully matched this instruction:
(set (reg:DI 4 4 [+8 ])
    (plus:DI (reg:DI 167 [ D.2366+8 ])
        (reg:DI 169 [+8 ])))
allowing combination of insns 18, 9 and 24
original costs 4 + 8 + 4 = 16
replacement costs 4 + 4 = 8

Here are the three insns involved, sans source line numbers and notes.

(insn 18 17 4 2 (set (reg:DI 165 [ val+8 ])
        (reg:DI 5 5 [ val+8 ])) {*movdi_internal64})
...
(insn 9 8 23 2 (parallel [
            (set (reg:DI 167 [ D.2366+8 ])
                (plus:DI (plus:DI (reg:DI 165 [ val+8 ])
                        (reg:DI 169 [+8 ]))
                    (reg:DI 76 ca)))
            (clobber (reg:DI 76 ca))
        ]) {*adddi3_carry_in_internal})
...
(insn 24 23 15 2 (set (reg:DI 4 4 [+8 ])
        (reg:DI 167 [ D.2366+8 ])) {*movdi_internal64})

So, a move copying an argument register to a pseudo, one insn from the
body of the function, and a move copying a pseudo to a result
register.  The thought I had was:  Is it really combine's business to
look at copies from/to ABI-mandated hard registers?  Isn't removing
the copies something that register allocation can do better?  If so,
then combine is doing unnecessary work.  As a quick hack, I tried the
following.

Index: gcc/combine.c
===================================================================
--- gcc/combine.c	(revision 223431)
+++ gcc/combine.c	(working copy)
@@ -1281,6 +1281,16 @@ combine_instructions (rtx_insn *f, unsigned int nr
 	  if (!NONDEBUG_INSN_P (insn))
 	    continue;
 
+	  if (this_basic_block == EXIT_BLOCK_PTR_FOR_FN (cfun)->prev_bb)
+	    {
+	      rtx set = single_set (insn);
+	      if (set
+		  && REG_P (SET_DEST (set))
+		  && HARD_REGISTER_P (SET_DEST (set))
+		  && REG_P (SET_SRC (set)))
+		continue;
+	    }
+
 	  while (last_combined_insn
 		 && last_combined_insn->deleted ())
 	    last_combined_insn = PREV_INSN (last_combined_insn);

This cures the powerpc64le testcase failure, but Segher said on irc
that I was risking breaking x86 and other targets.  Perhaps that was
trying to push me to fix the underlying combine problem.  :)  In any
case, I didn't believe him, and tested the patch on powerpc64le-linux
and x86_64-linux.  No regressions in --languages=all,go, and an
objdump -d comparison for gcc/*.o against virgin sources shows no
unexpected changes.  powerpc64le-linux actually shows no changes at
all apart from combine.o, while x86_64-linux shows some changes in
register allocation, and cmove arg swapping with inversion of the
condition.  There were no extra instructions.

So, is this worth pursuing in order to speed up combine?  For a
proper patch I'd be inclined to modify create_log_links instead.
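As an aside, for anyone not fluent in PowerPC carry arithmetic,
here's a small C model of what the two sequences above compute (the
struct and function names are mine, purely illustrative):

#include <stdint.h>

/* Illustrative lo/hi halves of a 128-bit value.  */
struct u128 { uint64_t lo, hi; };

/* The -O1 "ideal" sequence: addc then adde.  */
struct u128
add_ideal (struct u128 a, struct u128 b)
{
  struct u128 r;
  r.lo = a.lo + b.lo;		/* addc: low add, sets carry CA.  */
  uint64_t ca = r.lo < a.lo;	/* carry out of the low add.  */
  r.hi = a.hi + b.hi + ca;	/* adde: one insn does hi+hi+CA.  */
  return r;
}

/* The regressed -O2 sequence: addc, addze, add.  */
struct u128
add_regressed (struct u128 a, struct u128 b)
{
  struct u128 r;
  r.lo = a.lo + b.lo;		/* addc: sets CA.  */
  uint64_t ca = r.lo < a.lo;
  uint64_t t = b.hi + ca;	/* addze: one hi word plus CA only.  */
  r.hi = a.hi + t;		/* add: a second insn is needed.  */
  return r;
}

Both compute the same value; the second spends an extra instruction
because the carry was folded into one operand before the final add.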
Incidentally, the underlying problem in combine (well, the first one
I spotted, there might be more) is that

      if (flag_expensive_optimizations)
	{
	  /* Pass pc_rtx so no substitutions are done, just
	     simplifications.  */

"simplifies" this i2src

(plus:DI (plus:DI (reg:DI 165 [ val+8 ])
        (reg:DI 169 [+8 ]))
    (reg:DI 76 ca))

to this

(plus:DI (plus:DI (reg:DI 76 ca)
        (reg:DI 165 [ val+8 ]))
    (reg:DI 169 [+8 ]))

and the latter has the ca register in the wrong place.  So a split is
tried and you get the addze.  The reordering happens inside
simplify_plus_minus.  I'm working on this.

-- 
Alan Modra
Australia Development Lab, IBM