On Thu, May 21, 2015 at 08:06:04PM +0930, Alan Modra wrote: > FAIL: gcc.target/powerpc/ti_math1.c scan-assembler-times adde 1 > is seen on powerpc64le-linux since somewhere between revision 218587 > and 218616. See > https://gcc.gnu.org/ml/gcc-testresults/2014-12/msg01287.html and > https://gcc.gnu.org/ml/gcc-testresults/2014-12/msg01325.html > > A regression hunt fingers one of Segher's 2014-12-10 patches to the > rs6000 backend, git commit 0f1bedb4 or svn rev 218595.
It doesn't trigger on big-endian; what is different? The order of tried combinations I suppose, because the loads are swapped? > Segher might > argue that generated code is better after this commit, and I'd agree > that his change is a good one in general, but even so it would be nice > to generate the ideal code. Yes (to all of it). It was a big rip-up, left better generated code than we had before, and simplified the rs6000 backend code. Tuning it again now can only make things better :-) > Curiously, the ideal code is generated at > -O1, but we regress at -O2.. Skipping all the way to the end of your mail, that's because of flag_expensive_optimizations (if nothing else). > before after ideal (-O1) > add_128: add_128: add_128: > ld 10,0(3) ld 9,0(3) ld 9,0(3) > ld 11,8(3) ld 10,8(3) ld 10,8(3) > addc 8,4,10 addc 3,4,9 addc 3,4,9 > adde 9,5,11 addze 5,5 adde 4,5,10 > mr 3,8 add 4,5,10 blr > mr 4,9 blr > blr > > I went looking into where the addze appeared, and found combine. That wasn't a big surprise I hope :-) > Trying 18, 9 -> 24: > Failed to match this instruction: > (set (reg:DI 4 4 [+8 ]) > (plus:DI (plus:DI (reg:DI 5 5 [ val+8 ]) > (reg:DI 76 ca)) > (reg:DI 169 [+8 ]))) For some reason it has the CA reg not last. I think we should add to the canonicalisation rules so that fixed regs sort after other regs. That requires a lot of testing. For even slightly less trivial code, combine usually tries something in the "correct" order as well, btw, which is why the current code works as well as it does. > Successfully matched this instruction: > (set (reg:DI 167 [ D.2366+8 ]) > (plus:DI (reg:DI 5 5 [ val+8 ]) > (reg:DI 76 ca))) > Successfully matched this instruction: > (set (reg:DI 4 4 [+8 ]) > (plus:DI (reg:DI 167 [ D.2366+8 ]) > (reg:DI 169 [+8 ]))) > allowing combination of insns 18, 9 and 24 > original costs 4 + 8 + 4 = 16 > replacement costs 4 + 4 = 8 Still need to fix the costs as well (but they work as-is; well enough that is). > Here are the three insns involved, sans source line numbers and notes. > > (insn 18 17 4 2 (set (reg:DI 165 [ val+8 ]) > (reg:DI 5 5 [ val+8 ])) {*movdi_internal64}) > ... > (insn 9 8 23 2 (parallel [ > (set (reg:DI 167 [ D.2366+8 ]) > (plus:DI (plus:DI (reg:DI 165 [ val+8 ]) > (reg:DI 169 [+8 ])) > (reg:DI 76 ca))) > (clobber (reg:DI 76 ca)) > ]) {*adddi3_carry_in_internal}) > ... > (insn 24 23 15 2 (set (reg:DI 4 4 [+8 ]) > (reg:DI 167 [ D.2366+8 ])) {*movdi_internal64}) > > So, a move copying an argument register to a pseudo, one insn from the > body of the function, and a move copying a pseudo to a result > register. The thought I had was: It is really combine's business to > look at copies from/to ABI mandated hard registers? Isn't removing > the copies something that register allocation can do better? Yes, a well-known problem, and one I intended to fix for GCC 6. > If so, then combine is doing unnecessary work. Not only that, it pessimises generated code (as you see here; but more often it prevents optimal register allocation because it already has put a hard reg somewhere). > As a quick hack, I tried the following. > > Index: gcc/combine.c > =================================================================== > --- gcc/combine.c (revision 223431) > +++ gcc/combine.c (working copy) > @@ -1281,6 +1281,16 @@ combine_instructions (rtx_insn *f, unsigned int nr > if (!NONDEBUG_INSN_P (insn)) > continue; > > + if (this_basic_block == EXIT_BLOCK_PTR_FOR_FN (cfun)->prev_bb) Are these copies guaranteed to (still) be in this basic block, after the passes before combine? Did those passes do anything to prevent moving it? I'm asking because it would be good to use the same conditions in that case. > + { > + rtx set = single_set (insn); > + if (set > + && REG_P (SET_DEST (set)) > + && HARD_REGISTER_P (SET_DEST (set)) > + && REG_P (SET_SRC (set))) > + continue; > + } > + > while (last_combined_insn > && last_combined_insn->deleted ()) > last_combined_insn = PREV_INSN (last_combined_insn); > > This cures the powerpc64le testcase failure, but Segher said on irc I > was risking breaking x86 and other targets. If you don't combine any of the parameter passing insns, yes. Where "break" means just generate worse code, of course. Here you look only at copying a pseudo to a hard reg (not to memory or similar). That might be good enough in the exit block. (Needs a code comment). > Perhaps that was trying > to push me to fix the underlying combine problem. :) Good :-) > In any case, I didn't believe him, Even better! > and tested the patch on powerpc64le-linux and > x86_64-linux. No regressions in --languages=all,go and objdump -d > comparison for gcc/*.o against virgin source show no unexpected > changes. powerpc64le-linux actually shows no changes at all apart > from combine.o while x86_64-linux shows some changes in register > allocation and cmove arg swapping with inversion of the condition. > There were no extra instructions. > > So, is this worth pursuing in order to speed up combine? ... and make it generate better code. Yes, please do. Could you also look at paramter passing though (as opposed to return value passing)? It is much more important. > I'd be > inclined to patch create_log_links instead for a proper patch. Yes. > Incidentally the underlying problem in combine (well the first one I > spotted, there might be more), is that > if (flag_expensive_optimizations) > { > /* Pass pc_rtx so no substitutions are done, just > simplifications. */ > "simplifies" this i2src > > (plus:DI (plus:DI (reg:DI 165 [ val+8 ]) > (reg:DI 169 [+8 ])) > (reg:DI 76 ca)) > > to this > > (plus:DI (plus:DI (reg:DI 76 ca) > (reg:DI 165 [ val+8 ])) > (reg:DI 169 [+8 ])) > > and the latter has the ca register in the wrong place. So a split is > tried and you get addze. I'm working on this. The reordering happens > inside simplify_plus_minus. Ooh, something else I don't have to look at (yet)! Thank you Alan, Segher