Segher Boessenkool <seg...@kernel.crashing.org> writes: >> /* Make sure that the value that is to be substituted for the register >> does not use any registers whose values alter in between. However, >> If the insns are adjacent, a use can't cross a set even though we >> think it might (this can happen for a sequence of insns each setting >> the same destination; last_set of that register might point to >> a NOTE). If INSN has a REG_EQUIV note, the register is always >> equivalent to the memory so the substitution is valid even if there >> are intervening stores. Also, don't move a volatile asm or >> UNSPEC_VOLATILE across any other insns. */ >> || (! all_adjacent >> && (((!MEM_P (src) >> || ! find_reg_note (insn, REG_EQUIV, src)) >> && modified_between_p (src, insn, i3)) >> || (GET_CODE (src) == ASM_OPERANDS && MEM_VOLATILE_P (src)) >> || GET_CODE (src) == UNSPEC_VOLATILE)) > > So this would work if you had pseudos here, instead of the hard reg? > Because it is a hard reg it is the same number in both places, making it > hard to move.
Yeah, probably. But the hard reg is a critical part of this. Going back to the example: (set (reg/v:VNx16BI 102 [ ok ]) (reg:VNx16BI 85 ffrt)) (set (reg:VNx16BI 85 ffrt) (unspec:VNx16BI [(reg:VNx16BI 85 ffrt)] UNSPEC_UPDATE_FFRT)) (set (reg:CC_NZC 66 cc) (unspec:CC_NZC [(reg:VNx16BI 106) repeated x2 (const_int 1 [0x1]) (reg/v:VNx16BI 102 [ ok ])] UNSPEC_PTEST)) FFR is the real first fault register. FFRT is actually a fake register whose only purpose is to describe the dependencies (in rtl) between writes to the FFR, reads from the FFR and first-faulting loads. The whole scheme depends on having only one fixed FFRT register. >> > How are dependencies represented in your new pass? If it just does >> > walks over the insn stream for everything, you get quadratic complexity >> > if you move insns backwards. We have that in combine already, mostly >> > from modified_between_p, but that is limited because of how LOG_LINKS >> > work, and we have been doing this for so long and there are no problems >> > found with it, so it must work in practice. But I am worried about it >> > when moving insns back an unlimited distance. >> >> It builds def-use chains, but using a constant limit on the number of >> explicitly-recorded uses. All other uses go in a numerical live range >> from which they (conservatively) never escape. The def-use chains >> represent memory as a single entity, a bit like in gimple. > > Ah. So that range thing ensures correctness. Yeah. > Why don't you use DF for the DU chains? The problem with DF_DU_CHAIN is that it's quadratic in the worst case. fwprop.c gets around that by using the MD problem and having its own dominator walker to calculate limited def-use chains: /* We use the multiple definitions problem to compute our restricted use-def chains. */ So taking that approach here would still require some amount of roll-your-own. Other reasons are: * Even what fwprop does is more elaborate than we need for now. * We need to handle memory too, and it's nice to be able to handle it in the same way as registers. * Updating a full, ordered def-use chain after a move is a linear-time operation, so whatever happens, we'd need to apply some kind of limit on the number of uses we maintain, with something like that integer point range for the rest. * Once we've analysed the insn and built its def-use chains, we don't look at the df_refs again until we update the chains after a successful combination. So it should be more efficient to maintain a small array of insn_info_rec pointers alongside the numerical range, rather than walk and pollute chains of df_refs and then link back the insn uids to the pass-local info. >> >> - it tries using REG_EQUAL notes for the final instruction. >> > >> > And that. >> >> I meant REG_EQUAL notes on i3, i.e. it tries replacing the src of i3 >> with i3's REG_EQUAL note and combining into that. Does combine do that? >> I couldn't see it, and in: >> >> https://gcc.gnu.org/ml/gcc/2019-06/msg00148.html >> >> you seemed to reject the idea of allowing it. > > Yes, I still do. Do you have an example where it helps? I'll run another set of tests for that. >> >> - it can parallelise two independent instructions that both read from >> >> the same register or both read from memory. >> > >> > That only if somehow there is a link between the two (so essentially >> > never). The only combinations tried by combine are those via LOG_LINKs, >> > which are between a SET and the first corresponding use. This is a key >> > factor that makes it kind of linear (instead of exponential) complexity. >> >> Tracking limited def-use chains is what makes this last bit easy. >> We can just try parallelising two instructions from the (bounded) list >> of uses. And for this case there's not any garbage rtl involved, since >> we reuse the same PARALLEL rtx between attempts. The cost is basically >> all in the recog call (which would obviously mount up if we went >> overboard). > > *All* examples above and below are just this. Yeah, the powerpc and s390x examples were. The motivating FFR example above isn't though: it's a def-use combination in parallel with the existing definition. > If you disable everything else, what do the statistics look like then? Had no idea how this would turn out -- which is a good sign it was worth doing -- but: results below. >> > One thing I want to do is some mini-combine after every split, probably >> > only with the insns new from the split. But we have no cfglayout mode >> > anymore then, and only hard regs (except in the first split pass, which >> > is just a little later than your new pass). >> >> Yeah, sounds like it could be useful. I guess there'd need to be >> an extra condition on the combination that the new insn can't be >> immediately split. > > It would run *after* split. Not interleaved with it. Yeah. But what I meant was: a lot of insns that are split after reload are combined for RA purposes and the split form is really the preferred form (especially for scheduling). So if we have a combine pass *after* split, I think it should avoid using any combination that matches a split. >> > And amount of garbage produced? >> >> If -ftime-report stats are accurate, then the total amount of >> memory allocated is: >> >> run-combine=2 (normal combine): 1793 kB >> run-combine=4 (new pass only): 98 kB >> run-combine=6 (both passes): 1871 kB (new pass accounts for 78 kB) >> >> But again that's not a fair comparison when the main combine pass does more. > > The way combine does SUBST is pretty fundamental to how it works (it can > be ripped out, and probably we'll have to at some point, but that will be > very invasive). Originally all this temporary RTL was on obstacks and > reaping it was cheap, but everything is GCed now (fixing the bugs was not > cheap :-) ) Yeah, I remember :-) > If you look at even really bad cases, combine is still only a few > percent of total, so it isn't too bad. > >> I did try hard to keep the amount of garbage rtl down though. This is >> why I added validate_simplify_replace_rtx rather than trying to make >> do with existing routines. It should only create new rtl if the >> simplification routines did something useful. (Of course, that's mostly >> true of combine as well, but things like the make_compound_operation/ >> expand_compound_operation wrangler can create expressions that are never >> actually useful.) > > Don't mention those, thanks :-) > >> >> To get a feel for the effect on multiple targets, I did my usual >> >> bogo-comparison of number of lines of asm for gcc.c-torture, gcc.dg >> >> and g++.dg, this time comparing run-combine=2 and run-combine=6 >> >> using -O2 -ftree-vectorize: >> > >> > One problem with this is that these are very short functions on average. >> >> There are some long ones too :-) > > Yes, but this isn't a good stand-in for representative programs. Right. And number of lines of asm isn't a good stand-in for anything much. Like I say, the whole thing is just to get a feel, on tests that are readily to hand and are easy to compile without a full toolchain. >> > What is the kind of changes you see for other targets? >> >> On powerpc64le-linux-gnu it mostly comes from eliminating comparisons >> in favour of other flag-setting instructions and making more use of >> post-increments. Not sure the last one is actually a win, but the >> target costs say it's OK :-). E.g. from gcc.c-torture/execute/pr78675.c: >> >> @@ -48,9 +48,8 @@ >> blr >> .align 4 >> .L19: >> - cmpdi 0,10,0 >> + mr. 9,10 >> mr 3,8 >> - mr 9,10 >> bne 0,.L9 >> b .L3 >> .align 4 > > Okay, so this combining two uses of r10 into one insn. > > This isn't necessarily a good idea: the combined insn cannot be moved as > much as one of its components could, which can also immediately prevent > further combinations. > > But doing this after combine, as you do, is probably beneficial. > >> and a slightly more interesting example in gcc.c-torture/execute/loop-6.c: > > This is the same thing (we do andi. a,b,0xff instead of rlwinm. a,b,0,0xff > because this is cheaper on p7 and p8). > >> gcc.c-torture/execute/20081218-1.c is an example where we make more use >> of post-increment: >> >> .L9: >> - lbz 10,1(9) >> - addi 9,9,1 >> + lbzu 10,1(9) >> cmpwi 0,10,38 >> bne 0,.L8 >> - lbz 10,1(9) >> - addi 9,9,1 >> + lbzu 10,1(9) >> cmpwi 0,10,38 >> bne 0,.L8 >> bdnz .L9 > > Pre-increment (we only *have* pre-modify memory accesses). Oops, yes. >> /* Mimic combine's behavior by not combining moves from allocatable hard >> registers (e.g. when copying parameters or function return values). */ >> if (REG_P (src) && HARD_REGISTER_P (src) && !fixed_regs[REGNO (src)]) >> return false; >> >> Although if that could have accounted for the difference, it sounds like >> we're leaving a lot on the table by doing this :-) > > It actually helps (and quite a bit). But if your test cases are mainly > tiny functions, anything can happen. But since you see this across all > targets, it must be doing something good :-) > > > So I'd love to see statistics for *only* combining two uses of the same > thing, this is something combine cannot do, and arguably *shouldn't* do! OK, here are two sets of results. The first is for: (A) --param run-combine=2 (current combine only) (B) --param run-combine=6 (both passes), use-use combinations only Target Tests Delta Best Worst Median ====== ===== ===== ==== ===== ====== aarch64-linux-gnu 158 3060 -72 520 -1 aarch64_be-linux-gnu 111 24 -57 324 -1 alpha-linux-gnu 3 3 1 1 1 amdgcn-amdhsa 18 71 -17 26 1 arc-elf 310 -4414 -1516 356 1 arm-linux-gnueabi 28 -50 -13 3 -1 arm-linux-gnueabihf 28 -50 -13 3 -1 avr-elf 26 308 -1 36 12 bfin-elf 6 8 -1 3 1 bpf-elf 10 21 -1 6 1 c6x-elf 7 9 -6 6 1 cr16-elf 13 102 1 27 2 cris-elf 35 -1001 -700 3 -2 csky-elf 9 28 1 6 2 epiphany-elf 29 -29 -2 1 -1 fr30-elf 12 17 -1 5 1 frv-linux-gnu 1 -2 -2 -2 -2 ft32-elf 10 22 -1 5 2 h8300-elf 29 56 -22 14 2 hppa64-hp-hpux11.23 9 17 -1 4 2 i686-apple-darwin 10 -33 -20 12 -2 i686-pc-linux-gnu 41 243 -12 33 3 ia64-linux-gnu 28 -32 -29 39 -4 iq2000-elf 6 8 1 2 1 lm32-elf 10 12 -3 5 1 m32r-elf 3 2 -2 2 2 m68k-linux-gnu 19 27 -2 5 2 mcore-elf 14 23 -10 6 2 microblaze-elf 5 5 1 1 1 mipsel-linux-gnu 9 12 -5 6 2 mipsisa64-linux-gnu 7 1 -3 1 1 mmix 6 6 -2 4 1 mn10300-elf 20 15 -4 5 1 moxie-rtems 8 11 -2 3 1 msp430-elf 8 24 1 6 2 nds32le-elf 91 -188 -24 136 -1 nios2-linux-gnu 2 6 1 5 1 nvptx-none 396 756 1 16 1 or1k-elf 8 20 1 4 2 pdp11 65 149 -10 45 2 powerpc-ibm-aix7.0 1039 1114 -366 2124 -1 powerpc64-linux-gnu 854 2753 -274 3094 -2 powerpc64le-linux-gnu 648 -551 -340 208 -1 pru-elf 5 5 -2 3 1 riscv32-elf 7 6 -2 5 1 riscv64-elf 2 5 2 3 2 rl78-elf 80 -648 -98 13 -4 rx-elf 16 2 -4 5 -1 s390-linux-gnu 60 -174 -39 14 -1 s390x-linux-gnu 152 -781 -159 14 -1 sh-linux-gnu 13 5 -15 7 1 sparc-linux-gnu 29 7 -3 11 -1 sparc64-linux-gnu 51 1 -8 15 -1 tilepro-linux-gnu 119 -567 -164 15 -2 v850-elf 4 4 -1 3 1 vax-netbsdelf 10 13 -4 5 1 visium-elf 4 0 -5 3 1 x86_64-darwin 7 -12 -9 4 -2 x86_64-linux-gnu 6 -11 -6 4 -2 xstormy16-elf 10 13 1 2 1 xtensa-elf 6 8 -1 2 2 which definitely shows up some outliers I need to look at. The second set is for: (B) --param run-combine=6 (both passes), use-use combinations only (C) --param run-combine=6 (both passes), no restrictions Target Tests Delta Best Worst Median ====== ===== ===== ==== ===== ====== aarch64-linux-gnu 272 -3844 -585 18 -1 aarch64_be-linux-gnu 190 -3336 -370 18 -1 alpha-linux-gnu 401 -2735 -370 22 -2 amdgcn-amdhsa 188 1867 -484 1259 -1 arc-elf 257 -1498 -650 54 -1 arm-linux-gnueabi 168 -1117 -612 680 -1 arm-linux-gnueabihf 168 -1117 -612 680 -1 avr-elf 1341 -111401 -13824 680 -10 bfin-elf 1346 -18950 -8461 465 -2 bpf-elf 63 -496 -60 3 -2 c6x-elf 179 -10527 -10084 41 -2 cr16-elf 1616 -51479 -10657 42 -13 cris-elf 113 -533 -84 4 -2 csky-elf 129 -3399 -474 1 -2 epiphany-elf 151 -375 -149 84 -1 fr30-elf 155 -1773 -756 289 -2 frv-linux-gnu 808 -13332 -2074 67 -1 ft32-elf 276 -1688 -111 -1 -2 h8300-elf 527 -11522 -1747 68 -3 hppa64-hp-hpux11.23 179 -865 -142 34 -1 i686-apple-darwin 335 -1266 -56 44 -1 i686-pc-linux-gnu 222 -2216 -556 32 -1 ia64-linux-gnu 122 -4793 -1134 40 -5 iq2000-elf 171 -1341 -61 3 -2 lm32-elf 187 -1814 -316 47 -2 m32r-elf 70 -597 -98 11 -2 m68k-linux-gnu 197 -2375 -332 148 -2 mcore-elf 125 -1236 -146 7 -1 microblaze-elf 442 -4498 -2094 32 -2 mipsel-linux-gnu 125 -2050 -222 60 -2 mipsisa64-linux-gnu 107 -2015 -130 14 -2 mmix 103 -239 -26 4 -1 mn10300-elf 215 -1039 -234 80 -1 moxie-rtems 149 -754 -79 4 -2 msp430-elf 180 -600 -63 19 -1 nds32le-elf 183 -287 -37 32 -1 nios2-linux-gnu 81 -329 -66 4 -1 nvptx-none 200 -1882 -208 -2 -2 or1k-elf 57 -317 -25 2 -1 pdp11 207 -1441 -182 83 -2 powerpc-ibm-aix7.0 400 -4145 -271 14 -2 powerpc64-linux-gnu 375 -2062 -160 117 -2 powerpc64le-linux-gnu 491 -4169 -700 156 -2 pru-elf 47 -7020 -6921 6 -1 riscv32-elf 59 -1379 -139 7 -2 riscv64-elf 89 -1562 -264 7 -1 rl78-elf 289 -16157 -1665 42 -6 rx-elf 82 -195 -53 8 -1 s390-linux-gnu 128 -2108 -1485 63 -1 s390x-linux-gnu 112 418 -32 522 -1 sh-linux-gnu 218 -410 -108 68 -1 sparc-linux-gnu 141 -866 -99 18 -1 sparc64-linux-gnu 129 -792 -102 3 -2 tilepro-linux-gnu 953 -4331 -297 332 -2 v850-elf 50 -412 -53 2 -3 vax-netbsdelf 254 -3328 -400 4 -2 visium-elf 100 -693 -138 16 -1 x86_64-darwin 345 -2134 -490 72 -1 x86_64-linux-gnu 307 -843 -288 210 -1 xstormy16-elf 218 -788 -156 59 -1 xtensa-elf 195 -1426 -322 36 1 So the main benefit does seem to come from the def-use part. Here are some powerpc64le-linux-gnu examples of (B)->(C): gcc.c-torture/execute/20171008-1.c: @@ -79,8 +79,7 @@ stdu 1,-32(1) .LCFI5: bl foo - rlwinm 3,3,0,0xff - cmpwi 0,3,0 + andi. 9,3,0xff bne 0,.L13 addi 1,1,32 gcc.c-torture/execute/pr28982a.c: @@ -427,15 +427,13 @@ stxvd2x 0,7,6 .align 4 .L9: - xxlor 12,32,32 + xvcvsxwsp 0,32 vadduwm 0,0,1 addi 10,9,16 - xvcvsxwsp 0,12 - xxlor 12,32,32 + stxvd2x 0,0,9 + xvcvsxwsp 0,32 + addi 9,9,32 vadduwm 0,0,1 - stxvd2x 0,0,9 - xvcvsxwsp 0,12 - addi 9,9,32 stxvd2x 0,0,10 bdnz .L9 li 3,4 (Disclaimer: I have no idea if that's correct.) gcc.c-torture/execute/pr65215-3.c: @@ -56,11 +56,10 @@ srdi 10,3,32 srdi 9,3,56 slwi 6,10,24 - srwi 7,10,8 + rlwinm 7,10,24,16,23 or 9,9,6 - rlwinm 7,7,0,16,23 + rlwinm 10,10,8,8,15 or 9,9,7 - rlwinm 10,10,8,8,15 or 9,9,10 cmpw 0,9,8 bne 0,.L4 Just to emphasise though: I'm not proposing that we switch this on for all targets yet. It would be opt-in until the pass is more mature. But that FFR case is really important for the situation it handles. Thanks, Richard