On Wed, Nov 20, 2019 at 06:20:34PM +0000, Richard Sandiford wrote: > Segher Boessenkool <seg...@kernel.crashing.org> writes: > > So this would work if you had pseudos here, instead of the hard reg? > > Because it is a hard reg it is the same number in both places, making it > > hard to move. > > Yeah, probably. But the hard reg is a critical part of this. > Going back to the example: > > (set (reg/v:VNx16BI 102 [ ok ]) > (reg:VNx16BI 85 ffrt)) > (set (reg:VNx16BI 85 ffrt) > (unspec:VNx16BI [(reg:VNx16BI 85 ffrt)] UNSPEC_UPDATE_FFRT)) > (set (reg:CC_NZC 66 cc) > (unspec:CC_NZC > [(reg:VNx16BI 106) repeated x2 > (const_int 1 [0x1]) > (reg/v:VNx16BI 102 [ ok ])] UNSPEC_PTEST)) > > FFR is the real first fault register. FFRT is actually a fake register > whose only purpose is to describe the dependencies (in rtl) between writes > to the FFR, reads from the FFR and first-faulting loads. The whole scheme > depends on having only one fixed FFRT register.
Right. The reason this cannot work in combine is that combine always combines to just *one* insn, at i3; later, if it turns out that it needs to split it, it can put something at i2. But that doesn't even happen here, only the first and the last of those three insns are what is combined. It is important combine only moves things forward in the insn stream, to make sure this whole process is finite. Or this was true years ago, at least :-) > > Why don't you use DF for the DU chains? > > The problem with DF_DU_CHAIN is that it's quadratic in the worst case. Oh, wow. > fwprop.c gets around that by using the MD problem and having its own > dominator walker to calculate limited def-use chains: > > /* We use the multiple definitions problem to compute our restricted > use-def chains. */ It's not great if every pass invents its own version of some common infrastructure thing because that common one is not suitable. I.e., can this be fixed somehow? Maybe just by having a restricted DU chains df problem? > So taking that approach here would still require some amount of > roll-your-own. Other reasons are: > > * Even what fwprop does is more elaborate than we need for now. > > * We need to handle memory too, and it's nice to be able to handle > it in the same way as registers. > > * Updating a full, ordered def-use chain after a move is a linear-time > operation, so whatever happens, we'd need to apply some kind of limit > on the number of uses we maintain, with something like that integer > point range for the rest. > > * Once we've analysed the insn and built its def-use chains, we don't > look at the df_refs again until we update the chains after a successful > combination. So it should be more efficient to maintain a small array > of insn_info_rec pointers alongside the numerical range, rather than > walk and pollute chains of df_refs and then link back the insn uids > to the pass-local info. So you need something like combine's LOG_LINKS? Not that handling those is not quadratic in the worst case, but in practice it works well. And it *could* be made linear. > >> Tracking limited def-use chains is what makes this last bit easy. > >> We can just try parallelising two instructions from the (bounded) list > >> of uses. And for this case there's not any garbage rtl involved, since > >> we reuse the same PARALLEL rtx between attempts. The cost is basically > >> all in the recog call (which would obviously mount up if we went > >> overboard). > > > > *All* examples above and below are just this. > > Yeah, the powerpc and s390x examples were. The motivating FFR example > above isn't though: it's a def-use combination in parallel with the > existing definition. Right, good point :-) > >> >> To get a feel for the effect on multiple targets, I did my usual > >> >> bogo-comparison of number of lines of asm for gcc.c-torture, gcc.dg > >> >> and g++.dg, this time comparing run-combine=2 and run-combine=6 > >> >> using -O2 -ftree-vectorize: > >> > > >> > One problem with this is that these are very short functions on average. > >> > >> There are some long ones too :-) > > > > Yes, but this isn't a good stand-in for representative programs. > > Right. And number of lines of asm isn't a good stand-in for anything much. For combine, number of insns generated is a surprisingly good measure of how it performed. Sometimes not, when it goes over a border of an inlining decision, say, or bb-reorder decides to duplicate more because it is cheaper now. > Like I say, the whole thing is just to get a feel, on tests that are readily > to hand and are easy to compile without a full toolchain. Absolutely. But I have no experience with using your test set, so the numbers do not necessarily mean so much to me :-) > > So I'd love to see statistics for *only* combining two uses of the same > > thing, this is something combine cannot do, and arguably *shouldn't* do! > > OK, here are two sets of results. The first is for: > > (A) --param run-combine=2 (current combine only) > (B) --param run-combine=6 (both passes), use-use combinations only > > Target Tests Delta Best Worst Median > ====== ===== ===== ==== ===== ====== > aarch64-linux-gnu 158 3060 -72 520 -1 > aarch64_be-linux-gnu 111 24 -57 324 -1 > alpha-linux-gnu 3 3 1 1 1 > amdgcn-amdhsa 18 71 -17 26 1 > arc-elf 310 -4414 -1516 356 1 > arm-linux-gnueabi 28 -50 -13 3 -1 > arm-linux-gnueabihf 28 -50 -13 3 -1 > avr-elf 26 308 -1 36 12 > bfin-elf 6 8 -1 3 1 > bpf-elf 10 21 -1 6 1 > c6x-elf 7 9 -6 6 1 > cr16-elf 13 102 1 27 2 > cris-elf 35 -1001 -700 3 -2 > csky-elf 9 28 1 6 2 > epiphany-elf 29 -29 -2 1 -1 > fr30-elf 12 17 -1 5 1 > frv-linux-gnu 1 -2 -2 -2 -2 > ft32-elf 10 22 -1 5 2 > h8300-elf 29 56 -22 14 2 > hppa64-hp-hpux11.23 9 17 -1 4 2 > i686-apple-darwin 10 -33 -20 12 -2 > i686-pc-linux-gnu 41 243 -12 33 3 > ia64-linux-gnu 28 -32 -29 39 -4 > iq2000-elf 6 8 1 2 1 > lm32-elf 10 12 -3 5 1 > m32r-elf 3 2 -2 2 2 > m68k-linux-gnu 19 27 -2 5 2 > mcore-elf 14 23 -10 6 2 > microblaze-elf 5 5 1 1 1 > mipsel-linux-gnu 9 12 -5 6 2 > mipsisa64-linux-gnu 7 1 -3 1 1 > mmix 6 6 -2 4 1 > mn10300-elf 20 15 -4 5 1 > moxie-rtems 8 11 -2 3 1 > msp430-elf 8 24 1 6 2 > nds32le-elf 91 -188 -24 136 -1 > nios2-linux-gnu 2 6 1 5 1 > nvptx-none 396 756 1 16 1 > or1k-elf 8 20 1 4 2 > pdp11 65 149 -10 45 2 > powerpc-ibm-aix7.0 1039 1114 -366 2124 -1 > powerpc64-linux-gnu 854 2753 -274 3094 -2 > powerpc64le-linux-gnu 648 -551 -340 208 -1 > pru-elf 5 5 -2 3 1 > riscv32-elf 7 6 -2 5 1 > riscv64-elf 2 5 2 3 2 > rl78-elf 80 -648 -98 13 -4 > rx-elf 16 2 -4 5 -1 > s390-linux-gnu 60 -174 -39 14 -1 > s390x-linux-gnu 152 -781 -159 14 -1 > sh-linux-gnu 13 5 -15 7 1 > sparc-linux-gnu 29 7 -3 11 -1 > sparc64-linux-gnu 51 1 -8 15 -1 > tilepro-linux-gnu 119 -567 -164 15 -2 > v850-elf 4 4 -1 3 1 > vax-netbsdelf 10 13 -4 5 1 > visium-elf 4 0 -5 3 1 > x86_64-darwin 7 -12 -9 4 -2 > x86_64-linux-gnu 6 -11 -6 4 -2 > xstormy16-elf 10 13 1 2 1 > xtensa-elf 6 8 -1 2 2 > > which definitely shows up some outliers I need to look at. Yeah, huh, it's all over the map. > The second set is for: > > (B) --param run-combine=6 (both passes), use-use combinations only > (C) --param run-combine=6 (both passes), no restrictions > > Target Tests Delta Best Worst Median > ====== ===== ===== ==== ===== ====== > aarch64-linux-gnu 272 -3844 -585 18 -1 > aarch64_be-linux-gnu 190 -3336 -370 18 -1 > alpha-linux-gnu 401 -2735 -370 22 -2 > amdgcn-amdhsa 188 1867 -484 1259 -1 > arc-elf 257 -1498 -650 54 -1 > arm-linux-gnueabi 168 -1117 -612 680 -1 > arm-linux-gnueabihf 168 -1117 -612 680 -1 > avr-elf 1341 -111401 -13824 680 -10 Things like this are kind of suspicious :-) > bfin-elf 1346 -18950 -8461 465 -2 > bpf-elf 63 -496 -60 3 -2 > c6x-elf 179 -10527 -10084 41 -2 > cr16-elf 1616 -51479 -10657 42 -13 > cris-elf 113 -533 -84 4 -2 > csky-elf 129 -3399 -474 1 -2 > epiphany-elf 151 -375 -149 84 -1 > fr30-elf 155 -1773 -756 289 -2 > frv-linux-gnu 808 -13332 -2074 67 -1 > ft32-elf 276 -1688 -111 -1 -2 > h8300-elf 527 -11522 -1747 68 -3 > hppa64-hp-hpux11.23 179 -865 -142 34 -1 > i686-apple-darwin 335 -1266 -56 44 -1 > i686-pc-linux-gnu 222 -2216 -556 32 -1 > ia64-linux-gnu 122 -4793 -1134 40 -5 > iq2000-elf 171 -1341 -61 3 -2 > lm32-elf 187 -1814 -316 47 -2 > m32r-elf 70 -597 -98 11 -2 > m68k-linux-gnu 197 -2375 -332 148 -2 > mcore-elf 125 -1236 -146 7 -1 > microblaze-elf 442 -4498 -2094 32 -2 > mipsel-linux-gnu 125 -2050 -222 60 -2 > mipsisa64-linux-gnu 107 -2015 -130 14 -2 > mmix 103 -239 -26 4 -1 > mn10300-elf 215 -1039 -234 80 -1 > moxie-rtems 149 -754 -79 4 -2 > msp430-elf 180 -600 -63 19 -1 > nds32le-elf 183 -287 -37 32 -1 > nios2-linux-gnu 81 -329 -66 4 -1 > nvptx-none 200 -1882 -208 -2 -2 > or1k-elf 57 -317 -25 2 -1 > pdp11 207 -1441 -182 83 -2 > powerpc-ibm-aix7.0 400 -4145 -271 14 -2 > powerpc64-linux-gnu 375 -2062 -160 117 -2 > powerpc64le-linux-gnu 491 -4169 -700 156 -2 > pru-elf 47 -7020 -6921 6 -1 > riscv32-elf 59 -1379 -139 7 -2 > riscv64-elf 89 -1562 -264 7 -1 > rl78-elf 289 -16157 -1665 42 -6 > rx-elf 82 -195 -53 8 -1 > s390-linux-gnu 128 -2108 -1485 63 -1 > s390x-linux-gnu 112 418 -32 522 -1 > sh-linux-gnu 218 -410 -108 68 -1 > sparc-linux-gnu 141 -866 -99 18 -1 > sparc64-linux-gnu 129 -792 -102 3 -2 > tilepro-linux-gnu 953 -4331 -297 332 -2 > v850-elf 50 -412 -53 2 -3 > vax-netbsdelf 254 -3328 -400 4 -2 > visium-elf 100 -693 -138 16 -1 > x86_64-darwin 345 -2134 -490 72 -1 > x86_64-linux-gnu 307 -843 -288 210 -1 > xstormy16-elf 218 -788 -156 59 -1 > xtensa-elf 195 -1426 -322 36 1 > > So the main benefit does seem to come from the def-use part. > > Here are some powerpc64le-linux-gnu examples of (B)->(C): > > gcc.c-torture/execute/20171008-1.c: > > @@ -79,8 +79,7 @@ > stdu 1,-32(1) > .LCFI5: > bl foo > - rlwinm 3,3,0,0xff > - cmpwi 0,3,0 > + andi. 9,3,0xff > bne 0,.L13 > addi 1,1,32 Soo this starts as insn_cost 4 for 6: r118:SI=r124:SI REG_DEAD r124:SI insn_cost 4 for 8: r121:SI=zero_extend(r118:SI#0) REG_DEAD r118:SI insn_cost 4 for 9: r122:CC=cmp(r121:SI,0) REG_DEAD r121:SI and then it combines 6->8 of course, but then Trying 8 -> 9: 8: r121:SI=zero_extend(r124:SI#0) REG_DEAD r124:SI 9: r122:CC=cmp(r121:SI,0) REG_DEAD r121:SI Failed to match this instruction: (set (reg:CC 122) (compare:CC (subreg:QI (reg:SI 124) 0) (const_int 0 [0]))) Hrm, that is a bad idea in general, why do we do that. > gcc.c-torture/execute/pr28982a.c: > > @@ -427,15 +427,13 @@ > stxvd2x 0,7,6 > .align 4 > .L9: > - xxlor 12,32,32 > + xvcvsxwsp 0,32 > vadduwm 0,0,1 > addi 10,9,16 > - xvcvsxwsp 0,12 > - xxlor 12,32,32 > + stxvd2x 0,0,9 > + xvcvsxwsp 0,32 > + addi 9,9,32 > vadduwm 0,0,1 > - stxvd2x 0,0,9 > - xvcvsxwsp 0,12 > - addi 9,9,32 > stxvd2x 0,0,10 > bdnz .L9 > li 3,4 > > (Disclaimer: I have no idea if that's correct.) This seems to be -O3 -mcpu=power8. It look to be correct just fine. It saves the two xxlor insns (which are just register move instructions). RA couldn't fix this up because there are two uses of the register (the xvcvsxwsp -- convert V4SI to V4SF; and the vadduwm -- V4SI addition), and RA doesn't reorder code. I would say your new code is better. > gcc.c-torture/execute/pr65215-3.c: > > @@ -56,11 +56,10 @@ > srdi 10,3,32 > srdi 9,3,56 > slwi 6,10,24 > - srwi 7,10,8 > + rlwinm 7,10,24,16,23 > or 9,9,6 > - rlwinm 7,7,0,16,23 > + rlwinm 10,10,8,8,15 > or 9,9,7 > - rlwinm 10,10,8,8,15 > or 9,9,10 > cmpw 0,9,8 > bne 0,.L4 insn_cost 4 for 15: r139:SI=r118:DI#0 0>>0x8 insn_cost 4 for 16: r140:SI=r139:SI&0xff00 REG_DEAD r139:SI (that's both of those insns setting r7). r118 does not die here yet, that's only in the 10,10 insn. Trying 15 -> 16: 15: r139:SI=r118:DI#0 0>>0x8 16: r140:SI=r139:SI&0xff00 REG_DEAD r139:SI Failed to match this instruction: (set (reg:SI 140) (and:SI (subreg:SI (zero_extract:DI (reg:DI 118 [ _2 ]) (const_int 32 [0x20]) (const_int 24 [0x18])) 0) (const_int 65280 [0xff00]))) Failed to match this instruction: (set (reg:SI 140) (and:SI (subreg:SI (and:DI (lshiftrt:DI (reg:DI 118 [ _2 ]) (const_int 8 [0x8])) (const_int 4294967295 [0xffffffff])) 0) (const_int 65280 [0xff00]))) Yeah, it's one of those, make_compound_insn :-/ (*^%$(*^$(*@^ > Just to emphasise though: I'm not proposing that we switch this on for > all targets yet. It would be opt-in until the pass is more mature. I do have to wonder if it is a bit late for stage 1. But opt-in as in, the user has to use some flag, that should be fine I guess? But default for some targets might not be so great, esp. primary targets. > But that FFR case is really important for the situation it handles. Yeah. I hope to have some time to review your actual patch soon. Should be less depressing than some of the combine failures :-) Segher