Takayuki 'January June' Suwa <jjsuwa_sys3...@yahoo.co.jp> writes: > On 2024/06/20 22:34, Richard Sandiford wrote: >> This patch adds a combine pass that runs late in the pipeline. >> There are two instances: one between combine and split1, and one >> after postreload. >> >> The pass currently has a single objective: remove definitions by >> substituting into all uses. The pre-RA version tries to restrict >> itself to cases that are likely to have a neutral or beneficial >> effect on register pressure. >> >> The patch fixes PR106594. It also fixes a few FAILs and XFAILs >> in the aarch64 test results, mostly due to making proper use of >> MOVPRFX in cases where we didn't previously. >> >> This is just a first step. I'm hoping that the pass could be >> used for other combine-related optimisations in future. In particular, >> the post-RA version doesn't need to restrict itself to cases where all >> uses are substitutable, since it doesn't have to worry about register >> pressure. If we did that, and if we extended it to handle multi-register >> REGs, the pass might be a viable replacement for regcprop, which in >> turn might reduce the cost of having a post-RA instance of the new pass. >> >> On most targets, the pass is enabled by default at -O2 and above. >> However, it has a tendency to undo x86's STV and RPAD passes, >> by folding the more complex post-STV/RPAD form back into the >> simpler pre-pass form. >> >> Also, running a pass after register allocation means that we can >> now match define_insn_and_splits that were previously only matched >> before register allocation. This trips things like: >> >> (define_insn_and_split "..." >> [...pattern...] >> "...cond..." >> "#" >> "&& 1" >> [...pattern...] >> { >> ...unconditional use of gen_reg_rtx ()...; >> } >> >> because matching and splitting after RA will call gen_reg_rtx when >> pseudos are no longer allowed. rs6000 has several instances of this. > > xtensa also has something like that. > >> xtensa has a variation in which the split condition is: >> >> "&& can_create_pseudo_p ()" >> >> The failure then is that, if we match after RA, we'll never be >> able to split the instruction. > > To be honest, I'm confusing by the possibility of adding a split pattern > application opportunity that depends on the optimization options after > Rel... ah, LRA and before the existing rtl-split2. > > Because I just recently submitted a patch that I expected would reliably > (i.e. regardless of optimization options, etc.) apply the split pattern > first in the rtl-split2 pass after RA, and it was merged. > >> >> The patch therefore disables the pass by default on i386, rs6000 >> and xtensa. Hopefully we can fix those ports later (if their >> maintainers want). It seems easier to add the pass first, though, >> to make it easier to test any such fixes. >> >> gcc.target/aarch64/bitfield-bitint-abi-align{16,8}.c would need >> quite a few updates for the late-combine output. That might be >> worth doing, but it seems too complex to do as part of this patch. >> >> I tried compiling at least one target per CPU directory and comparing >> the assembly output for parts of the GCC testsuite. This is just a way >> of getting a flavour of how the pass performs; it obviously isn't a >> meaningful benchmark. All targets seemed to improve on average: >> >> Target Tests Good Bad %Good Delta Median >> ====== ===== ==== === ===== ===== ====== >> aarch64-linux-gnu 2215 1975 240 89.16% -4159 -1 >> aarch64_be-linux-gnu 1569 1483 86 94.52% -10117 -1 >> alpha-linux-gnu 1454 1370 84 94.22% -9502 -1 >> amdgcn-amdhsa 5122 4671 451 91.19% -35737 -1 >> arc-elf 2166 1932 234 89.20% -37742 -1 >> arm-linux-gnueabi 1953 1661 292 85.05% -12415 -1 >> arm-linux-gnueabihf 1834 1549 285 84.46% -11137 -1 >> avr-elf 4789 4330 459 90.42% -441276 -4 >> bfin-elf 2795 2394 401 85.65% -19252 -1 >> bpf-elf 3122 2928 194 93.79% -8785 -1 >> c6x-elf 2227 1929 298 86.62% -17339 -1 >> cris-elf 3464 3270 194 94.40% -23263 -2 >> csky-elf 2915 2591 324 88.89% -22146 -1 >> epiphany-elf 2399 2304 95 96.04% -28698 -2 >> fr30-elf 7712 7299 413 94.64% -99830 -2 >> frv-linux-gnu 3332 2877 455 86.34% -25108 -1 >> ft32-elf 2775 2667 108 96.11% -25029 -1 >> h8300-elf 3176 2862 314 90.11% -29305 -2 >> hppa64-hp-hpux11.23 4287 4247 40 99.07% -45963 -2 >> ia64-linux-gnu 2343 1946 397 83.06% -9907 -2 >> iq2000-elf 9684 9637 47 99.51% -126557 -2 >> lm32-elf 2681 2608 73 97.28% -59884 -3 >> loongarch64-linux-gnu 1303 1218 85 93.48% -13375 -2 >> m32r-elf 1626 1517 109 93.30% -9323 -2 >> m68k-linux-gnu 3022 2620 402 86.70% -21531 -1 >> mcore-elf 2315 2085 230 90.06% -24160 -1 >> microblaze-elf 2782 2585 197 92.92% -16530 -1 >> mipsel-linux-gnu 1958 1827 131 93.31% -15462 -1 >> mipsisa64-linux-gnu 1655 1488 167 89.91% -16592 -2 >> mmix 4914 4814 100 97.96% -63021 -1 >> mn10300-elf 3639 3320 319 91.23% -34752 -2 >> moxie-rtems 3497 3252 245 92.99% -87305 -3 >> msp430-elf 4353 3876 477 89.04% -23780 -1 >> nds32le-elf 3042 2780 262 91.39% -27320 -1 >> nios2-linux-gnu 1683 1355 328 80.51% -8065 -1 >> nvptx-none 2114 1781 333 84.25% -12589 -2 >> or1k-elf 3045 2699 346 88.64% -14328 -2 >> pdp11 4515 4146 369 91.83% -26047 -2 >> pru-elf 1585 1245 340 78.55% -5225 -1 >> riscv32-elf 2122 2000 122 94.25% -101162 -2 >> riscv64-elf 1841 1726 115 93.75% -49997 -2 >> rl78-elf 2823 2530 293 89.62% -40742 -4 >> rx-elf 2614 2480 134 94.87% -18863 -1 >> s390-linux-gnu 1591 1393 198 87.55% -16696 -1 >> s390x-linux-gnu 2015 1879 136 93.25% -21134 -1 >> sh-linux-gnu 1870 1507 363 80.59% -9491 -1 >> sparc-linux-gnu 1123 1075 48 95.73% -14503 -1 >> sparc-wrs-vxworks 1121 1073 48 95.72% -14578 -1 >> sparc64-linux-gnu 1096 1021 75 93.16% -15003 -1 >> v850-elf 1897 1728 169 91.09% -11078 -1 >> vax-netbsdelf 3035 2995 40 98.68% -27642 -1 >> visium-elf 1392 1106 286 79.45% -7984 -2 >> xstormy16-elf 2577 2071 506 80.36% -13061 -1 >> >> ** snip ** > > To be more frank, once a split pattern is defined, it is applied by > the existing five split paths and possibly by combiners. In most cases, > it is enough to apply it in one of these places, or that is what the > pattern creator intended. > > Wouldn't applying the split pattern indiscriminately in various places > be a waste of execution resources and bring about unexpected and undesired > results? > > I think we need some way to properly control the application of the split > pattern, perhaps some predicate function.
The problem is more the define_insn part of the define_insn_and_split, rather than the define_split part. The number and location of the split passes is the same: anything matched by rtl-late_combine1 will be split by rtl-split1 and anything matched by rtl-late_combine2 will be split by rtl-split2. (If the split condition allows it, of course.) But more things can be matched by rtl-late_combine2 than are matched by other post-RA passes like rtl-postreload. And that's what causes the issue. If: (define_insn_and_split "..." [...pattern...] "...cond..." "#" "&& 1" [...pattern...] { ...unconditional use of gen_reg_rtx ()...; } is matched by rtl-late_combine2, the split will be done by rtl-split2. But the split will ICE, because it isn't valid to call gen_reg_rtx after register allocation. Similarly, if: (define_insn_and_split "..." [...pattern...] "...cond..." "#" "&& can_create_pseudo_p ()" [...pattern...] { ...unconditional use of gen_reg_rtx ()...; } is matched by rtl-late_combine2, the can_create_pseudo_p condition will be false in rtl-split2, and in all subsequent split passes. So we'll still have the unsplit instruction during final, which will ICE because it doesn't have a valid means of implementing the "#". The traditional (and IMO correct) way to handle this is to make the pattern reserve the temporary registers that it needs, using match_scratches. rs6000 has many examples of this. E.g.: (define_insn_and_split "@ieee_128bit_vsx_neg<mode>2" [(set (match_operand:IEEE128 0 "register_operand" "=wa") (neg:IEEE128 (match_operand:IEEE128 1 "register_operand" "wa"))) (clobber (match_scratch:V16QI 2 "=v"))] "TARGET_FLOAT128_TYPE && !TARGET_FLOAT128_HW" "#" "&& 1" [(parallel [(set (match_dup 0) (neg:IEEE128 (match_dup 1))) (use (match_dup 2))])] { if (GET_CODE (operands[2]) == SCRATCH) operands[2] = gen_reg_rtx (V16QImode); emit_insn (gen_ieee_128bit_negative_zero (operands[2])); } [(set_attr "length" "8") (set_attr "type" "vecsimple")]) Before RA, this is just: (set ...) (clobber (scratch:V16QI)) and the split creates a new register. After RA, operand 2 provides the required temporary register: (set ...) (clobber (reg:V16QI TMP)) Another approach is to add can_create_pseudo_p () to the define_insn condition (rather than the split condition). But IMO that's an ICE trap, since insns that have already been matched & accepted shouldn't suddenly become invalid if recog is reattempted later. Thanks, Richard