Takayuki 'January June' Suwa <[email protected]> writes:
> On 2024/06/20 22:34, Richard Sandiford wrote:
>> This patch adds a combine pass that runs late in the pipeline.
>> There are two instances: one between combine and split1, and one
>> after postreload.
>>
>> The pass currently has a single objective: remove definitions by
>> substituting into all uses. The pre-RA version tries to restrict
>> itself to cases that are likely to have a neutral or beneficial
>> effect on register pressure.
>>
>> The patch fixes PR106594. It also fixes a few FAILs and XFAILs
>> in the aarch64 test results, mostly due to making proper use of
>> MOVPRFX in cases where we didn't previously.
>>
>> This is just a first step. I'm hoping that the pass could be
>> used for other combine-related optimisations in future. In particular,
>> the post-RA version doesn't need to restrict itself to cases where all
>> uses are substitutable, since it doesn't have to worry about register
>> pressure. If we did that, and if we extended it to handle multi-register
>> REGs, the pass might be a viable replacement for regcprop, which in
>> turn might reduce the cost of having a post-RA instance of the new pass.
>>
>> On most targets, the pass is enabled by default at -O2 and above.
>> However, it has a tendency to undo x86's STV and RPAD passes,
>> by folding the more complex post-STV/RPAD form back into the
>> simpler pre-pass form.
>>
>> Also, running a pass after register allocation means that we can
>> now match define_insn_and_splits that were previously only matched
>> before register allocation. This trips things like:
>>
>> (define_insn_and_split "..."
>>   [...pattern...]
>>   "...cond..."
>>   "#"
>>   "&& 1"
>>   [...pattern...]
>> {
>>   ...unconditional use of gen_reg_rtx ()...;
>> }
>>
>> because matching and splitting after RA will call gen_reg_rtx when
>> pseudos are no longer allowed. rs6000 has several instances of this.
>
> xtensa also has something like that.
>
>> xtensa has a variation in which the split condition is:
>>
>> "&& can_create_pseudo_p ()"
>>
>> The failure then is that, if we match after RA, we'll never be
>> able to split the instruction.
>
> To be honest, I'm confused by the possibility of adding a split pattern
> application opportunity that depends on the optimization options after
> Rel... ah, LRA and before the existing rtl-split2.
>
> That is because I just recently submitted a patch that I expected would
> reliably (i.e. regardless of optimization options, etc.) first apply the
> split pattern in the rtl-split2 pass after RA, and it was merged.
>
>>
>> The patch therefore disables the pass by default on i386, rs6000
>> and xtensa. Hopefully we can fix those ports later (if their
>> maintainers want). It seems easier to add the pass first, though,
>> to make it easier to test any such fixes.
>>
>> gcc.target/aarch64/bitfield-bitint-abi-align{16,8}.c would need
>> quite a few updates for the late-combine output. That might be
>> worth doing, but it seems too complex to do as part of this patch.
>>
>> I tried compiling at least one target per CPU directory and comparing
>> the assembly output for parts of the GCC testsuite. This is just a way
>> of getting a flavour of how the pass performs; it obviously isn't a
>> meaningful benchmark. All targets seemed to improve on average:
>>
>> Target Tests Good Bad %Good Delta Median
>> ====== ===== ==== === ===== ===== ======
>> aarch64-linux-gnu 2215 1975 240 89.16% -4159 -1
>> aarch64_be-linux-gnu 1569 1483 86 94.52% -10117 -1
>> alpha-linux-gnu 1454 1370 84 94.22% -9502 -1
>> amdgcn-amdhsa 5122 4671 451 91.19% -35737 -1
>> arc-elf 2166 1932 234 89.20% -37742 -1
>> arm-linux-gnueabi 1953 1661 292 85.05% -12415 -1
>> arm-linux-gnueabihf 1834 1549 285 84.46% -11137 -1
>> avr-elf 4789 4330 459 90.42% -441276 -4
>> bfin-elf 2795 2394 401 85.65% -19252 -1
>> bpf-elf 3122 2928 194 93.79% -8785 -1
>> c6x-elf 2227 1929 298 86.62% -17339 -1
>> cris-elf 3464 3270 194 94.40% -23263 -2
>> csky-elf 2915 2591 324 88.89% -22146 -1
>> epiphany-elf 2399 2304 95 96.04% -28698 -2
>> fr30-elf 7712 7299 413 94.64% -99830 -2
>> frv-linux-gnu 3332 2877 455 86.34% -25108 -1
>> ft32-elf 2775 2667 108 96.11% -25029 -1
>> h8300-elf 3176 2862 314 90.11% -29305 -2
>> hppa64-hp-hpux11.23 4287 4247 40 99.07% -45963 -2
>> ia64-linux-gnu 2343 1946 397 83.06% -9907 -2
>> iq2000-elf 9684 9637 47 99.51% -126557 -2
>> lm32-elf 2681 2608 73 97.28% -59884 -3
>> loongarch64-linux-gnu 1303 1218 85 93.48% -13375 -2
>> m32r-elf 1626 1517 109 93.30% -9323 -2
>> m68k-linux-gnu 3022 2620 402 86.70% -21531 -1
>> mcore-elf 2315 2085 230 90.06% -24160 -1
>> microblaze-elf 2782 2585 197 92.92% -16530 -1
>> mipsel-linux-gnu 1958 1827 131 93.31% -15462 -1
>> mipsisa64-linux-gnu 1655 1488 167 89.91% -16592 -2
>> mmix 4914 4814 100 97.96% -63021 -1
>> mn10300-elf 3639 3320 319 91.23% -34752 -2
>> moxie-rtems 3497 3252 245 92.99% -87305 -3
>> msp430-elf 4353 3876 477 89.04% -23780 -1
>> nds32le-elf 3042 2780 262 91.39% -27320 -1
>> nios2-linux-gnu 1683 1355 328 80.51% -8065 -1
>> nvptx-none 2114 1781 333 84.25% -12589 -2
>> or1k-elf 3045 2699 346 88.64% -14328 -2
>> pdp11 4515 4146 369 91.83% -26047 -2
>> pru-elf 1585 1245 340 78.55% -5225 -1
>> riscv32-elf 2122 2000 122 94.25% -101162 -2
>> riscv64-elf 1841 1726 115 93.75% -49997 -2
>> rl78-elf 2823 2530 293 89.62% -40742 -4
>> rx-elf 2614 2480 134 94.87% -18863 -1
>> s390-linux-gnu 1591 1393 198 87.55% -16696 -1
>> s390x-linux-gnu 2015 1879 136 93.25% -21134 -1
>> sh-linux-gnu 1870 1507 363 80.59% -9491 -1
>> sparc-linux-gnu 1123 1075 48 95.73% -14503 -1
>> sparc-wrs-vxworks 1121 1073 48 95.72% -14578 -1
>> sparc64-linux-gnu 1096 1021 75 93.16% -15003 -1
>> v850-elf 1897 1728 169 91.09% -11078 -1
>> vax-netbsdelf 3035 2995 40 98.68% -27642 -1
>> visium-elf 1392 1106 286 79.45% -7984 -2
>> xstormy16-elf 2577 2071 506 80.36% -13061 -1
>>
>> ** snip **
>
> To be more frank, once a split pattern is defined, it is applied by
> the existing five split paths and possibly by combiners. In most cases,
> it is enough to apply it in just one of those places, or at least that
> is what the pattern's author intended.
>
> Wouldn't applying the split pattern indiscriminately in various places
> be a waste of execution resources and bring about unexpected and undesired
> results?
>
> I think we need some way to properly control the application of the split
> pattern, perhaps some predicate function.

The problem is more the define_insn part of the define_insn_and_split,
rather than the define_split part. The number and location of the split
passes is the same: anything matched by rtl-late_combine1 will be split by
rtl-split1 and anything matched by rtl-late_combine2 will be split by
rtl-split2. (If the split condition allows it, of course.)

But more things can be matched by rtl-late_combine2 than are matched by
other post-RA passes like rtl-postreload. And that's what causes the
issue. If:
(define_insn_and_split "..."
  [...pattern...]
  "...cond..."
  "#"
  "&& 1"
  [...pattern...]
{
  ...unconditional use of gen_reg_rtx ()...;
}
is matched by rtl-late_combine2, the split will be done by rtl-split2.
But the split will ICE, because it isn't valid to call gen_reg_rtx after
register allocation.

Similarly, if:
(define_insn_and_split "..."
  [...pattern...]
  "...cond..."
  "#"
  "&& can_create_pseudo_p ()"
  [...pattern...]
{
  ...unconditional use of gen_reg_rtx ()...;
}
is matched by rtl-late_combine2, the can_create_pseudo_p condition will
be false in rtl-split2, and in all subsequent split passes. So we'll
still have the unsplit instruction during final, which will ICE because
it doesn't have a valid means of implementing the "#".
The traditional (and IMO correct) way to handle this is to make the
pattern reserve the temporary registers that it needs, using match_scratches.
rs6000 has many examples of this. E.g.:
(define_insn_and_split "@ieee_128bit_vsx_neg<mode>2"
  [(set (match_operand:IEEE128 0 "register_operand" "=wa")
        (neg:IEEE128 (match_operand:IEEE128 1 "register_operand" "wa")))
   (clobber (match_scratch:V16QI 2 "=v"))]
  "TARGET_FLOAT128_TYPE && !TARGET_FLOAT128_HW"
  "#"
  "&& 1"
  [(parallel [(set (match_dup 0)
                   (neg:IEEE128 (match_dup 1)))
              (use (match_dup 2))])]
{
  if (GET_CODE (operands[2]) == SCRATCH)
    operands[2] = gen_reg_rtx (V16QImode);
  emit_insn (gen_ieee_128bit_negative_zero (operands[2]));
}
  [(set_attr "length" "8")
   (set_attr "type" "vecsimple")])

Before RA, this is just:
(set ...)
(clobber (scratch:V16QI))
and the split creates a new register. After RA, operand 2 provides
the required temporary register:
(set ...)
(clobber (reg:V16QI TMP))

Another approach is to add can_create_pseudo_p () to the define_insn
condition (rather than the split condition). But IMO that's an ICE
trap, since insns that have already been matched & accepted shouldn't
suddenly become invalid if recog is reattempted later.
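For concreteness, that alternative would look something like the following,
a sketch in the same elided style as the patterns above rather than a real
insn; the trap is that an instruction matched and accepted before RA would
fail to match if recog were rerun after RA:

```
(define_insn_and_split "..."
  [...pattern...]
  "...cond... && can_create_pseudo_p ()"  ; insn condition, not split condition
  "#"
  "&& 1"
  [...pattern...]
{
  ...unconditional use of gen_reg_rtx ()...;
})
```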
Thanks,
Richard