Takayuki 'January June' Suwa <[email protected]> writes:
> On 2024/06/20 22:34, Richard Sandiford wrote:
>> This patch adds a combine pass that runs late in the pipeline.
>> There are two instances: one between combine and split1, and one
>> after postreload.
>>
>> The pass currently has a single objective: remove definitions by
>> substituting into all uses. The pre-RA version tries to restrict
>> itself to cases that are likely to have a neutral or beneficial
>> effect on register pressure.
>>
>> The patch fixes PR106594. It also fixes a few FAILs and XFAILs
>> in the aarch64 test results, mostly due to making proper use of
>> MOVPRFX in cases where we didn't previously.
>>
>> This is just a first step. I'm hoping that the pass could be
>> used for other combine-related optimisations in future. In particular,
>> the post-RA version doesn't need to restrict itself to cases where all
>> uses are substitutable, since it doesn't have to worry about register
>> pressure. If we did that, and if we extended it to handle multi-register
>> REGs, the pass might be a viable replacement for regcprop, which in
>> turn might reduce the cost of having a post-RA instance of the new pass.
>>
>> On most targets, the pass is enabled by default at -O2 and above.
>> However, it has a tendency to undo x86's STV and RPAD passes,
>> by folding the more complex post-STV/RPAD form back into the
>> simpler pre-pass form.
>>
>> Also, running a pass after register allocation means that we can
>> now match define_insn_and_splits that were previously only matched
>> before register allocation. This trips things like:
>>
>> (define_insn_and_split "..."
>>   [...pattern...]
>>   "...cond..."
>>   "#"
>>   "&& 1"
>>   [...pattern...]
>> {
>>   ...unconditional use of gen_reg_rtx ()...;
>> }
>>
>> because matching and splitting after RA will call gen_reg_rtx when
>> pseudos are no longer allowed. rs6000 has several instances of this.
>
> xtensa also has something like that.
>
>> xtensa has a variation in which the split condition is:
>>
>> "&& can_create_pseudo_p ()"
>>
>> The failure then is that, if we match after RA, we'll never be
>> able to split the instruction.
>
> To be honest, I'm confused by the possibility of adding a split pattern
> application opportunity that depends on the optimization options after
> Rel... ah, LRA and before the existing rtl-split2.
>
> That is because I just recently submitted a patch that I expected would
> reliably (i.e. regardless of optimization options, etc.) first apply the
> split pattern in the rtl-split2 pass after RA, and it was merged.
>
>>
>> The patch therefore disables the pass by default on i386, rs6000
>> and xtensa. Hopefully we can fix those ports later (if their
>> maintainers want). It seems easier to add the pass first, though,
>> to make it easier to test any such fixes.
>>
>> gcc.target/aarch64/bitfield-bitint-abi-align{16,8}.c would need
>> quite a few updates for the late-combine output. That might be
>> worth doing, but it seems too complex to do as part of this patch.
>>
>> I tried compiling at least one target per CPU directory and comparing
>> the assembly output for parts of the GCC testsuite. This is just a way
>> of getting a flavour of how the pass performs; it obviously isn't a
>> meaningful benchmark. All targets seemed to improve on average:
>>
>> Target Tests Good Bad %Good Delta Median
>> ====== ===== ==== === ===== ===== ======
>> aarch64-linux-gnu 2215 1975 240 89.16% -4159 -1
>> aarch64_be-linux-gnu 1569 1483 86 94.52% -10117 -1
>> alpha-linux-gnu 1454 1370 84 94.22% -9502 -1
>> amdgcn-amdhsa 5122 4671 451 91.19% -35737 -1
>> arc-elf 2166 1932 234 89.20% -37742 -1
>> arm-linux-gnueabi 1953 1661 292 85.05% -12415 -1
>> arm-linux-gnueabihf 1834 1549 285 84.46% -11137 -1
>> avr-elf 4789 4330 459 90.42% -441276 -4
>> bfin-elf 2795 2394 401 85.65% -19252 -1
>> bpf-elf 3122 2928 194 93.79% -8785 -1
>> c6x-elf 2227 1929 298 86.62% -17339 -1
>> cris-elf 3464 3270 194 94.40% -23263 -2
>> csky-elf 2915 2591 324 88.89% -22146 -1
>> epiphany-elf 2399 2304 95 96.04% -28698 -2
>> fr30-elf 7712 7299 413 94.64% -99830 -2
>> frv-linux-gnu 3332 2877 455 86.34% -25108 -1
>> ft32-elf 2775 2667 108 96.11% -25029 -1
>> h8300-elf 3176 2862 314 90.11% -29305 -2
>> hppa64-hp-hpux11.23 4287 4247 40 99.07% -45963 -2
>> ia64-linux-gnu 2343 1946 397 83.06% -9907 -2
>> iq2000-elf 9684 9637 47 99.51% -126557 -2
>> lm32-elf 2681 2608 73 97.28% -59884 -3
>> loongarch64-linux-gnu 1303 1218 85 93.48% -13375 -2
>> m32r-elf 1626 1517 109 93.30% -9323 -2
>> m68k-linux-gnu 3022 2620 402 86.70% -21531 -1
>> mcore-elf 2315 2085 230 90.06% -24160 -1
>> microblaze-elf 2782 2585 197 92.92% -16530 -1
>> mipsel-linux-gnu 1958 1827 131 93.31% -15462 -1
>> mipsisa64-linux-gnu 1655 1488 167 89.91% -16592 -2
>> mmix 4914 4814 100 97.96% -63021 -1
>> mn10300-elf 3639 3320 319 91.23% -34752 -2
>> moxie-rtems 3497 3252 245 92.99% -87305 -3
>> msp430-elf 4353 3876 477 89.04% -23780 -1
>> nds32le-elf 3042 2780 262 91.39% -27320 -1
>> nios2-linux-gnu 1683 1355 328 80.51% -8065 -1
>> nvptx-none 2114 1781 333 84.25% -12589 -2
>> or1k-elf 3045 2699 346 88.64% -14328 -2
>> pdp11 4515 4146 369 91.83% -26047 -2
>> pru-elf 1585 1245 340 78.55% -5225 -1
>> riscv32-elf 2122 2000 122 94.25% -101162 -2
>> riscv64-elf 1841 1726 115 93.75% -49997 -2
>> rl78-elf 2823 2530 293 89.62% -40742 -4
>> rx-elf 2614 2480 134 94.87% -18863 -1
>> s390-linux-gnu 1591 1393 198 87.55% -16696 -1
>> s390x-linux-gnu 2015 1879 136 93.25% -21134 -1
>> sh-linux-gnu 1870 1507 363 80.59% -9491 -1
>> sparc-linux-gnu 1123 1075 48 95.73% -14503 -1
>> sparc-wrs-vxworks 1121 1073 48 95.72% -14578 -1
>> sparc64-linux-gnu 1096 1021 75 93.16% -15003 -1
>> v850-elf 1897 1728 169 91.09% -11078 -1
>> vax-netbsdelf 3035 2995 40 98.68% -27642 -1
>> visium-elf 1392 1106 286 79.45% -7984 -2
>> xstormy16-elf 2577 2071 506 80.36% -13061 -1
>>
>> ** snip **
>
> To be more frank, once a split pattern is defined, it is applied by
> the existing five split paths and possibly by combiners. In most cases,
> it is enough to apply it in just one of those places, or at least that
> is what the pattern's author intended.
>
> Wouldn't applying the split pattern indiscriminately in various places
> be a waste of execution resources and bring about unexpected and undesired
> results?
>
> I think we need some way to properly control the application of the split
> pattern, perhaps some predicate function.

The problem is more the define_insn part of the define_insn_and_split,
rather than the define_split part. The number and location of the split
passes is the same: anything matched by rtl-late_combine1 will be split by
rtl-split1 and anything matched by rtl-late_combine2 will be split by
rtl-split2. (If the split condition allows it, of course.)

But more things can be matched by rtl-late_combine2 than are matched by
other post-RA passes like rtl-postreload. And that's what causes the
issue. If:
(define_insn_and_split "..."
  [...pattern...]
  "...cond..."
  "#"
  "&& 1"
  [...pattern...]
{
  ...unconditional use of gen_reg_rtx ()...;
}
is matched by rtl-late_combine2, the split will be done by rtl-split2.
But the split will ICE, because it isn't valid to call gen_reg_rtx after
register allocation.

Similarly, if:
(define_insn_and_split "..."
  [...pattern...]
  "...cond..."
  "#"
  "&& can_create_pseudo_p ()"
  [...pattern...]
{
  ...unconditional use of gen_reg_rtx ()...;
}
is matched by rtl-late_combine2, the can_create_pseudo_p condition will
be false in rtl-split2, and in all subsequent split passes. So we'll
still have the unsplit instruction during final, which will ICE because
it doesn't have a valid means of implementing the "#".
The traditional (and IMO correct) way to handle this is to make the
pattern reserve the temporary registers that it needs, using match_scratches.
rs6000 has many examples of this. E.g.:
(define_insn_and_split "@ieee_128bit_vsx_neg<mode>2"
  [(set (match_operand:IEEE128 0 "register_operand" "=wa")
        (neg:IEEE128 (match_operand:IEEE128 1 "register_operand" "wa")))
   (clobber (match_scratch:V16QI 2 "=v"))]
  "TARGET_FLOAT128_TYPE && !TARGET_FLOAT128_HW"
  "#"
  "&& 1"
  [(parallel [(set (match_dup 0)
                   (neg:IEEE128 (match_dup 1)))
              (use (match_dup 2))])]
{
  if (GET_CODE (operands[2]) == SCRATCH)
    operands[2] = gen_reg_rtx (V16QImode);
  emit_insn (gen_ieee_128bit_negative_zero (operands[2]));
}
  [(set_attr "length" "8")
   (set_attr "type" "vecsimple")])

Before RA, this is just:
(set ...)
(clobber (scratch:V16QI))
and the split creates a new register. After RA, operand 2 provides
the required temporary register:
(set ...)
(clobber (reg:V16QI TMP))

Another approach is to add can_create_pseudo_p () to the define_insn
condition (rather than the split condition). But IMO that's an ICE
trap, since insns that have already been matched & accepted shouldn't
suddenly become invalid if recog is reattempted later.
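For concreteness, that alternative would look something like the following,
a sketch in the same elided style as the patterns above rather than a real
insn; the trap is that an instruction matched and accepted before RA would
fail to match if recog were rerun after RA:

```
(define_insn_and_split "..."
  [...pattern...]
  "...cond... && can_create_pseudo_p ()"  ; insn condition, not split condition
  "#"
  "&& 1"
  [...pattern...]
{
  ...unconditional use of gen_reg_rtx ()...;
})
```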
Thanks,
Richard