Re: Add a new combine pass

Richard Sandiford Wed, 20 Nov 2019 10:21:37 -0800

Segher Boessenkool <seg...@kernel.crashing.org> writes:
>>       /* Make sure that the value that is to be substituted for the register
>>       does not use any registers whose values alter in between.  However,
>>       If the insns are adjacent, a use can't cross a set even though we
>>       think it might (this can happen for a sequence of insns each setting
>>       the same destination; last_set of that register might point to
>>       a NOTE).  If INSN has a REG_EQUIV note, the register is always
>>       equivalent to the memory so the substitution is valid even if there
>>       are intervening stores.  Also, don't move a volatile asm or
>>       UNSPEC_VOLATILE across any other insns.  */
>>       || (! all_adjacent
>>        && (((!MEM_P (src)
>>              || ! find_reg_note (insn, REG_EQUIV, src))
>>             && modified_between_p (src, insn, i3))
>>            || (GET_CODE (src) == ASM_OPERANDS && MEM_VOLATILE_P (src))
>>            || GET_CODE (src) == UNSPEC_VOLATILE))
>
> So this would work if you had pseudos here, instead of the hard reg?
> Because it is a hard reg it is the same number in both places, making it
> hard to move.


Yeah, probably.  But the hard reg is a critical part of this.
Going back to the example:

(set (reg/v:VNx16BI 102 [ ok ])
     (reg:VNx16BI 85 ffrt))
(set (reg:VNx16BI 85 ffrt)
     (unspec:VNx16BI [(reg:VNx16BI 85 ffrt)] UNSPEC_UPDATE_FFRT))
(set (reg:CC_NZC 66 cc)
     (unspec:CC_NZC
       [(reg:VNx16BI 106) repeated x2
        (const_int 1 [0x1])
        (reg/v:VNx16BI 102 [ ok ])] UNSPEC_PTEST))

FFR is the real first fault register.  FFRT is actually a fake register
whose only purpose is to describe the dependencies (in rtl) between writes
to the FFR, reads from the FFR and first-faulting loads.  The whole scheme
depends on having only one fixed FFRT register.

>> > How are dependencies represented in your new pass?  If it just does
>> > walks over the insn stream for everything, you get quadratic complexity
>> > if you move insns backwards.  We have that in combine already, mostly
>> > from modified_between_p, but that is limited because of how LOG_LINKS
>> > work, and we have been doing this for so long and there are no problems
>> > found with it, so it must work in practice.  But I am worried about it
>> > when moving insns back an unlimited distance.
>> 
>> It builds def-use chains, but using a constant limit on the number of
>> explicitly-recorded uses.  All other uses go in a numerical live range
>> from which they (conservatively) never escape.  The def-use chains
>> represent memory as a single entity, a bit like in gimple.
>
> Ah.  So that range thing ensures correctness.

Yeah.

> Why don't you use DF for the DU chains?

The problem with DF_DU_CHAIN is that it's quadratic in the worst case.
fwprop.c gets around that by using the MD problem and having its own
dominator walker to calculate limited def-use chains:

  /* We use the multiple definitions problem to compute our restricted
     use-def chains.  */

So taking that approach here would still require some amount of
roll-your-own.  Other reasons are:

* Even what fwprop does is more elaborate than we need for now.

* We need to handle memory too, and it's nice to be able to handle
  it in the same way as registers.

* Updating a full, ordered def-use chain after a move is a linear-time
  operation, so whatever happens, we'd need to apply some kind of limit
  on the number of uses we maintain, with something like that integer
  point range for the rest.

* Once we've analysed the insn and built its def-use chains, we don't
  look at the df_refs again until we update the chains after a successful
  combination.  So it should be more efficient to maintain a small array
  of insn_info_rec pointers alongside the numerical range, rather than
  walk and pollute chains of df_refs and then link back the insn uids
  to the pass-local info.

>> >> - it tries using REG_EQUAL notes for the final instruction.
>> >
>> > And that.
>> 
>> I meant REG_EQUAL notes on i3, i.e. it tries replacing the src of i3
>> with i3's REG_EQUAL note and combining into that.  Does combine do that?
>> I couldn't see it, and in:
>> 
>>    https://gcc.gnu.org/ml/gcc/2019-06/msg00148.html
>> 
>> you seemed to reject the idea of allowing it.
>
> Yes, I still do.  Do you have an example where it helps?

I'll run another set of tests for that.

>> >> - it can parallelise two independent instructions that both read from
>> >>   the same register or both read from memory.
>> >
>> > That only if somehow there is a link between the two (so essentially
>> > never).  The only combinations tried by combine are those via LOG_LINKs,
>> > which are between a SET and the first corresponding use.  This is a key
>> > factor that makes it kind of linear (instead of exponential) complexity.
>> 
>> Tracking limited def-use chains is what makes this last bit easy.
>> We can just try parallelising two instructions from the (bounded) list
>> of uses.  And for this case there's not any garbage rtl involved, since
>> we reuse the same PARALLEL rtx between attempts.  The cost is basically
>> all in the recog call (which would obviously mount up if we went
>> overboard).
>
> *All* examples above and below are just this.

Yeah, the powerpc and s390x examples were.  The motivating FFR example
above isn't though: it's a def-use combination in parallel with the
existing definition.

> If you disable everything else, what do the statistics look like then?

Had no idea how this would turn out -- which is a good sign it was
worth doing -- but: results below.

>> > One thing I want to do is some mini-combine after every split, probably
>> > only with the insns new from the split.  But we have no cfglayout mode
>> > anymore then, and only hard regs (except in the first split pass, which
>> > is just a little later than your new pass).
>> 
>> Yeah, sounds like it could be useful.  I guess there'd need to be
>> an extra condition on the combination that the new insn can't be
>> immediately split.
>
> It would run *after* split.  Not interleaved with it.

Yeah.  But what I meant was: a lot of insns that are split after
reload are combined for RA purposes and the split form is really
the preferred form (especially for scheduling).  So if we have
a combine pass *after* split, I think it should avoid using any
combination that matches a split.

>> > And amount of garbage produced?
>> 
>> If -ftime-report stats are accurate, then the total amount of
>> memory allocated is:
>> 
>> run-combine=2 (normal combine): 1793 kB
>> run-combine=4 (new pass only):    98 kB
>> run-combine=6 (both passes):    1871 kB (new pass accounts for 78 kB)
>> 
>> But again that's not a fair comparison when the main combine pass does more.
>
> The way combine does SUBST is pretty fundamental to how it works (it can
> be ripped out, and probably we'll have to at some point, but that will be
> very invasive).  Originally all this temporary RTL was on obstacks and
> reaping it was cheap, but everything is GCed now (fixing the bugs was not
> cheap :-) )

Yeah, I remember :-)

> If you look at even really bad cases, combine is still only a few
> percent of total, so it isn't too bad.
>
>> I did try hard to keep the amount of garbage rtl down though.  This is
>> why I added validate_simplify_replace_rtx rather than trying to make
>> do with existing routines.  It should only create new rtl if the
>> simplification routines did something useful.  (Of course, that's mostly
>> true of combine as well, but things like the make_compound_operation/
>> expand_compound_operation wrangler can create expressions that are never
>> actually useful.)
>
> Don't mention those, thanks :-)
>
>> >> To get a feel for the effect on multiple targets, I did my usual
>> >> bogo-comparison of number of lines of asm for gcc.c-torture, gcc.dg
>> >> and g++.dg, this time comparing run-combine=2 and run-combine=6
>> >> using -O2 -ftree-vectorize:
>> >
>> > One problem with this is that these are very short functions on average.
>> 
>> There are some long ones too :-)
>
> Yes, but this isn't a good stand-in for representative programs.

Right.  And number of lines of asm isn't a good stand-in for anything much.
Like I say, the whole thing is just to get a feel, on tests that are readily
to hand and are easy to compile without a full toolchain.

>> > What is the kind of changes you see for other targets?
>> 
>> On powerpc64le-linux-gnu it mostly comes from eliminating comparisons
>> in favour of other flag-setting instructions and making more use of
>> post-increments.  Not sure the last one is actually a win, but the
>> target costs say it's OK :-).  E.g. from gcc.c-torture/execute/pr78675.c:
>> 
>> @@ -48,9 +48,8 @@
>>         blr
>>         .align 4
>>  .L19:
>> -       cmpdi 0,10,0
>> +       mr. 9,10
>>         mr 3,8
>> -       mr 9,10
>>         bne 0,.L9
>>         b .L3
>>         .align 4
>
> Okay, so this combining two uses of r10 into one insn.
>
> This isn't necessarily a good idea: the combined insn cannot be moved as
> much as one of its components could, which can also immediately prevent
> further combinations.
>
> But doing this after combine, as you do, is probably beneficial.
>
>> and a slightly more interesting example in gcc.c-torture/execute/loop-6.c:
>
> This is the same thing (we do andi. a,b,0xff instead of rlwinm. a,b,0,0xff
> because this is cheaper on p7 and p8).
>
>> gcc.c-torture/execute/20081218-1.c is an example where we make more use
>> of post-increment:
>> 
>>  .L9:
>> -       lbz 10,1(9)
>> -       addi 9,9,1
>> +       lbzu 10,1(9)
>>         cmpwi 0,10,38
>>         bne 0,.L8
>> -       lbz 10,1(9)
>> -       addi 9,9,1
>> +       lbzu 10,1(9)
>>         cmpwi 0,10,38
>>         bne 0,.L8
>>         bdnz .L9
>
> Pre-increment (we only *have* pre-modify memory accesses).

Oops, yes.

>>   /* Mimic combine's behavior by not combining moves from allocatable hard
>>      registers (e.g. when copying parameters or function return values).  */
>>   if (REG_P (src) && HARD_REGISTER_P (src) && !fixed_regs[REGNO (src)])
>>     return false;
>> 
>> Although if that could have accounted for the difference, it sounds like
>> we're leaving a lot on the table by doing this :-)
>
> It actually helps (and quite a bit).  But if your test cases are mainly
> tiny functions, anything can happen.  But since you see this across all
> targets, it must be doing something good :-)
>
>
> So I'd love to see statistics for *only* combining two uses of the same
> thing, this is something combine cannot do, and arguably *shouldn't* do!

OK, here are two sets of results.  The first is for:

(A) --param run-combine=2 (current combine only)
(B) --param run-combine=6 (both passes), use-use combinations only

Target                 Tests   Delta    Best   Worst  Median
======                 =====   =====    ====   =====  ======
aarch64-linux-gnu        158    3060     -72     520      -1
aarch64_be-linux-gnu     111      24     -57     324      -1
alpha-linux-gnu            3       3       1       1       1
amdgcn-amdhsa             18      71     -17      26       1
arc-elf                  310   -4414   -1516     356       1
arm-linux-gnueabi         28     -50     -13       3      -1
arm-linux-gnueabihf       28     -50     -13       3      -1
avr-elf                   26     308      -1      36      12
bfin-elf                   6       8      -1       3       1
bpf-elf                   10      21      -1       6       1
c6x-elf                    7       9      -6       6       1
cr16-elf                  13     102       1      27       2
cris-elf                  35   -1001    -700       3      -2
csky-elf                   9      28       1       6       2
epiphany-elf              29     -29      -2       1      -1
fr30-elf                  12      17      -1       5       1
frv-linux-gnu              1      -2      -2      -2      -2
ft32-elf                  10      22      -1       5       2
h8300-elf                 29      56     -22      14       2
hppa64-hp-hpux11.23        9      17      -1       4       2
i686-apple-darwin         10     -33     -20      12      -2
i686-pc-linux-gnu         41     243     -12      33       3
ia64-linux-gnu            28     -32     -29      39      -4
iq2000-elf                 6       8       1       2       1
lm32-elf                  10      12      -3       5       1
m32r-elf                   3       2      -2       2       2
m68k-linux-gnu            19      27      -2       5       2
mcore-elf                 14      23     -10       6       2
microblaze-elf             5       5       1       1       1
mipsel-linux-gnu           9      12      -5       6       2
mipsisa64-linux-gnu        7       1      -3       1       1
mmix                       6       6      -2       4       1
mn10300-elf               20      15      -4       5       1
moxie-rtems                8      11      -2       3       1
msp430-elf                 8      24       1       6       2
nds32le-elf               91    -188     -24     136      -1
nios2-linux-gnu            2       6       1       5       1
nvptx-none               396     756       1      16       1
or1k-elf                   8      20       1       4       2
pdp11                     65     149     -10      45       2
powerpc-ibm-aix7.0      1039    1114    -366    2124      -1
powerpc64-linux-gnu      854    2753    -274    3094      -2
powerpc64le-linux-gnu    648    -551    -340     208      -1
pru-elf                    5       5      -2       3       1
riscv32-elf                7       6      -2       5       1
riscv64-elf                2       5       2       3       2
rl78-elf                  80    -648     -98      13      -4
rx-elf                    16       2      -4       5      -1
s390-linux-gnu            60    -174     -39      14      -1
s390x-linux-gnu          152    -781    -159      14      -1
sh-linux-gnu              13       5     -15       7       1
sparc-linux-gnu           29       7      -3      11      -1
sparc64-linux-gnu         51       1      -8      15      -1
tilepro-linux-gnu        119    -567    -164      15      -2
v850-elf                   4       4      -1       3       1
vax-netbsdelf             10      13      -4       5       1
visium-elf                 4       0      -5       3       1
x86_64-darwin              7     -12      -9       4      -2
x86_64-linux-gnu           6     -11      -6       4      -2
xstormy16-elf             10      13       1       2       1
xtensa-elf                 6       8      -1       2       2

which definitely shows up some outliers I need to look at.  The second
set is for:

(B) --param run-combine=6 (both passes), use-use combinations only
(C) --param run-combine=6 (both passes), no restrictions

Target                 Tests   Delta    Best   Worst  Median
======                 =====   =====    ====   =====  ======
aarch64-linux-gnu        272   -3844    -585      18      -1
aarch64_be-linux-gnu     190   -3336    -370      18      -1
alpha-linux-gnu          401   -2735    -370      22      -2
amdgcn-amdhsa            188    1867    -484    1259      -1
arc-elf                  257   -1498    -650      54      -1
arm-linux-gnueabi        168   -1117    -612     680      -1
arm-linux-gnueabihf      168   -1117    -612     680      -1
avr-elf                 1341 -111401  -13824     680     -10
bfin-elf                1346  -18950   -8461     465      -2
bpf-elf                   63    -496     -60       3      -2
c6x-elf                  179  -10527  -10084      41      -2
cr16-elf                1616  -51479  -10657      42     -13
cris-elf                 113    -533     -84       4      -2
csky-elf                 129   -3399    -474       1      -2
epiphany-elf             151    -375    -149      84      -1
fr30-elf                 155   -1773    -756     289      -2
frv-linux-gnu            808  -13332   -2074      67      -1
ft32-elf                 276   -1688    -111      -1      -2
h8300-elf                527  -11522   -1747      68      -3
hppa64-hp-hpux11.23      179    -865    -142      34      -1
i686-apple-darwin        335   -1266     -56      44      -1
i686-pc-linux-gnu        222   -2216    -556      32      -1
ia64-linux-gnu           122   -4793   -1134      40      -5
iq2000-elf               171   -1341     -61       3      -2
lm32-elf                 187   -1814    -316      47      -2
m32r-elf                  70    -597     -98      11      -2
m68k-linux-gnu           197   -2375    -332     148      -2
mcore-elf                125   -1236    -146       7      -1
microblaze-elf           442   -4498   -2094      32      -2
mipsel-linux-gnu         125   -2050    -222      60      -2
mipsisa64-linux-gnu      107   -2015    -130      14      -2
mmix                     103    -239     -26       4      -1
mn10300-elf              215   -1039    -234      80      -1
moxie-rtems              149    -754     -79       4      -2
msp430-elf               180    -600     -63      19      -1
nds32le-elf              183    -287     -37      32      -1
nios2-linux-gnu           81    -329     -66       4      -1
nvptx-none               200   -1882    -208      -2      -2
or1k-elf                  57    -317     -25       2      -1
pdp11                    207   -1441    -182      83      -2
powerpc-ibm-aix7.0       400   -4145    -271      14      -2
powerpc64-linux-gnu      375   -2062    -160     117      -2
powerpc64le-linux-gnu    491   -4169    -700     156      -2
pru-elf                   47   -7020   -6921       6      -1
riscv32-elf               59   -1379    -139       7      -2
riscv64-elf               89   -1562    -264       7      -1
rl78-elf                 289  -16157   -1665      42      -6
rx-elf                    82    -195     -53       8      -1
s390-linux-gnu           128   -2108   -1485      63      -1
s390x-linux-gnu          112     418     -32     522      -1
sh-linux-gnu             218    -410    -108      68      -1
sparc-linux-gnu          141    -866     -99      18      -1
sparc64-linux-gnu        129    -792    -102       3      -2
tilepro-linux-gnu        953   -4331    -297     332      -2
v850-elf                  50    -412     -53       2      -3
vax-netbsdelf            254   -3328    -400       4      -2
visium-elf               100    -693    -138      16      -1
x86_64-darwin            345   -2134    -490      72      -1
x86_64-linux-gnu         307    -843    -288     210      -1
xstormy16-elf            218    -788    -156      59      -1
xtensa-elf               195   -1426    -322      36       1

So the main benefit does seem to come from the def-use part.

Here are some powerpc64le-linux-gnu examples of (B)->(C):

gcc.c-torture/execute/20171008-1.c:

@@ -79,8 +79,7 @@
        stdu 1,-32(1)
 .LCFI5:
        bl foo
-       rlwinm 3,3,0,0xff
-       cmpwi 0,3,0
+       andi. 9,3,0xff
        bne 0,.L13
        addi 1,1,32

gcc.c-torture/execute/pr28982a.c:

@@ -427,15 +427,13 @@
        stxvd2x 0,7,6
        .align 4
 .L9:
-       xxlor 12,32,32
+       xvcvsxwsp 0,32
        vadduwm 0,0,1
        addi 10,9,16
-       xvcvsxwsp 0,12
-       xxlor 12,32,32
+       stxvd2x 0,0,9
+       xvcvsxwsp 0,32
+       addi 9,9,32
        vadduwm 0,0,1
-       stxvd2x 0,0,9
-       xvcvsxwsp 0,12
-       addi 9,9,32
        stxvd2x 0,0,10
        bdnz .L9
        li 3,4

(Disclaimer: I have no idea if that's correct.)

gcc.c-torture/execute/pr65215-3.c:

@@ -56,11 +56,10 @@
        srdi 10,3,32
        srdi 9,3,56
        slwi 6,10,24
-       srwi 7,10,8
+       rlwinm 7,10,24,16,23
        or 9,9,6
-       rlwinm 7,7,0,16,23
+       rlwinm 10,10,8,8,15
        or 9,9,7
-       rlwinm 10,10,8,8,15
        or 9,9,10
        cmpw 0,9,8
        bne 0,.L4

Just to emphasise though: I'm not proposing that we switch this on for
all targets yet.  It would be opt-in until the pass is more mature.
But that FFR case is really important for the situation it handles.

Thanks,
Richard

Re: Add a new combine pass

Reply via email to