Re: Add a new combine pass

Segher Boessenkool Wed, 20 Nov 2019 12:46:26 -0800

On Wed, Nov 20, 2019 at 06:20:34PM +0000, Richard Sandiford wrote:
> Segher Boessenkool <seg...@kernel.crashing.org> writes:
> > So this would work if you had pseudos here, instead of the hard reg?
> > Because it is a hard reg it is the same number in both places, making it
> > hard to move.
> 
> Yeah, probably.  But the hard reg is a critical part of this.
> Going back to the example:
> 
> (set (reg/v:VNx16BI 102 [ ok ])
>      (reg:VNx16BI 85 ffrt))
> (set (reg:VNx16BI 85 ffrt)
>      (unspec:VNx16BI [(reg:VNx16BI 85 ffrt)] UNSPEC_UPDATE_FFRT))
> (set (reg:CC_NZC 66 cc)
>      (unspec:CC_NZC
>        [(reg:VNx16BI 106) repeated x2
>         (const_int 1 [0x1])
>         (reg/v:VNx16BI 102 [ ok ])] UNSPEC_PTEST))
> 
> FFR is the real first fault register.  FFRT is actually a fake register
> whose only purpose is to describe the dependencies (in rtl) between writes
> to the FFR, reads from the FFR and first-faulting loads.  The whole scheme
> depends on having only one fixed FFRT register.


Right.  The reason this cannot work in combine is that combine always
combines to just *one* insn, at i3; later, if it turns out that it needs
to split it, it can put something at i2.  But that doesn't even happen
here, only the first and the last of those three insns are what is
combined.

It is important combine only moves things forward in the insn stream, to
make sure this whole process is finite.  Or this was true years ago, at
least :-)

> > Why don't you use DF for the DU chains?
> 
> The problem with DF_DU_CHAIN is that it's quadratic in the worst case.

Oh, wow.

> fwprop.c gets around that by using the MD problem and having its own
> dominator walker to calculate limited def-use chains:
> 
>   /* We use the multiple definitions problem to compute our restricted
>      use-def chains.  */

It's not great if every pass invents its own version of some common
infrastructure thing because that common one is not suitable.

I.e., can this be fixed somehow?  Maybe just by having a restricted DU
chains df problem?

> So taking that approach here would still require some amount of
> roll-your-own.  Other reasons are:
> 
> * Even what fwprop does is more elaborate than we need for now.
> 
> * We need to handle memory too, and it's nice to be able to handle
>   it in the same way as registers.
> 
> * Updating a full, ordered def-use chain after a move is a linear-time
>   operation, so whatever happens, we'd need to apply some kind of limit
>   on the number of uses we maintain, with something like that integer
>   point range for the rest.
> 
> * Once we've analysed the insn and built its def-use chains, we don't
>   look at the df_refs again until we update the chains after a successful
>   combination.  So it should be more efficient to maintain a small array
>   of insn_info_rec pointers alongside the numerical range, rather than
>   walk and pollute chains of df_refs and then link back the insn uids
>   to the pass-local info.

So you need something like combine's LOG_LINKS?  Not that handling those
is not quadratic in the worst case, but in practice it works well.  And
it *could* be made linear.

> >> Tracking limited def-use chains is what makes this last bit easy.
> >> We can just try parallelising two instructions from the (bounded) list
> >> of uses.  And for this case there's not any garbage rtl involved, since
> >> we reuse the same PARALLEL rtx between attempts.  The cost is basically
> >> all in the recog call (which would obviously mount up if we went
> >> overboard).
> >
> > *All* examples above and below are just this.
> 
> Yeah, the powerpc and s390x examples were.  The motivating FFR example
> above isn't though: it's a def-use combination in parallel with the
> existing definition.

Right, good point :-)

> >> >> To get a feel for the effect on multiple targets, I did my usual
> >> >> bogo-comparison of number of lines of asm for gcc.c-torture, gcc.dg
> >> >> and g++.dg, this time comparing run-combine=2 and run-combine=6
> >> >> using -O2 -ftree-vectorize:
> >> >
> >> > One problem with this is that these are very short functions on average.
> >> 
> >> There are some long ones too :-)
> >
> > Yes, but this isn't a good stand-in for representative programs.
> 
> Right.  And number of lines of asm isn't a good stand-in for anything much.

For combine, number of insns generated is a surprisingly good measure of
how it performed.  Sometimes not, when it goes over a border of an
inlining decision, say, or bb-reorder decides to duplicate more because
it is cheaper now.

> Like I say, the whole thing is just to get a feel, on tests that are readily
> to hand and are easy to compile without a full toolchain.

Absolutely.  But I have no experience with using your test set, so the
numbers do not necessarily mean so much to me :-)

> > So I'd love to see statistics for *only* combining two uses of the same
> > thing, this is something combine cannot do, and arguably *shouldn't* do!
> 
> OK, here are two sets of results.  The first is for:
> 
> (A) --param run-combine=2 (current combine only)
> (B) --param run-combine=6 (both passes), use-use combinations only
> 
> Target                 Tests   Delta    Best   Worst  Median
> ======                 =====   =====    ====   =====  ======
> aarch64-linux-gnu        158    3060     -72     520      -1
> aarch64_be-linux-gnu     111      24     -57     324      -1
> alpha-linux-gnu            3       3       1       1       1
> amdgcn-amdhsa             18      71     -17      26       1
> arc-elf                  310   -4414   -1516     356       1
> arm-linux-gnueabi         28     -50     -13       3      -1
> arm-linux-gnueabihf       28     -50     -13       3      -1
> avr-elf                   26     308      -1      36      12
> bfin-elf                   6       8      -1       3       1
> bpf-elf                   10      21      -1       6       1
> c6x-elf                    7       9      -6       6       1
> cr16-elf                  13     102       1      27       2
> cris-elf                  35   -1001    -700       3      -2
> csky-elf                   9      28       1       6       2
> epiphany-elf              29     -29      -2       1      -1
> fr30-elf                  12      17      -1       5       1
> frv-linux-gnu              1      -2      -2      -2      -2
> ft32-elf                  10      22      -1       5       2
> h8300-elf                 29      56     -22      14       2
> hppa64-hp-hpux11.23        9      17      -1       4       2
> i686-apple-darwin         10     -33     -20      12      -2
> i686-pc-linux-gnu         41     243     -12      33       3
> ia64-linux-gnu            28     -32     -29      39      -4
> iq2000-elf                 6       8       1       2       1
> lm32-elf                  10      12      -3       5       1
> m32r-elf                   3       2      -2       2       2
> m68k-linux-gnu            19      27      -2       5       2
> mcore-elf                 14      23     -10       6       2
> microblaze-elf             5       5       1       1       1
> mipsel-linux-gnu           9      12      -5       6       2
> mipsisa64-linux-gnu        7       1      -3       1       1
> mmix                       6       6      -2       4       1
> mn10300-elf               20      15      -4       5       1
> moxie-rtems                8      11      -2       3       1
> msp430-elf                 8      24       1       6       2
> nds32le-elf               91    -188     -24     136      -1
> nios2-linux-gnu            2       6       1       5       1
> nvptx-none               396     756       1      16       1
> or1k-elf                   8      20       1       4       2
> pdp11                     65     149     -10      45       2
> powerpc-ibm-aix7.0      1039    1114    -366    2124      -1
> powerpc64-linux-gnu      854    2753    -274    3094      -2
> powerpc64le-linux-gnu    648    -551    -340     208      -1
> pru-elf                    5       5      -2       3       1
> riscv32-elf                7       6      -2       5       1
> riscv64-elf                2       5       2       3       2
> rl78-elf                  80    -648     -98      13      -4
> rx-elf                    16       2      -4       5      -1
> s390-linux-gnu            60    -174     -39      14      -1
> s390x-linux-gnu          152    -781    -159      14      -1
> sh-linux-gnu              13       5     -15       7       1
> sparc-linux-gnu           29       7      -3      11      -1
> sparc64-linux-gnu         51       1      -8      15      -1
> tilepro-linux-gnu        119    -567    -164      15      -2
> v850-elf                   4       4      -1       3       1
> vax-netbsdelf             10      13      -4       5       1
> visium-elf                 4       0      -5       3       1
> x86_64-darwin              7     -12      -9       4      -2
> x86_64-linux-gnu           6     -11      -6       4      -2
> xstormy16-elf             10      13       1       2       1
> xtensa-elf                 6       8      -1       2       2
> 
> which definitely shows up some outliers I need to look at.

Yeah, huh, it's all over the map.

> The second set is for:
> 
> (B) --param run-combine=6 (both passes), use-use combinations only
> (C) --param run-combine=6 (both passes), no restrictions
> 
> Target                 Tests   Delta    Best   Worst  Median
> ======                 =====   =====    ====   =====  ======
> aarch64-linux-gnu        272   -3844    -585      18      -1
> aarch64_be-linux-gnu     190   -3336    -370      18      -1
> alpha-linux-gnu          401   -2735    -370      22      -2
> amdgcn-amdhsa            188    1867    -484    1259      -1
> arc-elf                  257   -1498    -650      54      -1
> arm-linux-gnueabi        168   -1117    -612     680      -1
> arm-linux-gnueabihf      168   -1117    -612     680      -1
> avr-elf                 1341 -111401  -13824     680     -10

Things like this are kind of suspicious :-)

> bfin-elf                1346  -18950   -8461     465      -2
> bpf-elf                   63    -496     -60       3      -2
> c6x-elf                  179  -10527  -10084      41      -2
> cr16-elf                1616  -51479  -10657      42     -13
> cris-elf                 113    -533     -84       4      -2
> csky-elf                 129   -3399    -474       1      -2
> epiphany-elf             151    -375    -149      84      -1
> fr30-elf                 155   -1773    -756     289      -2
> frv-linux-gnu            808  -13332   -2074      67      -1
> ft32-elf                 276   -1688    -111      -1      -2
> h8300-elf                527  -11522   -1747      68      -3
> hppa64-hp-hpux11.23      179    -865    -142      34      -1
> i686-apple-darwin        335   -1266     -56      44      -1
> i686-pc-linux-gnu        222   -2216    -556      32      -1
> ia64-linux-gnu           122   -4793   -1134      40      -5
> iq2000-elf               171   -1341     -61       3      -2
> lm32-elf                 187   -1814    -316      47      -2
> m32r-elf                  70    -597     -98      11      -2
> m68k-linux-gnu           197   -2375    -332     148      -2
> mcore-elf                125   -1236    -146       7      -1
> microblaze-elf           442   -4498   -2094      32      -2
> mipsel-linux-gnu         125   -2050    -222      60      -2
> mipsisa64-linux-gnu      107   -2015    -130      14      -2
> mmix                     103    -239     -26       4      -1
> mn10300-elf              215   -1039    -234      80      -1
> moxie-rtems              149    -754     -79       4      -2
> msp430-elf               180    -600     -63      19      -1
> nds32le-elf              183    -287     -37      32      -1
> nios2-linux-gnu           81    -329     -66       4      -1
> nvptx-none               200   -1882    -208      -2      -2
> or1k-elf                  57    -317     -25       2      -1
> pdp11                    207   -1441    -182      83      -2
> powerpc-ibm-aix7.0       400   -4145    -271      14      -2
> powerpc64-linux-gnu      375   -2062    -160     117      -2
> powerpc64le-linux-gnu    491   -4169    -700     156      -2
> pru-elf                   47   -7020   -6921       6      -1
> riscv32-elf               59   -1379    -139       7      -2
> riscv64-elf               89   -1562    -264       7      -1
> rl78-elf                 289  -16157   -1665      42      -6
> rx-elf                    82    -195     -53       8      -1
> s390-linux-gnu           128   -2108   -1485      63      -1
> s390x-linux-gnu          112     418     -32     522      -1
> sh-linux-gnu             218    -410    -108      68      -1
> sparc-linux-gnu          141    -866     -99      18      -1
> sparc64-linux-gnu        129    -792    -102       3      -2
> tilepro-linux-gnu        953   -4331    -297     332      -2
> v850-elf                  50    -412     -53       2      -3
> vax-netbsdelf            254   -3328    -400       4      -2
> visium-elf               100    -693    -138      16      -1
> x86_64-darwin            345   -2134    -490      72      -1
> x86_64-linux-gnu         307    -843    -288     210      -1
> xstormy16-elf            218    -788    -156      59      -1
> xtensa-elf               195   -1426    -322      36       1
> 
> So the main benefit does seem to come from the def-use part.
> 
> Here are some powerpc64le-linux-gnu examples of (B)->(C):
> 
> gcc.c-torture/execute/20171008-1.c:
> 
> @@ -79,8 +79,7 @@
>         stdu 1,-32(1)
>  .LCFI5:
>         bl foo
> -       rlwinm 3,3,0,0xff
> -       cmpwi 0,3,0
> +       andi. 9,3,0xff
>         bne 0,.L13
>         addi 1,1,32

Soo this starts as

insn_cost 4 for     6: r118:SI=r124:SI
      REG_DEAD r124:SI
insn_cost 4 for     8: r121:SI=zero_extend(r118:SI#0)
      REG_DEAD r118:SI
insn_cost 4 for     9: r122:CC=cmp(r121:SI,0)
      REG_DEAD r121:SI

and then it combines 6->8 of course, but then

Trying 8 -> 9:
    8: r121:SI=zero_extend(r124:SI#0)
      REG_DEAD r124:SI
    9: r122:CC=cmp(r121:SI,0)
      REG_DEAD r121:SI
Failed to match this instruction:
(set (reg:CC 122)
    (compare:CC (subreg:QI (reg:SI 124) 0)
        (const_int 0 [0])))

Hrm, that is a bad idea in general, why do we do that.

> gcc.c-torture/execute/pr28982a.c:
> 
> @@ -427,15 +427,13 @@
>         stxvd2x 0,7,6
>         .align 4
>  .L9:
> -       xxlor 12,32,32
> +       xvcvsxwsp 0,32
>         vadduwm 0,0,1
>         addi 10,9,16
> -       xvcvsxwsp 0,12
> -       xxlor 12,32,32
> +       stxvd2x 0,0,9
> +       xvcvsxwsp 0,32
> +       addi 9,9,32
>         vadduwm 0,0,1
> -       stxvd2x 0,0,9
> -       xvcvsxwsp 0,12
> -       addi 9,9,32
>         stxvd2x 0,0,10
>         bdnz .L9
>         li 3,4
> 
> (Disclaimer: I have no idea if that's correct.)

This seems to be -O3 -mcpu=power8.

It look to be correct just fine.  It saves the two xxlor insns (which are
just register move instructions).  RA couldn't fix this up because there
are two uses of the register (the xvcvsxwsp -- convert V4SI to V4SF; and
the vadduwm -- V4SI addition), and RA doesn't reorder code.

I would say your new code is better.

> gcc.c-torture/execute/pr65215-3.c:
> 
> @@ -56,11 +56,10 @@
>         srdi 10,3,32
>         srdi 9,3,56
>         slwi 6,10,24
> -       srwi 7,10,8
> +       rlwinm 7,10,24,16,23
>         or 9,9,6
> -       rlwinm 7,7,0,16,23
> +       rlwinm 10,10,8,8,15
>         or 9,9,7
> -       rlwinm 10,10,8,8,15
>         or 9,9,10
>         cmpw 0,9,8
>         bne 0,.L4

insn_cost 4 for    15: r139:SI=r118:DI#0 0>>0x8
insn_cost 4 for    16: r140:SI=r139:SI&0xff00
      REG_DEAD r139:SI

(that's both of those insns setting r7).  r118 does not die here yet,
that's only in the 10,10 insn.

Trying 15 -> 16:
   15: r139:SI=r118:DI#0 0>>0x8
   16: r140:SI=r139:SI&0xff00
      REG_DEAD r139:SI
Failed to match this instruction:
(set (reg:SI 140)
    (and:SI (subreg:SI (zero_extract:DI (reg:DI 118 [ _2 ])
                (const_int 32 [0x20])
                (const_int 24 [0x18])) 0)
        (const_int 65280 [0xff00])))
Failed to match this instruction:
(set (reg:SI 140)
    (and:SI (subreg:SI (and:DI (lshiftrt:DI (reg:DI 118 [ _2 ])
                    (const_int 8 [0x8]))
                (const_int 4294967295 [0xffffffff])) 0)
        (const_int 65280 [0xff00])))

Yeah, it's one of those, make_compound_insn :-/  (*^%$(*^$(*@^

> Just to emphasise though: I'm not proposing that we switch this on for
> all targets yet.  It would be opt-in until the pass is more mature.

I do have to wonder if it is a bit late for stage 1.  But opt-in as in,
the user has to use some flag, that should be fine I guess?  But default
for some targets might not be so great, esp. primary targets.

> But that FFR case is really important for the situation it handles.

Yeah.

I hope to have some time to review your actual patch soon.  Should be
less depressing than some of the combine failures :-)


Segher

Re: Add a new combine pass

Reply via email to