Re: [avr-gcc-list] avr superoptimizer

Sean D'Epagnier Mon, 20 Apr 2009 16:43:04 -0700

Hi,

On 4/20/09, Georg-Johann Lay <a...@gjlay.de> wrote:
>
> Hi Sean,
>
> as far as I understand, your tool runs on asm code, i.e. you map short
> sequences of asm instructions (which must not contain code labels) to
> other instruction sequences.
>


Right, and it's far from complete.  In theory it could handle code
with labels and branches, but I haven't implemented that yet, and it
would be slower to process.  I do have ideas for how to do it
with reasonable efficiency.

> The lookup table is generated by brute force? Observing the results of
> asm sequences and trying to map to an other sequence that shows better
> performance depending on the cost function (time, space, ...).
>

Yes, right now it's brute force.  Fortunately the lookup table can be
stored to disk along with a checksum of each instruction sequence.  The
checksum is computed in such a way that a longer sequence is guaranteed
to get the same checksum if the two are identical in effect.
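
One way to read the checksum idea is as a hash of what a sequence does
rather than of its text.  The following is a minimal sketch under that
assumption, not the superoptimizer's actual code: the three-instruction
subset, the test vectors and the names run() and behaviour_checksum()
are all hypothetical, and status flags are ignored.  The two sequences
compared are the ones from example III later in this message.

```python
import hashlib

def run(seq, regs):
    """Run a tiny, hypothetical AVR subset on a dict {reg_number: byte}."""
    regs = dict(regs)  # do not modify the caller's state
    for op, rd, src in seq:
        if op == "ldi":      # load immediate
            regs[rd] = src & 0xFF
        elif op == "mov":    # copy register src into rd
            regs[rd] = regs[src]
        elif op == "eor":    # exclusive or; "eor r, r" clears r
            regs[rd] ^= regs[src]
        else:
            raise ValueError("unsupported instruction: " + op)
    return regs

# Fixed input states for r24/r25 (assumed); a real tool would need many more.
TEST_VECTORS = [
    {24: 0x00, 25: 0x00},
    {24: 0xFF, 25: 0x01},
    {24: 0x5A, 25: 0xA5},
    {24: 0x80, 25: 0x7F},
]

def behaviour_checksum(seq):
    """Hash the final register state reached from every test vector."""
    h = hashlib.sha256()
    for vec in TEST_VECTORS:
        out = run(seq, vec)
        h.update(bytes(out[r] for r in sorted(out)))
    return h.hexdigest()[:16]

long_seq = [("mov", 24, 25), ("ldi", 25, 0), ("mov", 25, 24), ("eor", 24, 24)]
short_seq = [("eor", 24, 24)]
print(behaviour_checksum(long_seq) == behaviour_checksum(short_seq))
```

With a hash like this, equivalent sequences always collide, but a
collision only marks a candidate: sequences that agree on the test
vectors could still differ on other inputs, so a separate equivalence
check would be needed before using a replacement.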

> Then you use patterns that actually match some pieces of gcc output to
> automatically generate peephole2 resp. peephole patterns for avr-gcc?
>

Not quite yet.  My peepholes are assembly->assembly.  The peepholes
for gcc are rtl.

> I must admit that I am not a big fan of insn peepholes generated this
> way: peepholes can fix some mess from register allocation, but what is
> possible on peep level is very limited because register allocation is
> finished. So you cannot get some scratch register, and if a register is
> no more used after the peephole, you cannot free it and use for
> something else. Moreover, if there is a peephole that matches a sequence
>    A, B
> it won't match
>    A, X, B
> where X is independent of A and B.
>

Right, this is a major concern to me.  I know of a number of
peepholes defined in avr.md which do not always get applied in cases
where it seems like they should.
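
The "A, B" versus "A, X, B" problem can be shown with a toy matcher.
This is only an illustrative sketch: the tuple encoding and the
peephole() function are hypothetical, and the single rule it knows
(drop the copy back in "mov a, b ; mov b, a") stands in for whatever
patterns a real backend has.

```python
def peephole(seq):
    """Drop the second of two ADJACENT moves "mov a, b ; mov b, a"."""
    out = []
    i = 0
    while i < len(seq):
        cur = seq[i]
        nxt = seq[i + 1] if i + 1 < len(seq) else None
        if (nxt is not None
                and cur[0] == "mov" and nxt[0] == "mov"
                and cur[1] == nxt[2] and cur[2] == nxt[1]):
            out.append(cur)   # keep the first move
            i += 2            # skip the redundant copy back
        else:
            out.append(cur)
            i += 1
    return out

A = ("mov", "r24", "r25")
B = ("mov", "r25", "r24")
X = ("inc", "r20")            # independent of A and B

print(peephole([A, B]))       # pattern applies, B is dropped
print(peephole([A, X, B]))    # X sits in between, nothing is changed
```

Catching the second case would need either reordering independent
instructions before matching or a matcher that looks past them, and
both require knowing that X really does not touch r24 or r25.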

> Maybe it's possible to invent your work to generate patterns for insn
> combine rather than for insn peephole? That pass runs before register
> allocation and allows to transform from RTL to RTL.

Yes, though I still don't think it's the perfect solution.  I have
to look into this.

>
> Skimming over the code avr-gcc generates for libgcc, e.g., we see much
> room for improving code both in size and in speed.
>
> Finally I have some comments/question on code snippets in your avrfreaks
> post:
>
> *I*
>
> from:
>     eor   r6, r6
>     eor   r7, r7
>     eor   r8, r8
>     eor   r9, r9
> to:
>     ldi   r6, 0 ;; typo
>     eor   r7, r7
>     movw   r8, r6
>
> Besides the typo, avr-gcc already knows how to do this, see
> avr.c::output_movsisf:
>
> AVR_HAVE_MOVW ? (AS1 (clr,%A0) CR_TAB
>               AS1 (clr,%B0) CR_TAB
>               AS2 (movw,%C0,%A0))
>            : (AS1 (clr,%A0) CR_TAB
>               AS1 (clr,%B0) CR_TAB
>               AS1 (clr,%C0) CR_TAB
>               AS1 (clr,%D0));
>
> Maybe you did -fsplit-wide-types? In many situations
> -fno-split-wide-types yields better code, but not always. Without
> splitting wide types RTL is sometimes a bit unhandy because it has to
> deal with larger entities, but with splitting wide types it's harder to
> keep track of the bigger picture.
>

I did not know about this optimization.  I was just using -Os.  I will try it.

Also, that peephole only works for 32-bit numbers, correct?  What if
there happen to be two 16-bit ones, or even four 8-bit numbers that
could benefit from this?  And what if you want to load 0x3bd3
into both the upper and lower half using ldi, ldi, movw?  Currently gcc
just does 4 ldis.
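
The ldi, ldi, movw idea can be sketched as a small code emitter.  This
is a hypothetical helper, not anything from gcc: load_const32() and its
output format are made up for illustration, and it assumes a device
with movw, a destination in r16..r31 (ldi only works there) and an
even base register (movw copies even-aligned register pairs).

```python
def load_const32(base, value):
    """Return asm lines loading a 32-bit constant into r(base)..r(base+3)."""
    if base % 2 != 0 or not 16 <= base <= 28:
        raise ValueError("base must be an even register in r16..r28")
    b = [(value >> (8 * i)) & 0xFF for i in range(4)]  # little-endian bytes
    if b[0] == b[2] and b[1] == b[3]:
        # Upper and lower 16-bit halves are equal: load one half, copy it.
        return [
            f"ldi r{base}, 0x{b[0]:02x}",
            f"ldi r{base + 1}, 0x{b[1]:02x}",
            f"movw r{base + 2}, r{base}",
        ]
    # General case: one ldi per byte.
    return [f"ldi r{base + i}, 0x{b[i]:02x}" for i in range(4)]

for line in load_const32(16, 0x3BD33BD3):
    print(line)
```

For 0x3bd33bd3 this emits three instructions instead of four.  The same
test on byte pairs would also cover two unrelated 16-bit values that
happen to sit in adjacent, suitably aligned registers.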

> *II*
>
> from:
>     ldi   r16, 101
>     ori   r16, 50
>     swap   r16
> to:
>     ldi   r16, 255
>     andi   r16, 119
>
> I am a little bit confused. Is the source an output of avr-gcc with
> optimization turned on? If so, we should find out why the compiler
> generates this code and remove the cause rather than the symptom and say
> "ldi r16, 119".
>

Sorry, the second set of examples are ones I entered manually; gcc did
not generate them.  I wanted to demonstrate the capability my
superoptimizer has for solving constants (pretty cool, isn't it?).  It
is also very fast, since it doesn't brute force the constants; it
actually solves systems of binary equations (with a variable for each
bit).  With brute force you are limited to 2-3 8-bit constants, and even
then the runtime is very long.
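
To make the "variable for each bit" idea concrete, here is a minimal
sketch for the simplest possible case, a single andi whose constant is
unknown.  It is an assumption about the general approach, not the
superoptimizer's actual solver: for a purely bitwise instruction every
bit gives an independent equation, so the constant can be read off bit
by bit instead of trying all 256 values.  The helper names swap8() and
solve_andi_mask() are hypothetical.

```python
def swap8(v):
    """AVR swap: exchange the high and low nibble of a byte."""
    return ((v << 4) | (v >> 4)) & 0xFF

def solve_andi_mask(start, target):
    """Find K with (start & K) == target, one bit equation at a time.

    Returns None when no such K exists.
    """
    k = 0
    for bit in range(8):
        s = (start >> bit) & 1
        t = (target >> bit) & 1
        if t == 1:
            if s == 0:
                return None      # 0 & k_bit can never be 1
            k |= 1 << bit        # 1 & k_bit == 1 forces k_bit = 1
        # t == 0: k_bit = 0 satisfies the equation whatever s is
    return k

# Example II: ldi r16,101 ; ori r16,50 ; swap r16
target = swap8(101 | 50)
print(target)                        # value the original sequence leaves in r16
print(solve_andi_mask(255, target))  # constant K for: ldi r16,255 ; andi r16,K
```

Instructions such as add couple neighbouring bits through the carry, so
a real solver has to handle equations that are no longer independent.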

> *III*
> from:
>     mov   r24, r25
>     ldi   r25, 0
>     mov   r25, r24
>     eor   r24, r24
> to:
>     eor   r24, r24
>
> The original snippet would look something like
>    (set (reg:HI 101) (zero_extend:HI (reg:QI 100)))
>    (set (reg:HI 102) (ashift:HI (reg:HI 101) (const_int 8)))
> Combine will try
>    (set (reg:HI 102)
>         (ashift:HI (zero_extend:HI (reg:QI 100))
>                    (const_int 8)))
> Which could be split/mapped to
>    (set (subreg:QI (reg:HI 102) 1) (reg:QI 100))
>    (set (subreg:QI (reg:HI 102) 0) (const_int 0))
> Note that this is more general and contains the code above if reload
> decides to allocate 102 to R24 and 100 to R25 (or 102 to REGNO and 100
> to REGNO+1).
>
> Unfortunately, skimming generated asm for such sequences and writing
> patterns to catch them is very time consuming. But I am unsure if
> automatically generated peepholes2 is what we want
> -- there will be bulk of patterns in the backend where no one really
> knows where they come from. There will be no individual comments why
> they are there. It will be harder to maintain the backend.

I would specify them as generated and keep them in a separate file.

> -- I think before adding peepholes we should try to fix the very
> problems: maybe missing combine patterns, playing around with command
> line options, smarter ways to printout assembler, maybe costs, insn
> constraints, see if the bad code still persist in gcc 4.5, etc.

Yes I agree, it is better to handcode a few patterns to take care of
90% of the cases than to have a few hundred generated cases.

>
> As I said, IMHO peep2 should be a last resort to fix mess if more
> sophisticated and more general approaches fail. I guess a bunch of the
> cases you see and treat is just because gcc doesn't handle what it could
> have handled if it was described somewhere.
>

Yes, and it's annoying that gcc is so 32-bit centric; it means all
the patterns have to be duplicated for 8, 16, and 32 bits on avr.

Maybe I'm getting carried away, but ideally gcc would figure out how
to add 32-bit numbers if it knows how to add 8-bit ones.  It should be
able to generate multiplication and division routines using its
knowledge of the assembly language, and then it would be trivial to
support 24-bit integers, or fixed point types of any size, for any
target.  (It was a pain writing all the multiplication and division
routines for 8 different types of fixed point numbers.)  If it could
do this type of thing, it would significantly speed up writing new
backends, since you would only need to define the instruction set to
the compiler, not rtl to the instruction set.
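
The addition case is the easy one and shows the shape of the idea: an
addition of any width is a chain of 8-bit add-with-carry steps, which
is what an add followed by adc instructions does on avr.  A minimal
sketch, with hypothetical helper names add8() and add_wide(), modelling
only the arithmetic and not any code generation:

```python
def add8(a, b, carry_in=0):
    """8-bit add with carry: returns (result byte, carry out)."""
    total = a + b + carry_in
    return total & 0xFF, total >> 8

def add_wide(a, b, nbytes):
    """Add two nbytes-wide unsigned integers using only add8."""
    result = 0
    carry = 0
    for i in range(nbytes):              # least significant byte first
        byte_a = (a >> (8 * i)) & 0xFF
        byte_b = (b >> (8 * i)) & 0xFF
        byte, carry = add8(byte_a, byte_b, carry)
        result |= byte << (8 * i)
    return result, carry

print(add_wide(0x12345678, 0x11111111, 4))  # 32-bit
print(add_wide(0xFFFFFF, 1, 3))             # 24-bit, wraps with carry out
```

The width is just a loop bound here, so 24-bit or any other size comes
for free.  Multiplication and division are much harder to derive this
way, which is the part that would need real work in a compiler.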

Sean


