Hi, On 4/20/09, Georg-Johann Lay <a...@gjlay.de> wrote: > > Hi Sean, > > as far as I understand, your tool runs on asm code, i.e. you map short > sequences of asm instructions (which must not contain code labels) to > other instruction sequences. >
Right, and it's far from complete. In theory it could handle code with labels and branches but I haven't implemented that yet, and it would be slower for processing.. but I have ideas for how to do it with reasonable efficiency. > The lookup table is generated by brute force? Observing the results of > asm sequences and trying to map to an other sequence that shows better > performance depending on the cost function (time, space, ...). > Yes, right now it's brute force. Fortunately the lookup table can be stored to disk with a checksum of each instruction sequence which is computed in such a way that it is guaranteed to be the same checksum for a longer sequence if they are identical > Then you use patterns that actually match some pieces of gcc output to > automatically generate peephole2 resp. peephole patterns for avr-gcc? > Not quite yet. My peepholes are assembly->assembly. The peepholes for gcc are rtl. > I must admit that I am not a big fan of insn peepholes generated this > way: peepholes can fix some mess from register allocation, but what is > possible on peep level is very limited because register allocation is > finished. So you cannot get some scratch register, and if a register is > no more used after the peephole, you cannot free it and use for > something else. Moreover, if there is a peephole that matches a sequence > A, B > it won't match > A, X, B > where X is independent of A and B. > Right, this is a major concern to me. I know of a number of peepholes defined in avr.md which do not always get applied in cases that it seems like they should. > Maybe it's possible to invent your work to generate patterns for insn > combine rather that for insn peephole? That pass runs before register > allocation and allows to transform from RTL to RTL. Yes, I'm still don't think it's the perfect solution though. I have to look into this. > > Skimmin over the code abr-gcc generates for libgcc, e.g., we see much > romm for improving code both in size and in speed. > > Finally I have some comments/question on code snippets in your avrfreaks > post: > > *I* > > from: > eor r6, r6 > eor r7, r7 > eor r8, r8 > eor r9, r9 > to: > ldi r6, 0 ;; typo > eor r7, r7 > movw r8, r6 > > Besides the typo, avr-gcc already knows how to do this, see > avr.c::output_movsisf: > > AVR_HAVE_MOVW ? (AS1 (clr,%A0) CR_TAB > AS1 (clr,%B0) CR_TAB > AS2 (movw,%C0,%A0)) > : (AS1 (clr,%A0) CR_TAB > AS1 (clr,%B0) CR_TAB > AS1 (clr,%C0) CR_TAB > AS1 (clr,%D0)); > > Maybe you did -fsplit-wide-types? In many situations > -fno-split-wide-types yields better code, but not always. Without > splitting wide types RTL is sometimes a bit unhandy because it has to > deal with larger entities, but with splitting wide types it's harder to > keep track of the bigger picture. > I did not know about this optimization. I was just using -Os. I will try it. Also, that peephole only works for 32bit numbers correct? What if there happen to be 2 16 bit ones? Or even 4 8bit numbers that happen to be able to benefit from this. Also what if you want to load 0x3bd3 into the upper and lower half using ldi, ldi, movw? Currently gcc just does 4 ldis > *II* > > from: > ldi r16, 101 > ori r16, 50 > swap r16 > to: > ldi r16, 255 > andi r16, 119 > > I am a little bit confused. Is the source an output of avr-gcc with > optimization turned on? If so, we should find out why the compiler > generates this code and remove the cause rather than the symptom and say > "ldi r16, 119". > Sorry the second set of examples are ones I manually entered.. gcc did not generate it. I wanted to demonstrate the capability of my superoptimizer has for solving constants (pretty cool isn't it?) And also it's very fast since it doesn't brute force the constants, it actually solves systems of binary equations (like a variable for each bit) With brute force you are limited to 2-3 8bit constants and even then it's a very long runtime. > *III* > from: > mov r24, r25 > ldi r25, 0 > mov r25, r24 > eor r24, r24 > to: > eor r24, r24 > > The original snippet would look something like > (set (reg:HI 101) (zero_extend:HI (reg:QI 100))) > (set (reg:HI 102) (ashift:HI (reg:HI 101) (const_int 8))) > Combine will try > (set (reg:HI 102) > (ashift:HI (zero_extend:HI (reg:QI 100)) > (const_int 8))) > Which could be split/mapped to > (set (subreg:QI (reg:HI 102) 1) (reg:QI 100)) > (set (subreg:QI (reg:HI 102) 0) (const_int 0)) > Note that this is more general and contains the code above if reload > decides to allocate 102 to R24 and 100 to R25 (or 102 to REGNO and 100 > to REGNO+1). > > Unfortunately, skimming generated asm for such sequences and writing > patterns to catch them is very time consuming. But I am unsure if > automatically generated peepholes2 is what we want > -- there will be bulk of patterns in the backend where no one really > knows where they come from. There will be no individual comments why > they are there. It will be harder to maintain the backend. I would specify them as generated and keep them in a separate file. > -- I think before adding peepholes we should try to fix the very > problems: maybe missing combine patterns, playing around with command > line options, smarter ways to printout assembler, maybe costs, insn > constraints, see if the bad code still persist in gcc 4.5, etc. Yes I agree, it is better to handcode a few patterns to take care of 90% of the cases than to have a few hundred generated cases. > > As I said, IMHO peep2 should be a last resort to fix mess if more > sophisticated and more general approaches fail. I guess a bunch of the > cases you see and treat is just because gcc doesn't handle what it could > have handled if it was described somewhere. > Yes, and it's annoying the way gcc is 32bit centric so it means all the patterns have to be duplicated for 8, 16, and 32 bits on avr. Maybe I'm getting carried away, but Ideally gcc would figure out how to add 32bit numbers if it knows how to add 8 bit ones.. it should be able to generate multiplication and division routines using it's knowledge of the assembly language.. and then it would be trivial to support 24bit integers, or fixed point types of any size for any target (It was a pain writing all the multiplication and division routines for 8 different types of fixed point numbers) If it could do this type of thing, it would significantly speed up writing new backends since you would only need to define the instruction set to the compiler, not rtl to the instruction set. Sean _______________________________________________ AVR-GCC-list mailing list AVR-GCC-list@nongnu.org http://lists.nongnu.org/mailman/listinfo/avr-gcc-list