Re: [avr-gcc-list] Re: C vs. assembly performance

Georg-Johann Lay Sat, 28 Feb 2009 08:03:54 -0800

Nicholas Vinen schrieb:

Georg-Johann Lay wrote:

If you are sure it is really some performance issue/regression and not
due to some language standard implication, you can add a report to
   http://sourceforge.net/tracker/?group_id=68108

so that the subject won't be forgotten. Also mind
   http://gcc.gnu.org/bugs.html

And of course, you can ask questions here. In that case it is helpful
if you can manage to simplify the source to a small piece of code that
triggers the problem and allows others to reproduce the problem. (i.e.
no #include in the code, no ... (except for varargs), a.s.o).

Snippets of .s may point to the problem when you add -dp -fverbose-asm

And there are lots of places where avr-gcc produces suboptimal or even
bad code, so feedback is welcome.

But note that just a few guys are working on the AVR part of gcc.
I would do more if I had the time (and the support of some gurus to
ask questions on internals then and when...)


OK, I only spent a few minutes looking at old code and I found some
obviously sub-optimal results. It distills down to this:

#include <avr/io.h>

int main(void) {
  unsigned long packet = 0;

  while(1) {
    if( !(PINC & _BV(PC2)) ) {
      packet = (packet<<1)|(((unsigned char)PINC>>1)&1);
    }
    PORTB = packet;
  }
}

avr-gcc is: avr-gcc (Gentoo 4.3.3 p1.0, pie-10.1.5) 4.3.3

The avr/io stuff is just so that it won't optimise the code away to nothing.


Please avoid the #include stuff. You can use source like this:

#define PINC  (*((unsigned char volatile*) 0x20))
#define PORTB (*((unsigned char volatile*) 0x21))

void foo ()
{
    unsigned long packet = 0;

    while(1)
    {
        if (!(PINC & (1 << 2)))
        {
            packet = (packet<<1)|(((unsigned char)PINC>>1)&1);
        }
        PORTB = packet;
    }
}

I tried compiling it with both -Os and -O2:

avr-gcc -g -dp -fverbose-asm -Os -S -mmcu=atmega48 -o test test.c

The result includes this:

        lsl r18  ;  packet       ;  50  *ashlsi3_const/2        [length = 4]
        rol r19  ;  packet
        rol r20  ;  packet
        rol r21  ;  packet
        in r24,38-0x20   ;  D.1214,      ;  16  *movqi/4        [length = 1]
        lsr r24  ;  D.1214       ;  17  lshrqi3/3       [length = 1]
        ldi r25,lo8(0)   ; ,     ;  48  *movqi/2        [length = 1]
        ldi r26,lo8(0)   ; ,     ;  46  *movhi/4        [length = 2]
        ldi r27,hi8(0)   ; ,
        andi r24,lo8(1)  ;  tmp52,       ;  19  andsi3/2        [length = 4]
        andi r25,hi8(1)  ;  tmp52,
        andi r26,hlo8(1)         ;  tmp52,
        andi r27,hhi8(1)         ;  tmp52,
        or r18,r24       ;  packet, tmp52        ;  20  iorsi3/1        [length 
= 4]
        or r19,r25       ;  packet, tmp52
        or r20,r26       ;  packet, tmp52
        or r21,r27       ;  packet, tmp52

The problem, it seems, is that the compiler doesn't realize that the
right hand side of the expression can only have any non-zero values in
the bottom 8 bits, since it's an unsigned char which is being implicitly
expanded to 32 bits for the or operation. In fact, it's only the bottom
bit that's ever non-zero. As a result it's spending a number of cycles
and registers doing useless things. I'll copy a report to the locations
you mention in your e-mail.

Note that this may partially be covered by report 145284 (which I cannotfind, maybe Eric has closed/removed it)


I already filed a patch for that in
   http://lists.gnu.org/archive/html/avr-gcc-list/2008-12/msg00019.html

that covers your issue to some extent or maybe almost complete: The newpattern "*ior<MODE>2_<MODE>bit0" would match some parts of your code.

There are probably ways to work around this, such as making "packet" a
union of an unsigned char and a long, then shifting the long and only
ORing in the unsigned char. I'll note that there's also an optimization
to be had with the right hand side of the expression. I would write the
assembly something like this:

>    lsl r18
>    rol r19
>    rol r20
>    rol r21
>    in r24,38-0x20
>    bst r24, 1
>    bld r18, 0

The result of the above patch should lead to something like
        lsl r18
        rol r19
        rol r20
        rol r21
        in r24,38-0x20
        bst r24, 1
        sbrs r18, 0
        bld r18, 0

The SBRS is necessary, because the pattern is not aware of the fact thatr18.0 is 0. Maybe the optimization is even better (or waeker); I am notusing that patch at the moment and can just estimate its effect whenpeeking into rtl dumps of an unpatched gcc.

Concerning the patch itself, I don't know anything about its fate and ifit will ever make its way into gcc because of administrative obstaclesand the technique I used.

I don't like the technique I am using because it leads to complexpatterns that are hard to understand an test and will become useless ifthe middleend decides to represent the stuff in a slightly different way...


Georg-Johann



_______________________________________________
AVR-GCC-list mailing list
AVR-GCC-list@nongnu.org
http://lists.nongnu.org/mailman/listinfo/avr-gcc-list

Re: [avr-gcc-list] Re: C vs. assembly performance

Reply via email to