[target.md]: Backend passes cause code bloat of +140%.

Georg-Johann Lay Wed, 21 Jan 2009 09:04:42 -0800

Hi, 

I am trying to write some simple builtin functions for target avr.


The builtins themselves are no problem. The expansion works as intended.

What is unacceptable is a code bloat of +100% ... +150% during the RTL
passes.

So can anyone assist me in writing down RTL that will yield best/acceptable
performance?

The background is this:

AVR is an 8-bit Harvard architecture with fairly limited resources,
i.e. a few kBytes of flash memory that hold program code and constant
data to be loaded at runtime.

Constant data can be loaded into a GPR by one instruction
LPM (load program memory) that comes in two flavours:

 LPM  Rn, Z     ; char Rn = *Z
 LPM  Rn, Z+    ; char Rn = *Z++

where Rn is class "r" and Z must be one specific 16 bit register,
a combination of 2 subsequent 8 bit registers. Z is call clobbered.

For tiny targets there is just a

 LPM           ; char R0 = *Z

with the two implicit registers Z as above and R0 a fixed reg.
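
The three LPM forms can be modelled on a host machine as a rough sketch.
This is a toy model and not a simulator: struct avr_model and all function
names are made up for illustration, and flash is assumed to be a plain
byte array indexed by Z. The only hardware fact it relies on is that Z is
the register pair r31:r30.

```c
#include <stdint.h>

/* Toy model (hypothetical names): 32 general purpose registers plus a
   byte-addressed flash image. */
struct avr_model {
    uint8_t r[32];
    const uint8_t *flash;
};

/* Z is the register pair r31:r30; r30 holds the low byte. */
static uint16_t get_z (const struct avr_model *m)
{
    return (uint16_t) (m->r[30] | (m->r[31] << 8));
}

static void set_z (struct avr_model *m, uint16_t z)
{
    m->r[30] = (uint8_t) (z & 0xff);
    m->r[31] = (uint8_t) (z >> 8);
}

/* LPM Rn, Z : Rn = *Z, Z unchanged */
static void lpm (struct avr_model *m, unsigned n)
{
    m->r[n] = m->flash[get_z (m)];
}

/* LPM Rn, Z+ : Rn = *Z++, so Z can be reused for the next load */
static void lpm_postinc (struct avr_model *m, unsigned n)
{
    uint16_t z = get_z (m);
    m->r[n] = m->flash[z];
    set_z (m, (uint16_t) (z + 1));
}

/* Plain LPM on tiny devices: R0 = *Z, both registers implicit */
static void lpm_r0 (struct avr_model *m)
{
    lpm (m, 0);
}
```

Two back-to-back calls of lpm_postinc read consecutive flash bytes
without any extra address arithmetic on Z.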

Up to now, LPM is not implemented in the compiler because GCC does not
allow adding new target-specific qualifiers like "flash".
Therefore, the LPM stuff is implemented at libc level as inline asm
macros.
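
For readers who have not seen such a macro, a minimal sketch of the
inline asm approach might look as follows. flash_read_byte is a
hypothetical name and not the actual libc macro. The non-AVR branch is
only a stand-in so that the sketch compiles on a host, and the
"lpm %0, Z" form assumes a device that has the two-operand LPM.

```c
#include <stdint.h>

/* Hypothetical helper, not the actual libc macro: read one byte from
   program memory. */
static inline uint8_t flash_read_byte (const uint8_t *addr)
{
#if defined (__AVR__)
    uint8_t result;
    /* The "z" constraint forces the address into the Z register pair.
       The compiler sees only an opaque asm with one input and one
       output; it does not know that a load from *Z takes place. */
    __asm__ __volatile__ ("lpm %0, Z"
                          : "=r" (result)
                          : "z" (addr));
    return result;
#else
    /* Host stand-in: one flat address space, plain load. */
    return *addr;
#endif
}
```

Since the asm text is opaque, any increment of the address has to be
written as separate C arithmetic outside the asm.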

The disadvantage of inline asm is that GCC does not know what is
going on in LPM and cannot take advantage of LPM Rn, Z+ and reuse Z.

So I tried writing a builtin "__builtin_pgm_read_byte". 
Suppose the following trivial C test source:

char u[4];

void u1_dir (unsigned char * q)
{
    u[0] = __builtin_pgm_read_byte (q);
}

char * u2_ind (char *u, unsigned char * q)
{
    *u++ = __builtin_pgm_read_byte (q++);
    *u++ = __builtin_pgm_read_byte (q++);

    return u;
}

Using -Os the first function compiles to:

u1_dir:
/* prologue: function */
/* frame size = 0 */
        movw r30,r24     ;  2   *movhi/1        [length = 1]
        lpm r24, Z+      ;  7   lpmZ_postinc_1  [length = 1]
        sts u,r24        ;  9   *movqi/3        [length = 2]
/* epilogue start */
        ret      ;  20  return  [length = 1]

For the next test function you would expect code like this:

u2_ind:
        movw r26, r24    ;  X = arg #1
        movw r30, r22    ;  Z = arg #2
        
        ; with Rn some GPR
        lpm Rn, Z+       ;  Rn = *Z++
        st  X+, Rn       ;  *X++ = Rn

        lpm Rn, Z+       ;  Rn = *Z++
        st  X+, Rn       ;  *X++ = Rn

        movw r24, r26    ; return = X
        ret

However, the resulting code is unacceptable in terms of both program
memory and execution time wasted. (The patch used UNSPEC in order to
keep changes in the backend minimal.)

u2_ind:
        push r28         ;  34  *pushqi/1       [length = 1]
        push r29         ;  35  *pushqi/1       [length = 1]
/* prologue: function */
/* frame size = 0 */
        movw r26,r24     ;  2   *movhi/1        [length = 1]
        movw r30,r22     ;  7   *movhi/1        [length = 1]
        lpm r24, Z+      ;  8   lpmZ_postinc_1  [length = 1]
        movw r28,r26     ;  30  *movhi/1        [length = 1]
        st Y+,r24        ;  9   *movqi/3        [length = 1]
        movw r30,r22     ;  32  *movhi/1        [length = 1]
        adiw r30,1       ;  11  *addhi3/2       [length = 1]
        lpm r24, Z+      ;  13  lpmZ_postinc_1  [length = 1]
        adiw r26,1       ;  14  *movqi/3        [length = 3]
        st X,r24
        sbiw r26,1
        movw r24,r28     ;  33  *movhi/1        [length = 1]
        adiw r24,1       ;  20  *addhi3/2       [length = 1]
/* epilogue start */
        pop r29  ;  38  popqi   [length = 1]
        pop r28  ;  39  popqi   [length = 1]
        ret      ;  40  return_from_epilogue    [length = 1]
 
I would rather not imagine the impact on a real-world application's performance.
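
The size difference for u2_ind can be tallied from the two listings
above. The sketch below is only bookkeeping under two assumptions: the
[length = n] annotations count 16-bit words, and each of the 8
instructions of the expected version takes one word (that listing
carries no annotations). The array and function names are made up for
this example.

```c
#include <stddef.h>

/* [length = n] values copied from the generated u2_ind listing, in order. */
static const unsigned generated_len[] = {
    1, 1,                    /* push r28, push r29 */
    1, 1, 1, 1, 1, 1, 1, 1,  /* first movw up to the second lpm */
    3,                       /* insn 14: adiw / st / sbiw */
    1, 1,                    /* movw r24,r28 and adiw r24,1 */
    1, 1, 1                  /* pop, pop, ret */
};

/* Assumption: one word per instruction in the expected version. */
static const unsigned expected_len[] = { 1, 1, 1, 1, 1, 1, 1, 1 };

/* Sum the lengths of n instructions. */
static unsigned sum_words (const unsigned *len, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += len[i];
    return total;
}

/* Growth of `actual` over `wanted`, in percent (integer division). */
static unsigned growth_percent (unsigned actual, unsigned wanted)
{
    return (actual - wanted) * 100u / wanted;
}
```

With these inputs the sums are 18 and 8 words, so growth_percent (18, 8)
gives 125, which lies inside the +100% ... +150% range mentioned at the
top.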

What is going wrong?

I tried playing around with various RTL representations of LPM.
I also tried various command line switches like

 -O2
 -Os
 -fno-split-wide-types 
 -fno-tree-scev-cprop

and so on.

Can anyone give me some hints and teach me how to write the RTL
correctly?

I do not intend to mess around with a newly introduced pointer mode
like PHImode. Or would this bring the breakthrough by avoiding
UNSPEC?

The patch is against 143546.

Regards, Georg-Johann

avr-__builtin_pgm_read_byte.patch
Description: Binary data
