Hi, I am trying to write some simple builtin functions for target avr.
The builtins themselves are no problem. The expansion works as intended. What is unacceptable is a code bloat of +100% ... +150% during the RTL passes. So can anyone assist me in writing down RTL that will yield best/acceptable performance?

The background is this: AVR is an 8 bit Harvard architecture with fairly limited resources, i.e. some kByte of flash memory that holds program code and constant data to be loaded at runtime. Constant data can be loaded into a GPR by one instruction LPM (load program memory) that comes in two flavours:

    LPM Rn, Z     ; char Rn = *Z
    LPM Rn, Z+    ; char Rn = *Z++

where Rn is class "r" and Z must be one specific 16 bit register, a combination of 2 subsequent 8 bit registers. Z is call clobbered. For tiny targets there is just a

    LPM           ; char R0 = *Z

with the two implicit registers Z as above and R0 a fixed reg.

Up to now, LPM is not implemented in the compiler because GCC does not allow adding new target specific qualifiers like "flash". Therefore, the LPM stuff is implemented on libc-level as inline asm macros. The disadvantage of inline asm is that GCC does not know what is going on in LPM and cannot take advantage of LPM Rn, Z+ and reuse Z.

So I tried writing a builtin "__builtin_pgm_read_byte".

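
For reference, the libc-level inline asm approach can be sketched roughly as below. This is only a minimal sketch and not the actual avr-libc code: the names sketch_pgm_read_byte and sketch_copy2 are made up for illustration, the asm branch assumes a core that has "LPM Rn, Z", and the non-AVR branch is just a plain load so that the snippet also compiles on a host.

```c
#include <stdint.h>

/* Hypothetical helper (not the avr-libc macro): read one byte from
   program memory through an opaque inline asm. */
static inline uint8_t sketch_pgm_read_byte (const uint8_t *addr)
{
#if defined (__AVR__)
    uint8_t result;
    /* Constraint "z" forces addr into the Z register pair (r31:r30).
       Assumes an enhanced core that provides "LPM Rn, Z". */
    __asm__ __volatile__ ("lpm %0, Z" : "=r" (result) : "z" (addr));
    return result;
#else
    /* Assumption for illustration only: on a non-AVR host there is no
       separate program memory, so fall back to an ordinary load. */
    return *addr;
#endif
}

/* Hypothetical counterpart of a two-byte copy: read two consecutive
   bytes and store them through u, returning the advanced pointer. */
static inline char * sketch_copy2 (char *u, const uint8_t *q)
{
    *u++ = (char) sketch_pgm_read_byte (q++);
    *u++ = (char) sketch_pgm_read_byte (q++);
    return u;
}
```

With the asm variant GCC only sees an opaque asm that wants its input in Z. It does not know that a post-increment form of the instruction exists, so the pointer increment has to be done separately from the load.
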
Suppose the following trivial C test source:

    char u[4];

    void u1_dir (unsigned char * q)
    {
        u[0] = __builtin_pgm_read_byte (q);
    }

    char * u2_ind (char *u, unsigned char * q)
    {
        *u++ = __builtin_pgm_read_byte (q++);
        *u++ = __builtin_pgm_read_byte (q++);
        return u;
    }

Using -Os the first function compiles to:

    u1_dir:
    /* prologue: function */
    /* frame size = 0 */
        movw r30,r24     ;  2  *movhi/1  [length = 1]
        lpm r24, Z+      ;  7  lpmZ_postinc_1  [length = 1]
        sts u,r24        ;  9  *movqi/3  [length = 2]
    /* epilogue start */
        ret              ;  20  return  [length = 1]

For the next test function you would expect code like this:

    u2_ind:
        movw r26, r24    ; X = arg #1
        movw r30, r22    ; Z = arg #2
        ; with Rn some GPR
        lpm  Rn, Z+      ; Rn = *Z++
        st   X+, Rn      ; *X++ = Rn
        lpm  Rn, Z+      ; Rn = *Z++
        st   X+, Rn      ; *X++ = Rn
        movw r24, r26    ; return = X
        ret

However, the resulting code is unacceptable, both in program memory and in execution time wasted:

    u2_ind:
        push r28         ;  34  *pushqi/1  [length = 1]
        push r29         ;  35  *pushqi/1  [length = 1]
    /* prologue: function */
    /* frame size = 0 */
        movw r26,r24     ;  2  *movhi/1  [length = 1]
        movw r30,r22     ;  7  *movhi/1  [length = 1]
        lpm r24, Z+      ;  8  lpmZ_postinc_1  [length = 1]
        movw r28,r26     ;  30  *movhi/1  [length = 1]
        st Y+,r24        ;  9  *movqi/3  [length = 1]
        movw r30,r22     ;  32  *movhi/1  [length = 1]
        adiw r30,1       ;  11  *addhi3/2  [length = 1]
        lpm r24, Z+      ;  13  lpmZ_postinc_1  [length = 1]
        adiw r26,1       ;  14  *movqi/3  [length = 3]
        st X,r24
        sbiw r26,1
        movw r24,r28     ;  33  *movhi/1  [length = 1]
        adiw r24,1       ;  20  *addhi3/2  [length = 1]
    /* epilogue start */
        pop r29          ;  38  popqi  [length = 1]
        pop r28          ;  39  popqi  [length = 1]
        ret              ;  40  return_from_epilogue  [length = 1]

I do not want to imagine the impact on a real world application's performance.

What is going wrong? I tried playing around with various RTL representations of LPM. The patch uses UNSPEC in order to keep changes in the backend minimal. I also tried various command line switches like -O2 -Os -fno-split-wide-types -fno-tree-scev-cprop and so on.

Can anyone give me some hints and teach me how to write the RTL correctly? I do not intend to mess around with a newly introduced pointer mode like PHImode. Or would this bring the breakthrough by avoiding UNSPEC?

The patch is against 143546.

Regards,
Georg-Johann

avr-__builtin_pgm_read_byte.patch
Description: Binary data