Hi,

The code produced by GCC for the RL78 target is around twice as large as that produced by IAR and I've been trying to find out why.

The project I'm working on uses an RL78/F12 with 16KB of code flash. As I have to get a bootloader and an application into that, I have to pay close attention to how large the code is becoming.

Looking at the assembler output for some simple examples, the problem seems to be plainly 'bloated' code rather than a failure to squeeze out every last byte through ingenious optimization tricks.

I've managed to build GCC myself so that I could experiment a bit, but as this is my first foray into compiler internals, I'm struggling to work out how things fit together and what affects what.

My initial impression is that significant gains could be made by clearing away some low-hanging fruit, but without understanding what caused that code to be generated in the first place, it's hard to do anything about it.

In particular, I'd be interested to know what is caused (or could be improved) by the RL78-specific code, and what comes from the generic part of GCC.

Here's an example extracted from one of the functions in our project:

--------
unsigned short gOrTest;
#define SOE0 (*(volatile unsigned short *)0xF012A)

void orTest()
{
   SOE0 |= 3;
   /* gOrTest |= 3; */
}
--------

This produces the following code (using -Os):

   0000 C9 F2 2A 01    movw  r10, #298
   0004 AD F2          movw  ax, r10
   0006 16             movw  hl, ax
   0007 AB             movw  ax, [hl]
   0008 BD F4          movw  r12, ax
   000a 60             mov   a, x
   000b 6C 03          or    a, #3
   000d 9D F0          mov   r8, a
   000f 8D F5          mov   a, r13
   0011 9D F1          mov   r9, a
   0013 AD F2          movw  ax, r10
   0015 12             movw  bc, ax
   0016 AD F0          movw  ax, r8
   0018 78 00 00       movw  [bc], ax
   001b D7             ret

There's so much unnecessary register passing going on there (#298 could go straight into HL, why does the same value end up in BC even though HL hasn't been touched? etc.)

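In case it helps anyone reproduce this, here is a minimal sketch that splits the test into two functions so both variants show up in a single listing when compiled with -Os. The function names orTestRegister and orTestVariable and the fakeSoe0 stand-in are made up for illustration, and I'm assuming that __RL78__ is predefined when building for the RL78; on any other target the register is replaced by an ordinary variable so the file still builds and runs.

```c
/* Assumption: __RL78__ is predefined by the RL78 port. On other targets a
   plain variable stands in for the register (hypothetical, for host builds). */
#ifdef __RL78__
#define SOE0 (*(volatile unsigned short *)0xF012A)
#else
static unsigned short fakeSoe0;
#define SOE0 (*(volatile unsigned short *)&fakeSoe0)
#endif

unsigned short gOrTest;

/* Read-modify-write through a volatile pointer to a fixed address. */
void orTestRegister(void)
{
    SOE0 |= 3;
}

/* The same operation on an ordinary (non-volatile) global variable. */
void orTestVariable(void)
{
    gOrTest |= 3;
}
```

The only differences between the two functions are the volatile qualifier and the way the address is formed, so any difference in the generated code should come down to one of those.
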

Commenting out the 'SOE0' line and bringing the 'gOrTest' line back in generates better code (but still worthy of optimization):

   0000 8F 00 00       mov   a, !_gOrTest
   0003 6C 03          or    a, #3
   0005 9F 00 00       mov   !_gOrTest, a
   0008 8F 00 00       mov   a, !_gOrTest+1
   000b 6C 00          or    a, #0
   000d 9F 00 00       mov   !_gOrTest+1, a
   0010 D7             ret

What causes that code to be generated when using a variable instead of a fixed memory address?

Even allowing for the unnecessary 'or a, #0' and keeping to a 16-bit access, it's still possible to perform the SOE0 operation in half the space of the original (14 bytes instead of 28):

   0000 36 2A 01       movw  hl, #298
   0003 AB             movw  ax, [hl]
   0004 75             mov   d, a
   0005 60             mov   a, x
   0006 6C 03          or    a, #3
   0008 70             mov   x, a
   0009 65             mov   a, d
   000a 6C 00          or    a, #0
   000c BB             movw  [hl], ax
   000d D7             ret

And, of course, that could be optimized further.

Excessive register copying and an apparent preference for R8 onwards over the B, C, D, E, H and L registers (which could save a byte on every 'mov') seem to be among the main causes of the 'bloated' code.

So, I guess my question is how much of the bloat comes from inefficiencies in the hardware-specific code? I saw a comment in the RL78 code about performing CSE optimization but it's not clear to me where or how that would be done.

I tried to look at the code for some other processors to get an idea but it's hard to find things when you don't know what you're looking for :)

Any help would be gratefully received!

Regards,

Richard Hulme