On 11/03/14 01:40, DJ Delorie wrote:
>> I'm curious. Have you tried out other approaches before you decided
>> to go with the virtual registers?
> Yes. Getting GCC to understand the "unusual" addressing modes the
> RL78 uses was too much for the register allocator to handle. Even
> when the addressing modes are limited to "usual" ones, GCC doesn't
> have a good way to do regalloc and reload when there are limits on
> what registers you can use in an address expression, and it's worse
> when there are dependencies between operands, or limited numbers of
> address registers.
Is it possible that the virtual pass causes inefficiencies in some cases
by sticking with r8-r31 when one of the 'normal' registers would be better?
For example, I'm having a devil of a time convincing the compiler that
an immediate value can be stored directly in any of the normal 16-bit
registers (e.g. 'movw hl, #123'). I'm beginning to wonder whether it's
the unoptimized code being fed in that's causing problems.
Taking a slight variation on my original test code (removing the
'volatile' keyword and accessing an 8-bit memory location):
--------
#define SOE0L (*(unsigned char *)0xF012A)
void orTest()
{
SOE0L |= 3;
}
--------
produces (with -O0):
_test:
0000 C9 F0 2A 01   movw r8, #298
0004 C9 F2 2A 01   movw r10, #298
0008 AD F2         movw ax, r10
000a BD F4         movw r12, ax
000c FA F4         movw hl, r12
000e 8B            mov a, [hl]
000f 9D F2         mov r10, a
0011 6A F2 03      or r10, #3
0014 AD F0         movw ax, r8
0016 BD F4         movw r12, ax
0018 DA F4         movw bc, r12
001a 8D F2         mov a, r10
001c 48 00 00      mov [bc], a
001f D7            ret
In some cases the normal optimization passes remove much, if not all,
of this unnecessary register shuffling, but in others they do not.
The conditions on the movhi_real insn allow an immediate value to be
stored directly in (for example) HL, and yet every instance I can find
in my project takes the form
movw r8, #298
movw ax, r8
movw hl, ax
and no rearrangement of the conditions that I've tried makes GCC load
HL directly. It's determined to put the immediate value into rX and
then copy it into ax (which is also unnecessary).
I see the same problem with 'cmp' when the value to be compared is in
the A register:
mov r8, a
cmp r8, #3
The A register is the one register that is almost guaranteed to be
usable with any instruction, and copying it to r8 (or wherever) to
perform the comparison not only wastes two bytes on the move but also
makes the cmp instruction a byte longer, so five bytes are used instead
of two.
I looked at the code produced for IA64 and ARM targets, and although I'm
less familiar with those instruction sets, their output didn't appear to
do as much needless copying, which strengthens my suspicion that
something in the RL78 backend needs 'tweaking'.
The suggestions regarding 'volatile' were very helpful, and I've made
some good savings elsewhere by adding support for different addressing
modes and more efficient instructions, but there are still a number of
(theoretically) easy pickings that should (I feel) be achievable before
more complicated optimizations need to be looked at.
As ever, any suggestions are very gratefully received. I hope to be
able to post some patches once I'm comfortable that I haven't missed
anything obvious or done something stupid.
Regards,
Richard.