Re: ARM compiler rewriting code to be longer and slower

Ramana Radhakrishnan Sun, 15 Mar 2009 15:19:36 -0700

Hi Zoltan,

<some parts snipped>
On Fri, Mar 13, 2009 at 9:16 AM,  <zol...@bendor.com.au> wrote:


> Note that it is sub-optimal on two counts.
>
> First, each loading of a constant takes 3 instructions and 3 clocks.
> Storing the constant and fetching it using an ldr also takes 3 clocks but
> only two 32-bit words and identical constants need to be stored only once.
> The speed increase is only true on the ARM7TDMI-S, which has no caches, so
> that's just a minor issue, but the memory saving is true no matter what
> ARM core you have (note that -Os was specified).
>
> Second, and this is the real problem, if the compiler did not want to be
> overly clever and compiled the code as it was written, then instead of
> loading the constants 4 times, at the cost of 3 instructions each, it could
> have loaded it only once and then generated the next constants at the cost
> of a single-word, single clock shift. The code would have been rather
> shorter *and* faster, plus some of the jumps could have been eliminated.
> Practically each C statement line (except the braces) corresponds to one
> assembly instruction, so without being clever, just translating what's
> written, it could be done in 20 words instead of 30.

I took a look at this for some time on Friday and found that the
conditional constant propagation pass (CCP, tree-ssa-ccp.c) has pushed
the constant values down into their uses. CCP runs early in the
optimization pipeline because constant propagation is, in general, a
good idea, and several other tree optimizers likewise identify such
values and fold constants into expressions. The problem appears as
soon as one has an architecture where the cost of loading an immediate
is higher than the cost of a simple arithmetic operation: the final
code generated might then not be the most efficient.
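
To make the effect concrete, here is a minimal C sketch of the two
forms involved. It is not the original newpr.c (which is not shown in
this mail); the function names, the signature and the exact control
flow are my assumptions, chosen to match Zoltan's description of one
constant being shifted versus four separately loaded constants. Both
functions compute the same quotient and remainder; only the way the
divisor values are obtained differs.

```c
#include <stdint.h>

/* Hypothetical "as written" form: one constant, later divisors derived
   from it by single shifts. */
uint32_t divby1e9_as_written(uint32_t x, uint32_t *quot)
{
    uint32_t d = 1000000000u;   /* the only constant that must be loaded */
    uint32_t q = 0;

    if (x >= d) {
        d <<= 2;                /* 4e9, still fits in 32 bits */
        if (x >= d) {
            x -= d;
            q += 4;
        } else {
            d >>= 1;            /* 2e9 */
            if (x >= d) { x -= d; q = 2; }
            d >>= 1;            /* 1e9 */
            if (x >= d) { x -= d; q += 1; }
        }
    }
    *quot = q;
    return x;                   /* remainder */
}

/* Hypothetical form after constant propagation: every shift has been
   folded, so each comparison uses its own 32-bit immediate. */
uint32_t divby1e9_propagated(uint32_t x, uint32_t *quot)
{
    uint32_t q = 0;

    if (x >= 1000000000u) {
        if (x >= 4000000000u) {
            x -= 4000000000u;
            q += 4;
        } else {
            if (x >= 2000000000u) { x -= 2000000000u; q = 2; }
            if (x >= 1000000000u) { x -= 1000000000u; q += 1; }
        }
    }
    *quot = q;
    return x;                   /* remainder */
}
```

On a target where a shift costs one instruction but an arbitrary
32-bit immediate does not, the second form needs more instructions or
more literal pool entries even though it looks simpler at the tree
level.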


With some more experimentation in the last hour or so I found that,
for this particular case, I can get the following code:

divby1e9:
        @ Function supports interworking.
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        ldr     r3, .L7
        cmp     r0, r3
        mov     r2, #0
        bcc     .L2
        mov     r3, r3, asl #2
        cmp     r0, r3
        rsbcs   r0, r3, r0
        addcs   r2, r2, #4
        bcs     .L2
        mov     r3, r3, lsr #1
        cmp     r0, r3
        rsbcs   r0, r3, r0
        mov     r3, r3, lsr #1
        movcs   r2, #2
        cmp     r0, r3
        rsbcs   r0, r3, r0
        addcs   r2, r2, #1
.L2:
        str     r2, [r1, #0]
        bx      lr
.L8:
        .align  2
.L7:
        .word   1000000000
        .size   divby1e9, .-divby1e9
        .ident  "GCC: (GNU) 4.4.0 20090313 (experimental) [trunk revision 143499]"


but only with the following command line options:

./xgcc -B`pwd` -S -Os newpr.c -fno-tree-ccp -fno-tree-fre \
    -fno-tree-vrp -fno-tree-dominator-opts -fno-gcse


I'm not sure about the best way to fix this but I've filed this for
the moment as

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39468



cheers
Ramana

---
Ramana Radhakrishnan
ARM Ltd.
