[Bug tree-optimization/39468] New: Constant propagation in a number of tree passes does not take into account machine costs.

ramana dot r at gmail dot com Sun, 15 Mar 2009 15:18:51 -0700

As reported in the thread at http://gcc.gnu.org/ml/gcc/2009-03/msg00369.html


Using gcc 4.4.0, I compiled a function and found the generated code a tad
long. The command line was:

gcc -Os -mcpu=arm7tdmi-s -S func.c

although the output is pretty much the same with -O2 or -O3 as well (only
a few instructions longer).

The function is basically an unrolled 32-bit unsigned division by 1E9:

unsigned int divby1e9( unsigned int num, unsigned int *quotient )
{
    unsigned int dig;
    unsigned int tmp;
    tmp = 1000000000u;
    dig = 0;
    if ( num >= tmp ) {
        tmp <<= 2;                  /* tmp is now 4E9 */
        if ( num >= tmp ) {
            num -= tmp;
            dig  = 4;
        }
        else {
            tmp >>= 1;              /* tmp is now 2E9 */
            if ( num >= tmp ) {
                num -= tmp;
                dig  = 2;
            }
            tmp >>= 1;              /* tmp is back to 1E9 */
            if ( num >= tmp ) {
                num -= tmp;
                dig++;
            }
        }
    }
    *quotient = dig;
    return num;
}
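
As a sanity check on the claim that this is a division by 1E9, the function
can be compared against the C `/` and `%` operators. The sketch below is not
part of the original report: the helper name `check_divby1e9` and the sample
values are chosen here for illustration, and it assumes `unsigned int` is
32 bits wide, as on the ARM target.

```c
#include <assert.h>

/* Function from the report, with the output parameter spelled "quotient". */
unsigned int divby1e9(unsigned int num, unsigned int *quotient)
{
    unsigned int dig = 0;
    unsigned int tmp = 1000000000u;

    if (num >= tmp) {
        tmp <<= 2;                      /* 4E9, still fits in 32 bits */
        if (num >= tmp) {
            num -= tmp;
            dig = 4;
        } else {
            tmp >>= 1;                  /* 2E9 */
            if (num >= tmp) {
                num -= tmp;
                dig = 2;
            }
            tmp >>= 1;                  /* 1E9 */
            if (num >= tmp) {
                num -= tmp;
                dig++;
            }
        }
    }
    *quotient = dig;
    return num;
}

/* Hypothetical helper: compare the unrolled code with / and % on values
   around every multiple of 1E9 that a 32-bit number can reach. */
void check_divby1e9(void)
{
    static const unsigned int samples[] = {
        0u, 1u, 999999999u, 1000000000u, 1999999999u, 2000000000u,
        2999999999u, 3000000000u, 3999999999u, 4000000000u, 4294967295u
    };
    unsigned int i;

    for (i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        unsigned int q = 99u;           /* sentinel, must be overwritten */
        unsigned int r = divby1e9(samples[i], &q);
        assert(q == samples[i] / 1000000000u);
        assert(r == samples[i] % 1000000000u);
    }
}
```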

The compiler generated the following code:

divby1e9:
       @ Function supports interworking.
       @ args = 0, pretend = 0, frame = 0
       @ frame_needed = 0, uses_anonymous_args = 0
       @ link register save eliminated.
       ldr     r3, .L10
       cmp     r0, r3
       movls   r3, #0
       bls     .L3
       ldr     r2, .L10+4
       cmp     r0, r2
       addhi   r0, r0, #293601280
       addhi   r0, r0, #1359872
       addhi   r0, r0, #6144
       movhi   r3, #4
       bhi     .L3
.L4:
       ldr     r2, .L10+8
       cmp     r0, r2
       movls   r3, #0
       bls     .L6
       add     r0, r0, #-2013265920
       add     r0, r0, #13238272
       add     r0, r0, #27648
       cmp     r0, r3
       movls   r3, #2
       bls     .L3
       mov     r3, #2
.L6:
       add     r0, r0, #-1006632960
       add     r0, r0, #6619136
       add     r0, r0, #13824
       add     r3, r3, #1
.L3:
       str     r3, [r1, #0]
       bx      lr
.L11:
       .align  2
.L10:
       .word   999999999
       .word   -294967297
       .word   1999999999
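
The immediates in the listing can be checked by hand. An ARM data-processing
instruction can encode only an 8-bit value rotated right by an even amount,
which is why each subtraction is split into three adds. The sketch below is
an illustration added to this report, not compiler output: the helper names
are made up, and it assumes 32-bit wrap-around arithmetic via `uint32_t`. It
confirms that each three-add sequence amounts to subtracting 4E9, 2E9 or 1E9,
and that 1E9 itself does not fit a single immediate.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if v is an 8-bit value rotated right by an even amount,
   i.e. usable as a single ARM data-processing immediate. */
int is_arm_immediate(uint32_t v)
{
    unsigned int rot;

    if (v <= 0xFFu)
        return 1;
    for (rot = 2; rot < 32; rot += 2) {
        /* Rotating left by rot undoes a rotate right by rot. */
        uint32_t undone = (v << rot) | (v >> (32 - rot));
        if (undone <= 0xFFu)
            return 1;
    }
    return 0;
}

/* Sum of three add immediates, modulo 2^32. */
uint32_t sum3(uint32_t a, uint32_t b, uint32_t c)
{
    return a + b + c;
}

/* Hypothetical helper: check the constants that appear in the listing. */
void check_listing_constants(void)
{
    /* addhi sequence: adding 2^32 - 4E9 is the same as subtracting 4E9. */
    assert(sum3(293601280u, 1359872u, 6144u) == 294967296u);
    /* add sequence after .L4: subtracts 2E9. */
    assert(sum3((uint32_t)-2013265920, 13238272u, 27648u)
           == (uint32_t)-2000000000);
    /* add sequence at .L6: subtracts 1E9. */
    assert(sum3((uint32_t)-1006632960, 6619136u, 13824u)
           == (uint32_t)-1000000000);
    /* Each piece is a valid immediate; 1E9 as a whole is not. */
    assert(is_arm_immediate(293601280u));
    assert(is_arm_immediate((uint32_t)-2013265920));
    assert(!is_arm_immediate(1000000000u));
}
```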


Note that it is sub-optimal on two counts.

First, each constant built inline takes 3 instructions and 3 clocks.
Storing the constant in the literal pool and fetching it with an ldr also
takes 3 clocks, but only two 32-bit words, and identical constants need to
be stored only once. The speed argument applies only to the ARM7TDMI-S,
which has no caches, so that is a minor issue, but the memory saving holds
no matter what ARM core you have (note that -Os was specified).

Second, and this is the real problem: had the compiler not tried to be
overly clever and simply compiled the code as it was written, then instead
of loading the constants 4 times, at the cost of 3 instructions each, it
could have loaded the constant only once and generated each subsequent
constant with a single-word, single-clock shift. The code would have been
both shorter *and* faster, and some of the jumps could have been eliminated.
Practically each C statement line (except the braces) corresponds to one
assembly instruction, so without being clever, just translating what's
written, it could be done in 20 words instead of 30.
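
One way to ask GCC for the "as written" translation is to hide the value of
tmp from the optimizer, so that the shifts cannot be folded into new
constants. The sketch below is a possible workaround and is not taken from
the report: the names `HIDE_VALUE` and `divby1e9_opaque` are made up here, the
empty asm statement is a GCC extension, and whether it actually yields the
20-word sequence on a given GCC version has not been verified.

```c
#include <assert.h>

/* GCC extension: an empty asm that claims to read and modify x in a
   register, so the optimizer can no longer treat x as a known constant. */
#define HIDE_VALUE(x) __asm__ ("" : "+r" (x))

unsigned int divby1e9_opaque(unsigned int num, unsigned int *quotient)
{
    unsigned int dig = 0;
    unsigned int tmp = 1000000000u;

    HIDE_VALUE(tmp);    /* from here on, tmp is opaque to the compiler */
    if (num >= tmp) {
        tmp <<= 2;
        if (num >= tmp) {
            num -= tmp;
            dig = 4;
        } else {
            tmp >>= 1;
            if (num >= tmp) {
                num -= tmp;
                dig = 2;
            }
            tmp >>= 1;
            if (num >= tmp) {
                num -= tmp;
                dig++;
            }
        }
    }
    *quotient = dig;
    return num;
}

/* Hypothetical helper: the barrier must not change the results. */
void check_opaque(void)
{
    unsigned int q = 99u;

    assert(divby1e9_opaque(2500000000u, &q) == 500000000u);
    assert(q == 2u);
    assert(divby1e9_opaque(4294967295u, &q) == 294967295u);
    assert(q == 4u);
}
```

The barrier only changes what the optimizer is allowed to assume; the
function still returns the same quotient and remainder as the original.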


-- 
           Summary: Constant propagation in a number of tree passes does not
                    take into account machine costs.
           Product: gcc
           Version: lto
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: ramana dot r at gmail dot com
 GCC build triplet: i686-unknown-linux-gnu
  GCC host triplet: i686-unknown-linux-gnu
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39468
