gcc 3.4 > mainline performance regression

Andrew Haley Fri, 05 Jan 2007 06:02:26 -0800

This is from the gcc-help mailing list.  It's mentioned there for ARM,
but it's just as bad for x86-64.


It appears that memory references to arrays aren't being hoisted out
of loops: in this test case, gcc 3.4 doesn't touch memory at all in
the loop, but 4.3pre (and 4.2, etc) does.

Here's the test case:

void foo(int *a)
{
        int i;
        for (i = 0; i < 1000000; i++)
                a[0] += a[1];
}

gcc 3.4.5 -O2:

.L5:
        leal    (%rcx,%rsi), %edx
        decl    %eax
        movl    %edx, %ecx
        jns     .L5

gcc 4.3pre -O2:

.L2:
        addl    4(%rdi), %eax
        addl    $1, %edx
        cmpl    $1000000, %edx
        movl    %eax, (%rdi)
        jne     .L2
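
In source terms, the 3.4.5 loop above behaves roughly as if the two
array accesses had been hoisted by hand. A minimal sketch of that
transformation follows; the name foo_hoisted and the local variables
are hypothetical and not part of the original test case, and this only
illustrates the effect, not how either compiler implements it:

```c
/* Hypothetical hand-hoisted variant of foo(); illustration only. */
void foo_hoisted(int *a)
{
        int acc = a[0];         /* load a[0] once, before the loop */
        const int inc = a[1];   /* a[1] is loop-invariant: nothing in the loop writes it */
        int i;

        for (i = 0; i < 1000000; i++)
                acc += inc;     /* loop body touches registers only */

        a[0] = acc;             /* single store after the loop */
}
```

The hoisting is safe here because a[0] and a[1] are distinct elements of
the same array, so the store to a[0] cannot change a[1].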

Thoughts?

Andrew.

--- Begin Message ---

Hi David,

I've noticed the same problem with GCC 4.1.3 (in Thumb mode for ARM).

When a simple test file is compiled with -S to get only the assembly
output, the quality of the generated code is quite poor. I've seen an
equivalent inquiry on the CodeSourcery mailing list, noting that even though
the gcc 4.x series performs good optimization, simple loops are compiled
poorly. But I'm quite surprised by the lack of comments on that post...

PS: Happy New Year, everybody!


______________________________

Hi,

We were using GCC 3.4.0 to generate Thumb code for an ARM processor.
Switching to GCC 4.1.1 has improved our code size (we always use the -Os
switch), but has severely degraded execution speed.
After further investigation, we isolated one of the problems in the
following example:
Source code:
void foo(int *a)
{
        int i;
        for (i = 0; i < 1000000; i++)
                a[0] += a[1];
}
The result with GCC 3.4.0 with -mthumb -Os was:
00000000 <foo>:
  0:    b500            push    {lr}
  2:    6803            ldr     r3, [r0, #0]
  4:    4a03            ldr     r2, [pc, #12]   (14 <.text+0x14>)
  6:    6841            ldr     r1, [r0, #4]
  8:    3a01            sub     r2, #1
  a:    185b            add     r3, r3, r1
  c:    2a00            cmp     r2, #0
  e:    d1fb            bne     8 <foo+0x8>
 10:    6003            str     r3, [r0, #0]
 12:    bd00            pop     {pc}
 14:    4240            neg     r0, r0
 16:    000f            lsl     r7, r1, #0
When compiled for ARM with GCC 4.1.1 (and mainline too) with -mthumb
-O1, we get:
00000000 <foo>:
  0:    b510            push    {r4, lr}
  2:    1c04            adds    r4, r0, #0
  4:    2200            movs    r2, #0
  6:    6841            ldr     r1, [r0, #4]
  8:    4803            ldr     r0, [pc, #12]   (18 <.text+0x18>)
  a:    6823            ldr     r3, [r4, #0]
  c:    185b            adds    r3, r3, r1
  e:    3201            adds    r2, #1
 10:    4282            cmp     r2, r0
 12:    d1fb            bne.n   c <foo+0xc>
 14:    6023            str     r3, [r4, #0]
 16:    bd10            pop     {r4, pc}
 18:    4240            negs    r0, r0
 1a:    000f            lsls    r7, r1, #0
-> Not so bad, but slower than 3.4.0

When compiled with -mthumb -Os, we get:
00000000 <foo>:
  0:    b510            push    {r4, lr}
  2:    6802            ldr     r2, [r0, #0]
  4:    6844            ldr     r4, [r0, #4]
  6:    2100            movs    r1, #0
  8:    4b03            ldr     r3, [pc, #12]   (18 <.text+0x18>)
  a:    3101            adds    r1, #1
  c:    1912            adds    r2, r2, r4
  e:    4299            cmp     r1, r3
 10:    d1fa            bne.n   8 <foo+0x8>
 12:    6002            str     r2, [r0, #0]
 14:    bd10            pop     {r4, pc}
 16:    0000            lsls    r0, r0, #0
 18:    4240            negs    r0, r0
 1a:    000f            lsls    r7, r1, #0
-> The load of the loop end value is performed within the loop!

When compiled with -mthumb -O3, we get:
00000000 <foo>:
  0:    b530            push    {r4, r5, lr}
  2:    6802            ldr     r2, [r0, #0]
  4:    4d05            ldr     r5, [pc, #20]   (1c <.text+0x1c>)
  6:    1d04            adds    r4, r0, #4
  8:    2100            movs    r1, #0
  a:    6823            ldr     r3, [r4, #0]
  c:    3101            adds    r1, #1
  e:    18d3            adds    r3, r2, r3
 10:    1c1a            adds    r2, r3, #0
 12:    6003            str     r3, [r0, #0]
 14:    42a9            cmp     r1, r5
 16:    d1f8            bne.n   a <foo+0xa>
 18:    bd30            pop     {r4, r5, pc}
 1a:    0000            lsls    r0, r0, #0
 1c:    4240            negs    r0, r0
 1e:    000f            lsls    r7, r1, #0
-> Amazingly slow: a[1] is reloaded and a[0] is stored on every iteration!

Does anybody have a magic set of options to generate code as efficient and
small as 3.4.0 did?
Thanks in advance for any hints on this problem.
David




--- End Message ---
