On 01.06.09 08:14, Joakim Tjernlund wrote:
.. not even 4.2.2, which is fairly modern, will get it right. It breaks very easily, as gcc has never been any good at this type of optimization. Sometimes small changes will make gcc unhappy and it won't do the right optimization.

It's even worse... Look at the assembly output of this simple function:

<snip>
#include <stdint.h>

/* Copy n bytes from src to dst, one 32-bit word at a time.
 * n must be a non-zero multiple of 4; the arithmetic on the void
 * pointers relies on the GCC extension that treats sizeof(void) as 1. */
void loop2(void * src, void * dst, int n)
{
  volatile uint32_t * _dst = (volatile uint32_t *) (dst - 4);
  volatile uint32_t * _src = (volatile uint32_t *) (src - 4);
  n >>= 2;
  do {
    *(++_dst) = *(++_src);
  } while (--n);
}
</snip>
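
(For anyone who wants to reproduce this: the listings below can be obtained e.g. by compiling the file with -S and the options quoted for each compiler. A quick driver along the following lines, with purely illustrative names and buffer sizes, can be used to sanity-check that the copy itself is correct.)

<snip>
/* test_loop2.c -- illustrative driver; names and sizes are placeholders */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

void loop2(void *src, void *dst, int n);    /* the function above */

int main(void)
{
        static uint32_t in[64], out[64];
        int i;

        for (i = 0; i < 64; i++)
                in[i] = (uint32_t) i;

        loop2(in, out, sizeof(in));         /* n is a byte count, multiple of 4 */

        printf("copy %s\n", memcmp(in, out, sizeof(in)) ? "FAILED" : "ok");
        return 0;
}
</snip>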

gcc 4.0.1, as shipped with Apple's Developer Tools on Tiger, with the options "-O3 -mcpu=603e -mtune=603e" produces

<snip>
_loop2:
        srawi r5,r5,2
        mtctr r5
        addi r4,r4,-4
        addi r3,r3,-4
L11:
        lwzu r0,4(r3)
        stwu r0,4(r4)
        bdnz L11
        blr
</snip>

which looks perfect to me. However, with the same options, gcc 4.3.3 on Ubuntu/PPC produces

<snip>
loop2:
        srawi 5,5,2
        stwu 1,-16(1)
        mtctr 5
        li 9,0
.L8:
        lwzx 0,3,9
        stwx 0,4,9
        addi 9,9,4
        bdnz .L8
        addi 1,1,16
        blr
</snip>

which wastes a register and an extra instruction in the loop core and fiddles with the stack pointer for no good reason. Gcc 4.4.0 produces

<snip>
loop2:
        srawi 5,5,2
        mtctr 5
        li 9,0
.L9:
        lwzx 0,3,9
        stwx 0,4,9
        addi 9,9,4
        bdnz .L9
        blr
</snip>

which drops the r1 accesses, but still produces the suboptimal loop. Is this a gcc regression, or did I miss something here? Probably the only bulletproof way is to write such core loops in assembly, something like the sketch below... :-/
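
A minimal, untested sketch of what I have in mind (the function name is just a placeholder; it assumes a 32-bit PowerPC target, GCC inline assembly, and the same preconditions on n as above):

<snip>
#include <stdint.h>

/* Untested sketch: hand-coded copy loop using the lwzu/stwu update
 * forms and the CTR register. n is a byte count and must be a
 * non-zero multiple of 4, as before. */
void loop2_asm(void *src, void *dst, int n)
{
        uint32_t tmp;
        const char *s = (const char *) src - 4;
        char *d = (char *) dst - 4;

        __asm__ __volatile__(
                "mtctr  %3\n\t"         /* word count into CTR               */
                "1:\n\t"
                "lwzu   %0,4(%1)\n\t"   /* load word, update source pointer  */
                "stwu   %0,4(%2)\n\t"   /* store word, update dest pointer   */
                "bdnz   1b"             /* decrement CTR, loop while nonzero */
                : "=&r" (tmp), "+b" (s), "+b" (d)
                : "r" (n >> 2)
                : "ctr", "memory");
}
</snip>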

Thanks, Albrecht.
