On Wed, 2009-06-03 at 08:51 +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2009-06-02 at 20:45 +0200, Albrecht Dreß wrote:
> 
> > 
> > which drops the r1 accesses, but still produces the sub-optimal loop.   
> > Is this a gcc regression, or did I miss something here?  Probably the  
> > only bullet-proof way is to write some core loops in assembly... :-/
> 
> Well, gcc may be right here. What you call the "optimal" loop uses the
> lwzu instruction. An interesting thing about this instruction is that
> it updates two GPRs at completion (I'm ignoring the load multiple and
> string instructions on purpose here).

> I wouldn't be surprised thus if the loop variant with the separate add
> ends up more efficient on most implementations around.

On an e300 core, the lwzu/stwu variant is about 20% faster, so at least
one core prefers that optimization.
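
For reference, the two loop shapes being compared can be sketched in C
roughly as below. This is only an illustration: the function names are
hypothetical and are not taken from the code under discussion, and
whether gcc actually emits lwzu/stwu for the first form depends on the
compiler version and flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name. Pre-increment form: each access bumps the pointer
 * and then uses it, which is the shape that maps naturally onto the
 * update-form instructions (lwzu/stwu). The first word is copied
 * outside the loop so the pointers never step before the buffers. */
void copy_words_update(uint32_t *dst, const uint32_t *src, size_t n)
{
    if (n == 0)
        return;
    *dst = *src;
    while (--n)
        *++dst = *++src;
}

/* Hypothetical name. Indexed form: the address arithmetic is kept
 * separate from the load/store, which corresponds to the variant with
 * a plain lwz/stw plus a separate add. */
void copy_words_indexed(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

Both functions copy the same n words; only the addressing pattern
differs, so comparing the generated assembly (gcc -O2 -S) and timing
each on the target core shows which variant a given implementation
prefers.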


_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev