On Tue, 2009-06-02 at 20:45 +0200, Albrecht Dreß wrote:
>
> which drops the r1 accesses, but still produces the sub-optimal loop.
> Is this a gcc regression, or did I miss something here? Probably the
> only bullet-proof way is to write some core loops in assembly... :-/

Well, gcc may be right here. What you call the "optimal" loop uses the lwzu instruction. An interesting thing about this instruction is that it updates two GPRs at completion (I'm ignoring the load multiple and string instructions on purpose here).

Now, quite a few simple implementations don't have two write ports to the GPR file, nor the logic to handle hazards properly with two GPRs being updated... which means the instruction is very likely to take a very inefficient path through the pipeline.

On server processors, I'm pretty sure it's just cracked into a load and an add anyway.

I wouldn't be surprised thus if the loop variant with the separate add ends up more efficient on most implementations around.

Of course, the loop above could use some unrolling to put some distance between the load and the store of the loaded value.

Cheers,
Ben.
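
To illustrate the unrolling remark, here is a minimal sketch in C. It assumes the loop under discussion is a plain word-by-word copy and that the two buffers do not overlap. The function name copy_words_unrolled and the unroll factor of four are hypothetical choices for illustration, not taken from the original code. Each iteration issues all four loads before the first store, so no store has to wait on the load immediately before it. Which instructions gcc actually emits for this (update-form loads or a separate add) depends on the compiler version, flags and target CPU, so check the generated assembly rather than assuming.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: copy n 32-bit words from src to dst.
 * Assumes src and dst do not overlap.
 */
void copy_words_unrolled(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t i = 0;

    /* Main loop: four loads first, then four stores, so each loaded
     * value has some distance before it is stored. */
    for (; i + 4 <= n; i += 4) {
        uint32_t a = src[i];
        uint32_t b = src[i + 1];
        uint32_t c = src[i + 2];
        uint32_t d = src[i + 3];

        dst[i]     = a;
        dst[i + 1] = b;
        dst[i + 2] = c;
        dst[i + 3] = d;
    }

    /* Tail: copy the remaining 0..3 words one at a time. */
    for (; i < n; i++)
        dst[i] = src[i];
}
```

Whether this beats the simple loop on a given core is something to measure, not assume.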