On Thu, Apr 15, 2010 at 03:53:53PM +0200, Roman Fietze wrote:
> Hello Bill,
>
> On Thursday 15 April 2010 15:01:59 Bill Gatliff wrote:
>
> > Are you talking about this code here?
> >
> > void
> > shadowUpdatePacked (ScreenPtr pScreen,
> >                     shadowBufPtr pBuf)
> > {
> > ...
> >     while (i--)
> >         *win++ = *sha++;
>
> Yes. I added a routine like
>
> /* Swap frame buffer bytes in 32 bit value. */
> static __inline unsigned int
> fbbits_swap32(unsigned int __bsx)
> {
>     return ((((__bsx) & 0xff000000) >> 8) | (((__bsx) & 0x00ff0000) << 8) |
>             (((__bsx) & 0x0000ff00) >> 8) | (((__bsx) & 0x000000ff) << 8));
> }

I don't see the difference from:

    return (((__bsx & 0xff00ff00) >> 8) | ((__bsx & 0x00ff00ff) << 8));

for which the compiler (GCC 4.3.2) generates better code, as shown
below.

In the first case:

.L3:
    lwzx 9,3,8
    rlwinm 0,9,8,0,7
    rlwinm 11,9,24,8,15
    rlwinm 10,9,24,24,31
    or 0,0,11
    or 0,0,10
    rlwinm 9,9,8,16,23
    or 0,0,9
    stwx 0,4,8
    addi 8,8,4
    bdnz .L3

in the second:

.L9:
    lwzx 0,3,11
    and 9,0,10
    and 0,0,8
    slwi 0,0,8
    srwi 9,9,8
    or 0,0,9
    stwx 0,4,11
    addi 11,11,4
    bdnz .L9

saving 2 instructions. AFAIR the MPC5200 is based on a 603e core, so
the integer instructions have to go to the single integer unit that
can handle them (the second IU can only handle add and cmp), so the
minimum is 5 clocks/iteration versus 7. Even with two IUs (or 3), the
second code has better latency.

    Gabriel
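
For anyone who wants to convince themselves that the two-mask expression really is equivalent to the quoted four-mask fbbits_swap32(), here is a minimal sketch of a check. The names swap_four_masks(), swap_two_masks() and count_mismatches() are made up for illustration and are not from the shadow code. uint32_t is used instead of unsigned int only to pin the width. Both forms swap the two bytes inside each 16-bit half of the word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Four-mask form, same expression as the quoted fbbits_swap32(). */
static inline uint32_t swap_four_masks(uint32_t x)
{
    return ((x & 0xff000000u) >> 8) | ((x & 0x00ff0000u) << 8) |
           ((x & 0x0000ff00u) >> 8) | ((x & 0x000000ffu) << 8);
}

/* Two-mask form: masks with the same shift direction are merged. */
static inline uint32_t swap_two_masks(uint32_t x)
{
    return ((x & 0xff00ff00u) >> 8) | ((x & 0x00ff00ffu) << 8);
}

/* Returns how many of the sample inputs make the two forms disagree. */
static unsigned count_mismatches(void)
{
    static const uint32_t samples[] = {
        0x00000000u, 0xffffffffu, 0x11223344u, 0xdeadbeefu,
        0x000000ffu, 0x0000ff00u, 0x00ff0000u, 0xff000000u,
    };
    unsigned bad = 0;

    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        if (swap_four_masks(samples[i]) != swap_two_masks(samples[i]))
            bad++;
    }

    /* Worked value: bytes 11 22 33 44 become 22 11 44 33. */
    assert(swap_two_masks(0x11223344u) == 0x22114433u);
    return bad;
}
```

count_mismatches() should return 0. The merge is valid because the 0xff000000 and 0x0000ff00 terms are both shifted right by 8, and the 0x00ff0000 and 0x000000ff terms are both shifted left by 8, so each pair can share one mask and one shift.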