Re: gcc will become the best optimizing x86 compiler

Denys Vlasenko Thu, 31 Jul 2008 16:52:15 -0700

On Thursday 31 July 2008 11:36, Dave Korn wrote:
> Agner Fog wrote on 31 July 2008 07:14:
> 
> > Denys Vlasenko wrote:
> >> I tend to doubt that odd-byte aligned large memcpys are anywhere
> >> near typical. malloc and mmap both return well-aligned buffers
> >> (say, 8 byte aligned). Static and on-stack objects are also
> >> at least word-aligned 99% of the time.
> >> 
> >> memcpy can just use "relatively simple" code for copies in which
> >> either src or dst is not word aligned. This cuts possibilities down from
> >> 16 to 4 (or even 2?). 
> >> 
> > The XMM code is still more than 3 times faster than rep movsl when data
> > are aligned by 4 or 8, but not by 16.
> > Even if odd addresses are rare, they must be supported, but we can put
> > the most common cases first.
> 
>   In the real world, unaligned memcpys are anything but rare.  Everything's
> networked these days, remember?  Stuff gets misaligned real quick when you
> start adding and removing various network layer headers and trailers to
> unpredictably-sized packets.


Headers are usually rounded to at least a 16-bit multiple.
Trailers do not matter to alignment.
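
To make the header point concrete, here is a minimal sketch of how a payload's alignment follows from the header length in front of it. The function names are made up for illustration. The 14-byte Ethernet header and 20-byte minimal IPv4 header are the standard sizes; the 4096 base address in the usage note below is only an assumed page-aligned buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Largest power of two, capped at 16, that divides addr.
 * Hypothetical helper, for illustration only. */
static size_t alignment_of(uintptr_t addr)
{
    size_t a = 1;

    /* Double the candidate while the corresponding low bit is clear. */
    while (a < 16 && (addr & a) == 0)
        a <<= 1;
    return a;
}

/* Alignment of the data that follows hdr_len header bytes
 * in a buffer starting at address base. */
static size_t payload_alignment(uintptr_t base, size_t hdr_len)
{
    return alignment_of(base + hdr_len);
}
```

With a buffer at 4096, payload_alignment(4096, 14) gives 2 and payload_alignment(4096, 20) gives 4: an even-sized header leaves the payload at least 16-bit aligned, though not necessarily word aligned.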

This happens in kernel space, and kernel space differs in several ways:
* memcpys are rarely bigger than a page, which negates any gains from
  non-temporal MOVs
* memcpy cannot use XMM registers (otherwise lazy FPU saving doesn't work)
* kernel developers would never accept a multi-kilobyte memcpy
  implementation anyway
Not to mention that network packets are ~1500 bytes long; even jumbo packets
are only 9k tops.
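
For illustration, a "relatively simple" copy of the kind discussed above might look like the sketch below: a word-at-a-time loop when both pointers are word aligned, and a plain byte loop otherwise. This is not the kernel's or glibc's memcpy. The name simple_copy and the word_t typedef are assumptions, and the may_alias attribute is a GCC extension used here to keep the word accesses legal under strict aliasing.

```c
#include <stddef.h>
#include <stdint.h>

/* Word type that may alias any object (GCC/Clang extension). */
typedef unsigned long __attribute__((may_alias)) word_t;

/* Hypothetical name; a sketch, not a tuned implementation. */
static void *simple_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Fast path: both pointers are word aligned. */
    if ((((uintptr_t)d | (uintptr_t)s) & (sizeof(word_t) - 1)) == 0) {
        word_t *dw = (word_t *)d;
        const word_t *sw = (const word_t *)s;

        while (n >= sizeof(word_t)) {
            *dw++ = *sw++;
            n -= sizeof(word_t);
        }
        d = (unsigned char *)dw;
        s = (const unsigned char *)sw;
    }

    /* Slow path and tail: copy the remaining bytes one at a time. */
    while (n--)
        *d++ = *s++;

    return dst;
}
```

It uses no XMM registers and stays far below a kilobyte of code, which is the trade-off argued for here: the misaligned case is handled correctly but not specially optimized.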
--
vda
