Re: gcc will become the best optimizing x86 compiler

Denys Vlasenko Wed, 30 Jul 2008 11:21:59 -0700

On Wednesday 30 July 2008 19:14, Agner Fog wrote:
> I agree that the OpenSolaris memcpy is bigger than necessary. However, 
> it is necessary to have 16 branches for covering all possible alignments 
> modulo 16. This is because, unfortunately, there is no XMM shift 
> instruction with a variable count, only with a constant count, so we 
> need one branch for each value of the shift count. Since only one of the 
> branches is used, it doesn't take much space in the code cache. The 
> speed is improved by a factor 4-5 by this 16-branch algorithm, so it is 
> certainly worth the extra complexity.


I tend to doubt that odd-byte aligned large memcpys are anywhere
near typical. malloc and mmap both return well-aligned buffers
(say, 8-byte aligned). Static and on-stack objects are also
at least word-aligned 99% of the time.

memcpy can just use "relatively simple" code for copies in which
either src or dst is not word-aligned. This cuts possibilities down
from 16 to 4 (or even 2?).
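
To make the counting concrete, here is a rough sketch. It is not taken
from OpenSolaris or any real libc, and the names shift_mod16 and
distinct_shifts are made up for illustration. It assumes the shift count
a 16-branch copy has to handle is the offset of src relative to dst
modulo 16, and it counts how many such offsets remain once both pointers
are known to be multiples of some alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: offset of src relative to dst, modulo 16
 * (16 bytes being the width of an XMM register). */
static unsigned shift_mod16(const void *dst, const void *src)
{
    return (unsigned)(((uintptr_t)src - (uintptr_t)dst) & 15u);
}

/* Hypothetical helper: number of distinct relative offsets (mod 16)
 * that can occur when both pointers are multiples of `align`.
 * `align` must be a nonzero divisor of 16: 1, 2, 4, 8 or 16. */
static unsigned distinct_shifts(unsigned align)
{
    unsigned seen[16] = {0};
    unsigned count = 0;

    assert(align != 0 && 16 % align == 0);

    /* Try every pair of residues that the alignment allows. */
    for (unsigned s = 0; s < 16; s += align) {
        for (unsigned d = 0; d < 16; d += align) {
            unsigned k = (s - d) & 15u;  /* unsigned wraparound keeps mod 16 */
            if (!seen[k]) {
                seen[k] = 1;
                count++;
            }
        }
    }
    return count;
}
```

Under that assumption, distinct_shifts(1) gives 16, distinct_shifts(4)
gives 4 and distinct_shifts(8) gives 2, which is where the "4 (or even
2?)" above comes from.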
--
vda
