On Wednesday 30 July 2008 19:14, Agner Fog wrote:
> I agree that the OpenSolaris memcpy is bigger than necessary. However,
> it is necessary to have 16 branches for covering all possible alignments
> modulo 16. This is because, unfortunately, there is no XMM shift
> instruction with a variable count, only with a constant count, so we
> need one branch for each value of the shift count. Since only one of the
> branches is used, it doesn't take much space in the code cache. The
> speed is improved by a factor 4-5 by this 16-branch algorithm, so it is
> certainly worth the extra complexity.

I doubt that large memcpys with odd-byte alignment are anywhere near
typical. malloc and mmap both return well-aligned buffers (say, 8-byte
aligned). Static and on-stack objects are also at least word-aligned
99% of the time.

memcpy can just use "relatively simple" code for copies in which
either src or dst is not word-aligned. This cuts the possibilities
down from 16 to 4 (or even 2?).
--
vda
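
A minimal sketch of the dispatch suggested in the reply above, assuming
a "word" means sizeof(uintptr_t) bytes. The names is_word_aligned and
sketch_memcpy are hypothetical and come from neither OpenSolaris nor
any libc. The function returns which path it took only so the choice
can be checked; a real memcpy would return dst. It shows the shape of
the idea, not a tuned implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero if p sits on a machine-word boundary. */
static int is_word_aligned(const void *p)
{
    return ((uintptr_t)p % sizeof(uintptr_t)) == 0;
}

/*
 * Hypothetical sketch: copy n bytes from src to dst (no overlap).
 * Returns 0 if the simple byte loop was used, 1 if the word loop was.
 */
static int sketch_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Either pointer misaligned: take the "relatively simple" path. */
    if (!is_word_aligned(dst) || !is_word_aligned(src)) {
        while (n--)
            *d++ = *s++;
        return 0;
    }

    /* Both pointers word-aligned: move one word per iteration. */
    assert(is_word_aligned(d) && is_word_aligned(s));
    while (n >= sizeof(uintptr_t)) {
        uintptr_t w;
        /* Fixed-size memcpy avoids aliasing problems; compilers
         * normally lower it to a single load and a single store. */
        memcpy(&w, s, sizeof w);
        memcpy(d, &w, sizeof w);
        s += sizeof w;
        d += sizeof w;
        n -= sizeof w;
    }

    /* Fewer than one word left: finish byte by byte. */
    while (n--)
        *d++ = *s++;
    return 1;
}
```

With this split, the fast path needs no per-alignment branches at all;
whether the simple path is fast enough for the misaligned cases is the
open question in the thread.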