Hi again John,

Thank you for the patient answers :-)
Thank you for pointing this out: I was mistakenly testing your Sandy Bridge code on Haswell (lacking -DRTE_MACHINE_CPUFLAG_AVX2).

Correcting that, your code is both the fastest and the smallest in my humble micro benchmarking tests. Looks like you have done great work! You probably knew that already :-) but thank you for walking me through it.

The code compiles to 745 bytes of object code (smaller than glibc 2.20 memcpy) and cachebenches like this:

        Memory Copy Library Cache Test

C Size    Nanosec   MB/sec      % Chnge
-------   -------   ---------   -------
256       0.01      97587.60    1.00
384       0.01      97628.83    1.00
512       0.01      97613.95    1.00
768       0.01      147811.44   0.66
1024      0.01      158938.68   0.93
1536      0.01      168487.49   0.94
2048      0.01      174278.83   0.97
3072      0.01      156922.58   1.11
4096      0.01      145811.59   1.08
6144      0.01      157388.27   0.93
8192      0.01      149616.95   1.05
12288     0.01      149064.26   1.00
16384     0.01      107895.06   1.38

The key difference from my perspective is that glibc 2.20 memcpy performance goes way down for >= 2048 bytes when they switch from vector moves to string moves, while your code stays consistent.

I will take it for a spin in a real application.

Cheers,
-Luke