https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85721
--- Comment #4 from Mathias Stearn <redbeard0531 at gmail dot com> ---
Marc Glisse pointed out at
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85720#c3 that I missed an
aliasing case when I created this ticket. memmove isn't a valid replacement if
out is in the range (in, in + n).

I did some benchmarking to see what the best solution is and how much this
matters. This seems to do the best on Sandy Bridge, Haswell, and an Opteron
6344 (Piledriver):

[[gnu::noinline, gnu::optimize("s")]]
void copy0(char* out, const char* in, size_t n) {
    if (n >= 8 && __builtin_expect(out >= in + n || out + n <= in, 1)) {
        memcpy(out, in, n);
        return;
    }

    for (size_t i = 0; i < n; i++) {
        out[i] = in[i];
    }
}

copy0(char*, char const*, unsigned long):
        cmp     rdx, 7
        jbe     .L7
        lea     rax, [rsi+rdx]
        cmp     rdi, rax
        jnb     .L3
        lea     rax, [rdi+rdx]
        cmp     rsi, rax
        jb      .L7
.L3:
        jmp     memcpy
.L7:
        xor     eax, eax
.L5:
        cmp     rax, rdx
        je      .L1
        mov     cl, BYTE PTR [rsi+rax]
        mov     BYTE PTR [rdi+rax], cl
        inc     rax
        jmp     .L5
.L1:
        ret

With char, it is substantially faster than the current codegen for the
original loop at -O2 and moderately faster than at -O3, while being about 10%
of the size. With a TriviallyCopyable type with a non-trivial default ctor,
even -O3 copies byte-by-byte, so it is a substantial win there as well.

Let me know if you'd like me to post the benchmark I was using.
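
For readers following the aliasing point, here is a minimal sketch of why memmove cannot replace the forward loop when out lies in (in, in + n). It is not the benchmark mentioned above. The names loop_copy, memmove_copy, and run_overlapping are hypothetical helpers introduced only for this illustration, and the buffer contents and the offset of 1 are arbitrary choices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Forward byte-by-byte copy with the same semantics as the loop in copy0.
void loop_copy(char* out, const char* in, std::size_t n) {
    for (std::size_t i = 0; i < n; i++) {
        out[i] = in[i];
    }
}

// The same interface implemented with memmove, which behaves as if the
// source were first copied to a temporary buffer.
void memmove_copy(char* out, const char* in, std::size_t n) {
    std::memmove(out, in, n);
}

// Hypothetical helper: copies 7 bytes with out = in + 1, so out falls
// inside (in, in + n), and returns the resulting 8-byte buffer.
std::string run_overlapping(void (*copy)(char*, const char*, std::size_t)) {
    char buf[] = "abcdefgh";
    copy(buf + 1, buf, 7);
    return std::string(buf, 8);
}
```

In this sketch the forward loop rereads bytes it has just written, so the first byte is smeared across the buffer and run_overlapping(loop_copy) yields "aaaaaaaa", whereas run_overlapping(memmove_copy) yields "aabcdefg". The two results differ, which is why the overlap check in copy0 has to fall back to the loop rather than to memmove.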