https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90262
--- Comment #3 from Liu Hao <lh_mouse at 126 dot com> ---
This exists on x86_64 too: https://gcc.godbolt.org/z/z5MW4E4aE

```c
int xcopy(char* dst, const char* src)
{
    __builtin_memmove(dst, src, 32);
    return dst[0];
}
```

Clang generates this assembly:

```
xcopy(char*, char const*):              # @xcopy(char*, char const*)
        movups  xmm0, xmmword ptr [rsi]
        movups  xmm1, xmmword ptr [rsi + 16]
        movups  xmmword ptr [rdi], xmm0
        movups  xmmword ptr [rdi + 16], xmm1
        movsx   eax, byte ptr [rdi]
        ret
```

which comprises two XMM loads followed by two XMM stores. Because all 32 bytes are loaded before any of them is stored, it works as expected no matter whether `dst` and `src` point to overlapping regions.

GCC, however, generates a call to `memmove()` instead, which is rather inefficient for such a small amount of memory:

```
xcopy(char*, char const*):
        sub     rsp, 8
        mov     edx, 32
        call    memmove
        movsx   eax, BYTE PTR [rax]
        add     rsp, 8
        ret
```
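
The claim that "load everything, then store everything" is safe for overlapping regions can be checked with a small sketch in plain C. The helper names `move32` and `move32_matches_memmove` are made up for illustration and are not part of GCC, Clang, or this bug report. The temporary buffer stands in for the two XMM registers in Clang's output.

```c
#include <string.h>

/* Hypothetical helper (not from GCC or Clang): copies exactly 32 bytes.
   All 32 bytes are read into a temporary before any byte is written,
   mirroring the "two loads, then two stores" sequence, so the result
   should match memmove() even when dst and src overlap. */
static void move32(char *dst, const char *src)
{
    char tmp[32];
    memcpy(tmp, src, sizeof tmp);  /* load all 32 bytes first */
    memcpy(dst, tmp, sizeof tmp);  /* then store all 32 bytes */
}

/* Hypothetical check: returns 1 if move32() and memmove() leave the
   buffer in the same state for every pair of offsets, including all
   forward and backward overlaps, and 0 otherwise. */
static int move32_matches_memmove(void)
{
    for (int src_off = 0; src_off <= 32; ++src_off) {
        for (int dst_off = 0; dst_off <= 32; ++dst_off) {
            char a[64], b[64];
            for (int i = 0; i < 64; ++i)
                a[i] = b[i] = (char)i;

            move32(a + dst_off, a + src_off);
            memmove(b + dst_off, b + src_off, 32);

            if (memcmp(a, b, sizeof a) != 0)
                return 0;
        }
    }
    return 1;
}
```

Whether a compiler turns `move32` into the same four `movups` instructions depends on the compiler and its options; the sketch only illustrates why the inline expansion is semantically equivalent to `memmove()` for a fixed, small size.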