https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90262

--- Comment #3 from Liu Hao <lh_mouse at 126 dot com> ---
This exists on x86_64 too:  https://gcc.godbolt.org/z/z5MW4E4aE

```c
int xcopy(char* dst, const char* src)
  {
    __builtin_memmove(dst, src, 32);
    return dst[0];
  }

```


Clang generates this assembly:

```
xcopy(char*, char const*):                          # @xcopy(char*, char const*)
        movups  xmm0, xmmword ptr [rsi]
        movups  xmm1, xmmword ptr [rsi + 16]
        movups  xmmword ptr [rdi], xmm0
        movups  xmmword ptr [rdi + 16], xmm1
        movsx   eax, byte ptr [rdi]
        ret
```

This consists of two XMM loads followed by two XMM stores. Because both 16-byte halves of `src` are read before either is written to `dst`, it behaves correctly regardless of whether `dst` and `src` point to overlapping regions.
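To illustrate why the load-everything-first ordering matters, here is a portable sketch (not code from GCC or Clang) that emulates the two orderings with 16-byte temporaries. The function names are hypothetical, and the instruction comments only map each `memcpy` to the Clang sequence above by analogy.

```c
#include <string.h>

/* Hypothetical helper mirroring the Clang sequence: both 16-byte halves
   of src are read into temporaries before anything is written to dst,
   so overlapping dst/src ranges still give memmove semantics. */
void move32_loads_first(char *dst, const char *src)
{
    unsigned char lo[16], hi[16];
    memcpy(lo, src, 16);        /* ~ movups xmm0, [rsi]      */
    memcpy(hi, src + 16, 16);   /* ~ movups xmm1, [rsi + 16] */
    memcpy(dst, lo, 16);        /* ~ movups [rdi], xmm0      */
    memcpy(dst + 16, hi, 16);   /* ~ movups [rdi + 16], xmm1 */
}

/* Hypothetical counter-example: each load is immediately followed by
   its store. If dst == src + 16, the first store overwrites src[16..31]
   before that half has been read, so the result differs from memmove. */
void move32_interleaved(char *dst, const char *src)
{
    unsigned char tmp[16];
    memcpy(tmp, src, 16);
    memcpy(dst, tmp, 16);
    memcpy(tmp, src + 16, 16);  /* reads bytes already clobbered above */
    memcpy(dst + 16, tmp, 16);
}
```

With a 48-byte buffer and `dst = buf + 16`, `src = buf`, the first function matches `memmove`, while the second ends up repeating the first 16 bytes twice.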


GCC, however, emits a call to `memmove()` instead, which is rather inefficient for such a small, fixed-size copy:

```
xcopy(char*, char const*):
        sub     rsp, 8
        mov     edx, 32
        call    memmove
        movsx   eax, BYTE PTR [rax]
        add     rsp, 8
        ret
```
