https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104723

--- Comment #11 from cuilili <lili.cui at intel dot com> ---
(In reply to Jakub Jelinek from comment #10)

> And for the backend, the question is how big the penalty for the overlapping
> store is compared to doing multiple non-overlapping stores.  Say for those
> 49 bytes one could do one OI, one TI/V1TI and one QI load/store as opposed to
> one aligned and one misaligned OI load/store.
> 
> For say:
> void
> foo (void *p, void *q)
> {
>   __builtin_memcpy (p, q, 49);
> }
> we emit the 2 overlapping loads/stores for -mavx512f and 4 non-overlapping
> loads/stores with say -mavx2.

I executed both code sequences 100000 times on ICX and znver3 machines.

For ICX: 2 overlapping loads/stores are 3.5x faster than 4 non-overlapping
loads/stores.
For znver3: 2 overlapping loads/stores are 1.39x faster than 4 non-overlapping
loads/stores.

2 overlapping loads/stores:
------------------------------------
vmovdqu ymm0, YMMWORD PTR [rsi]
vmovdqu YMMWORD PTR [rdi], ymm0
vmovdqu ymm1, YMMWORD PTR [rsi+17]
vmovdqu YMMWORD PTR [rdi+17], ymm1

4 non-overlapping loads/stores:
------------------------------------
vmovdqu xmm0, XMMWORD PTR [rsi]
vmovdqu XMMWORD PTR [rdi], xmm0
vmovdqu xmm1, XMMWORD PTR [rsi+16]
vmovdqu XMMWORD PTR [rdi+16], xmm1
vmovdqu xmm2, XMMWORD PTR [rsi+32]
vmovdqu XMMWORD PTR [rdi+32], xmm2
movzx   eax, BYTE PTR [rsi+48]
mov     BYTE PTR [rdi+48], al
-----------------------------------
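
For reference, the offset 17 in the overlapping sequence is simply 49 - 32: the
second 32-byte access is placed so that it ends exactly at the last byte, and
bytes 17..31 are written twice. A minimal C sketch of that idea follows. The
helper name copy_two_overlapping and the CHUNK constant are hypothetical and
not taken from GCC; whether a compiler lowers each fixed-size memcpy to a
single vector move depends on the target flags and is only an assumption here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { CHUNK = 32 };  /* one 32-byte (ymm-sized) block; hypothetical name */

/* Hypothetical helper: copy n bytes, CHUNK <= n <= 2*CHUNK, using two
   CHUNK-sized block copies.  The second block starts at n - CHUNK, so it
   ends exactly at byte n and overlaps the first block by 2*CHUNK - n bytes.
   As with memcpy, src and dst must not overlap each other. */
static void copy_two_overlapping(void *dst, const void *src, size_t n)
{
    unsigned char head[CHUNK];
    unsigned char tail[CHUNK];
    size_t tail_off;

    assert(n >= CHUNK && n <= 2 * CHUNK);
    tail_off = n - CHUNK;              /* 17 for n == 49 */

    /* Two block loads ... */
    memcpy(head, src, CHUNK);
    memcpy(tail, (const unsigned char *)src + tail_off, CHUNK);

    /* ... and two block stores; the overlapping bytes get the same value
       from both stores, so the result is identical to a plain copy. */
    memcpy(dst, head, CHUNK);
    memcpy((unsigned char *)dst + tail_off, tail, CHUNK);
}
```

With n == 49 this performs the same accesses as the first sequence above
(offsets 0 and 17), as opposed to splitting the copy into 16 + 16 + 16 + 1
bytes as in the second sequence.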
