https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104723
--- Comment #11 from cuilili <lili.cui at intel dot com> ---
(In reply to Jakub Jelinek from comment #10)
> And for the backend, the question is how big the penalty for the overlapping
> store is compared to doing multiple non-overlapping stores.  Say for those
> 49 bytes one could do one OI, one TI/V1TI and one QI load/store as opposed to
> one aligned and one misaligned OI load/store.
>
> For say:
> void
> foo (void *p, void *q)
> {
>   __builtin_memcpy (p, q, 49);
> }
> we emit the 2 overlapping loads/stores for -mavx512f and 4 non-overlapping
> loads/stores with say -mavx2.

I executed both code sequences 100000 times on ICX and Znver3 machines.

For ICX: 2 overlapping loads/stores are 3.5x faster than 4 non-overlapping
loads/stores.
For Znver3: 2 overlapping loads/stores are 1.39x faster than 4 non-overlapping
loads/stores.

------------------------------------
        vmovdqu ymm0, YMMWORD PTR [rsi]
        vmovdqu YMMWORD PTR [rdi], ymm0
        vmovdqu ymm1, YMMWORD PTR [rsi+17]
        vmovdqu YMMWORD PTR [rdi+17], ymm1
------------------------------------
        vmovdqu xmm0, XMMWORD PTR [rsi]
        vmovdqu XMMWORD PTR [rdi], xmm0
        vmovdqu xmm1, XMMWORD PTR [rsi+16]
        vmovdqu XMMWORD PTR [rdi+16], xmm1
        vmovdqu xmm2, XMMWORD PTR [rsi+32]
        vmovdqu XMMWORD PTR [rdi+32], xmm2
        movzx   eax, BYTE PTR [rsi+48]
        mov     BYTE PTR [rdi+48], al
------------------------------------
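For readers who want to see the two strategies side by side at the source level, here is a minimal C sketch of both 49-byte copies. The function names copy49_overlapping and copy49_split are hypothetical and do not come from GCC or from this bug. The sketch only mirrors the byte ranges touched by the two instruction sequences (32 + 32 bytes overlapping at offset 17, versus 16 + 16 + 16 + 1 bytes). Whether a compiler lowers each fixed-size memcpy to a single vector load/store depends on the target flags and optimization level, so this says nothing about the timings reported above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not part of GCC: copy exactly 49 bytes with two
   32-byte chunks.  The second chunk starts at offset 17, so bytes 17..31
   are copied twice (15 bytes of overlap).  As with memcpy, dst and src
   are assumed not to overlap each other. */
void copy49_overlapping(void *dst, const void *src)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    unsigned char chunk[32];

    assert(dst != NULL && src != NULL);

    memcpy(chunk, s, 32);        /* load  bytes  0..31 */
    memcpy(d, chunk, 32);        /* store bytes  0..31 */
    memcpy(chunk, s + 17, 32);   /* load  bytes 17..48 */
    memcpy(d + 17, chunk, 32);   /* store bytes 17..48 */
}

/* Hypothetical helper, not part of GCC: copy exactly 49 bytes with three
   non-overlapping 16-byte chunks followed by a single trailing byte. */
void copy49_split(void *dst, const void *src)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    unsigned char chunk[16];

    assert(dst != NULL && src != NULL);

    for (int off = 0; off < 48; off += 16) {
        memcpy(chunk, s + off, 16);   /* load  16 bytes */
        memcpy(d + off, chunk, 16);   /* store 16 bytes */
    }
    d[48] = s[48];                    /* final byte */
}
```

Both helpers write bytes 0..48 of the destination and nothing beyond, so they are interchangeable in terms of result; the only difference is how many loads/stores are issued and whether their ranges overlap.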