> From: Stephen Hemminger [mailto:step...@networkplumber.org]
> Sent: Sunday, 3 March 2024 06.58
>
> On Sat, 2 Mar 2024 21:40:03 -0800
> Stephen Hemminger <step...@networkplumber.org> wrote:
>
> > On Sun, 3 Mar 2024 00:48:12 +0100
> > Morten Brørup <m...@smartsharesystems.com> wrote:
> >
> > > When the rte_memcpy() size is 16, the same 16 bytes are copied twice.
> > > In the case where the size is known to be 16 at build time, omit the
> > > duplicate copy.
> > >
> > > Reduced the amount of effectively copy-pasted code by using #ifdef
> > > inside functions instead of outside functions.
> > >
> > > Suggested-by: Stephen Hemminger <step...@networkplumber.org>
> > > Signed-off-by: Morten Brørup <m...@smartsharesystems.com>
> > > ---
> >
> > Looks good, let me see how it looks in Godbolt vs GCC.
> >
> > One other issue is that for the non-constant case, rte_memcpy has an
> > excessively large inline code footprint. That is one of the reasons GCC
> > doesn't always inline. For > 128 bytes, it really should be a function.
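For context, the duplicate copy mentioned in the commit message comes from the
16..32 byte path, which stores the first 16 bytes and the last 16 bytes of the
range; when the size is exactly 16, those two stores hit the same bytes. Below
is a minimal sketch of the idea, assuming SSE intrinsics and a GCC/Clang-style
__builtin_constant_p() check; the function name copy16_to_32 is made up for
illustration and is not the actual lib/eal code.

#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h> /* SSE2: __m128i, _mm_loadu_si128(), _mm_storeu_si128() */

/* Copy 16..32 bytes: store the first 16 bytes and the last 16 bytes; the two
 * ranges overlap, so the whole span is covered. When n == 16 both stores
 * touch exactly the same bytes, so the second one is redundant.
 */
static inline void
copy16_to_32(uint8_t *dst, const uint8_t *src, size_t n)
{
	_mm_storeu_si128((__m128i *)dst,
			_mm_loadu_si128((const __m128i *)src));

	/* If the size is known to be 16 at build time, omit the duplicate copy. */
	if (!(__builtin_constant_p(n) && n == 16))
		_mm_storeu_si128((__m128i *)(dst + n - 16),
				_mm_loadu_si128((const __m128i *)(src + n - 16)));
}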
Yes, the code footprint is significant for the non-constant case. I suppose
Intel weighed the costs and benefits when they developed this. Or perhaps they
just wanted a showcase for their new and shiny vector instructions. ;-)

Inlining might provide significant branch prediction benefits in cases where
the size is not build-time constant, but run-time constant.

> For sizes of 4, 6, 8, 16, 32, 64, up to 128, the GCC inline and rte_memcpy
> match. For size 128, it looks like GCC is simpler.
>
> rte_copy_addr:
>         vmovdqu ymm0, YMMWORD PTR [rsi]
>         vextracti128    XMMWORD PTR [rdi+16], ymm0, 0x1
>         vmovdqu XMMWORD PTR [rdi], xmm0
>         vmovdqu ymm0, YMMWORD PTR [rsi+32]
>         vextracti128    XMMWORD PTR [rdi+48], ymm0, 0x1
>         vmovdqu XMMWORD PTR [rdi+32], xmm0
>         vmovdqu ymm0, YMMWORD PTR [rsi+64]
>         vextracti128    XMMWORD PTR [rdi+80], ymm0, 0x1
>         vmovdqu XMMWORD PTR [rdi+64], xmm0
>         vmovdqu ymm0, YMMWORD PTR [rsi+96]
>         vextracti128    XMMWORD PTR [rdi+112], ymm0, 0x1
>         vmovdqu XMMWORD PTR [rdi+96], xmm0
>         vzeroupper
>         ret

Interesting. Playing around with Godbolt revealed that GCC versions < 11
generate the above from rte_memcpy, whereas GCC versions >= 11 do it
correctly. Clang doesn't have this issue. I guess that's why the original code
treated AVX as SSE. Fixed in v2.

> copy_addr:
>         vmovdqu ymm0, YMMWORD PTR [rsi]
>         vmovdqu YMMWORD PTR [rdi], ymm0
>         vmovdqu ymm1, YMMWORD PTR [rsi+32]
>         vmovdqu YMMWORD PTR [rdi+32], ymm1
>         vmovdqu ymm2, YMMWORD PTR [rsi+64]
>         vmovdqu YMMWORD PTR [rdi+64], ymm2
>         vmovdqu ymm3, YMMWORD PTR [rsi+96]
>         vmovdqu YMMWORD PTR [rdi+96], ymm3
>         vzeroupper
>         ret
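For reference, the kind of source that produces both listings is simply a
sequence of four unaligned 256-bit loads and stores; a sketch using AVX
intrinsics follows (the function name copy128 is illustrative, not the lib/eal
code). Older GCC's generic tuning splits each unaligned 256-bit store into a
128-bit store plus vextracti128, which appears to be what rte_copy_addr above
shows (the -mavx256-split-unaligned-store behaviour); GCC >= 11 and Clang keep
the full-width stores, as in copy_addr.

#include <stdint.h>
#include <immintrin.h> /* AVX: __m256i, _mm256_loadu_si256(), _mm256_storeu_si256() */

/* Copy exactly 128 bytes with four unaligned 256-bit loads and stores. */
static inline void
copy128(uint8_t *dst, const uint8_t *src)
{
	__m256i ymm0 = _mm256_loadu_si256((const __m256i *)(src + 0 * 32));
	__m256i ymm1 = _mm256_loadu_si256((const __m256i *)(src + 1 * 32));
	__m256i ymm2 = _mm256_loadu_si256((const __m256i *)(src + 2 * 32));
	__m256i ymm3 = _mm256_loadu_si256((const __m256i *)(src + 3 * 32));

	_mm256_storeu_si256((__m256i *)(dst + 0 * 32), ymm0);
	_mm256_storeu_si256((__m256i *)(dst + 1 * 32), ymm1);
	_mm256_storeu_si256((__m256i *)(dst + 2 * 32), ymm2);
	_mm256_storeu_si256((__m256i *)(dst + 3 * 32), ymm3);
}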