> From: Stephen Hemminger [mailto:step...@networkplumber.org]
> Sent: Sunday, 3 March 2024 06.58
> 
> On Sat, 2 Mar 2024 21:40:03 -0800
> Stephen Hemminger <step...@networkplumber.org> wrote:
> 
> > On Sun,  3 Mar 2024 00:48:12 +0100
> > Morten Brørup <m...@smartsharesystems.com> wrote:
> >
> > > When the rte_memcpy() size is 16, the same 16 bytes are copied
> > > twice.
> > > In the case where the size is known to be 16 at build time, omit the
> > > duplicate copy.
> > >
> > > Reduced the amount of effectively copy-pasted code by using #ifdef
> > > inside functions instead of outside functions.
> > >
> > > Suggested-by: Stephen Hemminger <step...@networkplumber.org>
> > > Signed-off-by: Morten Brørup <m...@smartsharesystems.com>
> > > ---
> >
> > Looks good, let me see how it looks in Godbolt vs GCC.
> >
> > One other issue is that for the non-constant case, rte_memcpy has an
> > excessively large inline code footprint. That is one of the reasons
> > GCC doesn't always inline. For > 128 bytes, it really should be a
> > function.

Yes, the code footprint is significant for the non-constant case.
I suppose Intel weighed the costs and benefits when they developed this.
Or perhaps they just wanted a showcase for their new and shiny vector 
instructions. ;-)

Inlining might provide significant branch prediction benefits in cases where 
the size is not build-time constant, but run-time constant.
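
As a rough illustration of the two points above (the build-time constant
size 16 special case from the patch, and making large non-constant copies
a real function call to keep the inline footprint small), here is a minimal
sketch. It is not the actual rte_memcpy() code; copy_inline(), mov16() and
copy_large() are made-up names:

#include <stdint.h>
#include <string.h>
#include <immintrin.h>

/* Hypothetical out-of-line function for large copies. */
void copy_large(void *dst, const void *src, size_t n);

/* One 16-byte copy: a single vmovdqu load/store pair. */
static inline void
mov16(uint8_t *dst, const uint8_t *src)
{
        __m128i xmm0 = _mm_loadu_si128((const __m128i *)src);
        _mm_storeu_si128((__m128i *)dst, xmm0);
}

static inline void *
copy_inline(void *dst, const void *src, size_t n)
{
        if (__builtin_constant_p(n) && n == 16) {
                mov16(dst, src);          /* size known at build time: copy once */
                return dst;
        }
        if (n > 128) {
                copy_large(dst, src, n);  /* real call, small inline footprint */
                return dst;
        }
        return memcpy(dst, src, n);       /* let the compiler expand small sizes */
}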

> 
> For sizes of 4, 6, 8, 16, 32, 64, up to 128, GCC inline and rte_memcpy match.
> 
> For size 128, it looks like GCC is simpler.
> 
> rte_copy_addr:
>         vmovdqu ymm0, YMMWORD PTR [rsi]
>         vextracti128    XMMWORD PTR [rdi+16], ymm0, 0x1
>         vmovdqu XMMWORD PTR [rdi], xmm0
>         vmovdqu ymm0, YMMWORD PTR [rsi+32]
>         vextracti128    XMMWORD PTR [rdi+48], ymm0, 0x1
>         vmovdqu XMMWORD PTR [rdi+32], xmm0
>         vmovdqu ymm0, YMMWORD PTR [rsi+64]
>         vextracti128    XMMWORD PTR [rdi+80], ymm0, 0x1
>         vmovdqu XMMWORD PTR [rdi+64], xmm0
>         vmovdqu ymm0, YMMWORD PTR [rsi+96]
>         vextracti128    XMMWORD PTR [rdi+112], ymm0, 0x1
>         vmovdqu XMMWORD PTR [rdi+96], xmm0
>         vzeroupper
>         ret

Interesting. Playing around with Godbolt revealed that GCC version < 11 creates 
the above from rte_memcpy, whereas GCC version >= 11 does it correctly. Clang 
doesn't have this issue.
I guess that's why the original code treated AVX as SSE.
Fixed in v2.
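
For reference, a minimal sketch of the kind of 32-byte AVX copy helper
involved, modeled on DPDK's rte_mov32() (the name mov32 here is
illustrative). Compiled with -mavx2, GCC < 11 splits the 256-bit store
into vextracti128 + vmovdqu, as in rte_copy_addr above, while GCC >= 11
and Clang keep a single vmovdqu ymm store:

#include <immintrin.h>
#include <stdint.h>

/* Copy 32 bytes with one 256-bit unaligned load and one store. */
static inline void
mov32(uint8_t *dst, const uint8_t *src)
{
        __m256i ymm0 = _mm256_loadu_si256((const __m256i *)src);
        _mm256_storeu_si256((__m256i *)dst, ymm0);
}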

> copy_addr:
>         vmovdqu ymm0, YMMWORD PTR [rsi]
>         vmovdqu YMMWORD PTR [rdi], ymm0
>         vmovdqu ymm1, YMMWORD PTR [rsi+32]
>         vmovdqu YMMWORD PTR [rdi+32], ymm1
>         vmovdqu ymm2, YMMWORD PTR [rsi+64]
>         vmovdqu YMMWORD PTR [rdi+64], ymm2
>         vmovdqu ymm3, YMMWORD PTR [rsi+96]
>         vmovdqu YMMWORD PTR [rdi+96], ymm3
>         vzeroupper
>         ret
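
For completeness, a sketch of the kind of Godbolt harness that could
produce the two listings above; the struct, include path and build flags
(something like gcc -O3 -mavx2) are assumptions:

#include <string.h>
#include <rte_memcpy.h>   /* DPDK header; assumed to be on the include path */

struct addr { char b[128]; };   /* hypothetical 128-byte payload */

/* DPDK's inline copy, as in the rte_copy_addr listing. */
void rte_copy_addr(struct addr *dst, const struct addr *src)
{
        rte_memcpy(dst, src, sizeof(*dst));
}

/* The compiler's own memcpy expansion, as in the copy_addr listing. */
void copy_addr(struct addr *dst, const struct addr *src)
{
        memcpy(dst, src, sizeof(*dst));
}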
