Re: [RFC v2] non-temporal memcpy

Konstantin Ananyev Fri, 29 Jul 2022 02:23:41 -0700

28/07/2022 11:51, Morten Brørup wrote:

From: Stanisław Kardach [mailto:k...@semihalf.com]
Sent: Thursday, 28 July 2022 00.02

On Wed, 27 Jul 2022, 21:53 Honnappa Nagarahalli, <honnappa.nagaraha...@arm.com> wrote:

Yes, x86 needs 16B alignment for NT load/stores. But that's supposed
to be an arch-specific limitation that we probably want to hide, no?


Correct. However, optional hints for optimization purposes will be
available. And it is up to the architecture specific implementation
to make the best use of these hints, or just ignore them.

Inside the function, we can check the alignment of both src and dst
and decide whether it should use NT load/store instructions or just
do a normal copy.
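For illustration, the alignment check described above could look roughly like the sketch below. It is not the implementation proposed in the RFC: the name sketch_memcpy_nt() is hypothetical, only x86 with SSE2 is handled, only the stores are non-temporal (an NT load would additionally need SSE4.1), and the length is assumed to be a multiple of 16 for the NT path. Anything else falls back to memcpy().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_load_si128, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical helper, not part of DPDK or of the RFC. */
static inline void *
sketch_memcpy_nt(void *dst, const void *src, size_t len)
{
#if defined(__SSE2__)
	/* Assumption: take the NT path only if src, dst and len are all
	 * 16-byte aligned; everything else uses the normal copy below. */
	if ((((uintptr_t)dst | (uintptr_t)src | (uintptr_t)len) & 15) == 0) {
		__m128i *d = (__m128i *)dst;
		const __m128i *s = (const __m128i *)src;
		size_t i;

		for (i = 0; i < len / 16; i++)
			_mm_stream_si128(&d[i], _mm_load_si128(&s[i]));
		/* Order the NT stores before any subsequent stores. */
		_mm_sfence();
		return dst;
	}
#endif
	/* Best-effort fallback: normal (temporal) copy. */
	return memcpy(dst, src, len);
}
```

A caller would use it exactly like memcpy(), e.g. sketch_memcpy_nt(dst, src, 64), and gets a correct copy whether or not the NT path was taken.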

IMO, the normal copy should not be done by this API under any
conditions. Why not let the application call memcpy/rte_memcpy
when the NT copy is not applicable? It helps the programmer to
understand and debug the issues much more easily.


Yes, the programmer must choose between normal memcpy() and non-
temporal rte_memcpy_nt(). I am offering new functions, not modifying
memcpy() or rte_memcpy().

And rte_memcpy_nt() will silently fall back to normal memcpy() if
non-temporal copying is unavailable, e.g. on POWER and RISC-V
architectures, which don't have NT load/store instructions.

I am talking about a scenario where the application is being ported
between architectures. Not everyone knows about the capabilities of
the architecture. It is better to indicate upfront (ex: compilation
failures) that a certain feature is not supported on the target
architecture rather than the user having to discover through painful
debugging.


I'm considering rte_memcpy_nt() a performance optimized variant of
memcpy(), where the performance gain is less cache pollution. Thus, silent
fallback to memcpy() should suffice.

Other architecture differences also affect DPDK performance; the inability to
perform non-temporal load/store is just one more on the (undocumented) list.

Failing at build time if NT load/store is unavailable on the architecture would
prevent the function from being used by other DPDK libraries, e.g. by the
rte_pktmbuf_copy() function used by the pdump library.

The other libraries in DPDK need to provide NT versions, as the libraries need
to cater for non-NT use cases as well, i.e. we cannot hide an NT copy under
the rte_pktmbuf_copy() API; we need to have rte_pktmbuf_copy_nt().


Yes, it was my intention to provide rte_pktmbuf_copy_nt() as a new function. 
Some uses of rte_pktmbuf_copy() may benefit from having the copied data in 
cache.

But there is a ripple effect:

It is also my intention to improve the pdump and pcapng libraries by using 
rte_pktmbuf_copy_nt() instead of rte_pktmbuf_copy(). These would normally 
benefit from not polluting the cache.

So the underlying rte_memcpy_nt() function needs a fallback if the architecture 
doesn't support non-temporal memory copy, now that the pdump and pcapng 
libraries depend on it.

Alternatively, if rte_memcpy_nt() has no fallback to standard memcpy(), and the
build instead fails when a developer tries to use rte_memcpy_nt() on an
architecture without NT support, we would have to modify e.g. pdump_copy() like
this:

+ #ifdef RTE_CPUFLAG_xxx
   p = rte_pktmbuf_copy_nt(pkts[i], mp, 0, cbs->snaplen);
+ #else
   p = rte_pktmbuf_copy(pkts[i], mp, 0, cbs->snaplen);
+ #endif

Personally, I prefer the fallback inside rte_memcpy_nt(), rather than having to 
check for it everywhere.


+1 here.
If we are going to introduce rte_memcpy_nt(), I think it had better be
a 'best effort' approach: if it can do NT, great; if not,
just fall back to normal copy.


The developer using the pdump library will not know if the fallback is inside 
rte_memcpy_nt() or outside using #ifdef. It is still hidden inside pdump_copy().


I don't oppose your idea, I just don't have any idea how to reasonably
implement it. So I'm trying to defend why it is not important.

I am suggesting that the applications could implement #ifdef depending on the
architecture.
I assume that it would be a pre-processor flag defined (or not) on the DPDK
side, with the application doing #ifdef based on it?

Another way to achieve this would be to use #warning directive (see [1]) inside 
DPDK when the generic fallback is taken.
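As a rough sketch of both suggestions (a pre-processor flag plus a #warning on the generic fallback), something along these lines could work. All macro and function names here are hypothetical and do not exist in DPDK; the sketch also assumes, for simplicity, that x86 with SSE2 is the only NT-capable target, and makes the warning opt-in so that builds using -Werror are not broken by default.

```c
/* Hypothetical capability flag, defined (or not) on the library side. */
#if defined(__SSE2__)
#define SKETCH_HAVE_NT_COPY 1
#else
#define SKETCH_HAVE_NT_COPY 0
/* Opt-in diagnostic for developers porting between architectures. */
#if defined(SKETCH_WARN_ON_NT_FALLBACK)
#warning "non-temporal copy unavailable on this target; using memcpy() fallback"
#endif
#endif

/* Same information as a function, for code that prefers if() over #if. */
static inline int
sketch_nt_copy_available(void)
{
	return SKETCH_HAVE_NT_COPY;
}
```

An application that must not silently lose the NT behaviour could then test #if SKETCH_HAVE_NT_COPY itself, or build with -DSKETCH_WARN_ON_NT_FALLBACK to be told at compile time that the generic path was chosen.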

Also, isn't the argument about a memcpy_nt capability query a more general one,
namely how would/should an application query DPDK's capabilities at run time or
at compile time?


Good point! You just solved this part of the puzzle, Stanislaw:

The ability to perform non-temporal memory load/store is a CPU feature.

Applications that need to know if non-temporal memory access is available 
should check for the appropriate CPU feature flag, e.g. RTE_CPUFLAG_SSE4_1 on 
x86 architecture. This works both at runtime and at compile time.
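To make the two kinds of check concrete, here is a self-contained sketch that does not depend on DPDK. In DPDK itself the run-time query would go through rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1); the sketch below instead uses the GCC/Clang builtin __builtin_cpu_supports() and the compiler-defined __SSE4_1__ macro as stand-ins, and the function names are hypothetical.

```c
#include <stdbool.h>

/* Compile-time view: was the code built with SSE4.1 enabled? */
static inline bool
sketch_nt_load_compiled_in(void)
{
#if defined(__SSE4_1__)
	return true;
#else
	return false;
#endif
}

/* Run-time view: does the CPU we are running on report SSE4.1?
 * Assumption: on non-x86 targets or other compilers, report false. */
static inline bool
sketch_nt_load_at_runtime(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && \
	(defined(__GNUC__) || defined(__clang__))
	return __builtin_cpu_supports("sse4.1") != 0;
#else
	return false;
#endif
}
```

The two answers can differ: a binary built for a baseline x86-64 target may report false at compile time and true at run time, which is why an application may want to check both.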


[1] https://gcc.gnu.org/onlinedocs/cpp/Diagnostics.html
