RE: [RFC v2] non-temporal memcpy

Morten Brørup Thu, 28 Jul 2022 03:51:44 -0700

From: Stanisław Kardach [mailto:k...@semihalf.com] 
Sent: Thursday, 28 July 2022 00.02
> On Wed, 27 Jul 2022, 21:53 Honnappa Nagarahalli, 
> <honnappa.nagaraha...@arm.com> wrote:
>
> > > > > > Yes, x86 needs 16B alignment for NT load/stores But that's supposed
> > > > > > to be arch specific limitation, that we probably want to hide, no?
> > > >
> > > > Correct. However, optional hints for optimization purposes will be
> > > > available. And it is up to the architecture specific implementation to
> > > > make the best use of these hints, or just ignore them.
> > > >
> > > > > > Inside the function can check alignment of both src and dst and
> > > > > > decide should it use NT load/store instructions or just do normal
> > > > > > copy.
> > > > > IMO, the normal copy should not be done by this API under any
> > > > > conditions. Why not let the application call memcpy/rte_memcpy
> > > > > when the NT copy is not applicable? It helps the programmer to
> > > > > understand and debug the issues much easier.
> > > >
> > > > Yes, the programmer must choose between normal memcpy() and non-
> > > > temporal rte_memcpy_nt(). I am offering new functions, not modifying
> > > > memcpy() or rte_memcpy().
> > > >
> > > > And rte_memcpy_nt() will silently fall back to normal memcpy() if
> > > > non-temporal copying is unavailable, e.g. on POWER and RISC-V
> > > > architectures, which don't have NT load/store instructions.
> > > I am talking about a scenario where the application is being ported
> > > between architectures. Not everyone knows about the capabilities of
> > > the architecture. It is better to indicate upfront (ex: compilation
> > > failures) that a certain feature is not supported on the target
> > > architecture rather than the user having to discover through painful
> > > debugging.
> > 
> > I'm considering rte_memcpy_nt() a performance optimized variant of
> > memcpy(), where the performance gain is less cache pollution. Thus, silent
> > fallback to memcpy() should suffice.
> > 
> > Other architecture differences also affect DPDK performance; the inability
> > to perform non-temporal load/store just one more to the (undocumented) list.
> > 
> > Failing at build time if NT load/store is unavailable by the architecture
> > would prevent the function from being used by other DPDK libraries, e.g. by
> > the rte_pktmbuf_copy() function used by the pdump library.
> The other libraries in DPDK need to provide NT versions as the libraries need 
> to cater for not-NT use cases as well. i.e. we cannot hide a NT copy under 
> rte_pktmbuf_copy() API, we need to have rte_pktmbuf_copy_nt()


Yes, it was my intention to provide rte_pktmbuf_copy_nt() as a new function. 
Some uses of rte_pktmbuf_copy() may benefit from having the copied data in 
cache.

But there is a ripple effect:

It is also my intention to improve the pdump and pcapng libraries by using 
rte_pktmbuf_copy_nt() instead of rte_pktmbuf_copy(). These would normally 
benefit from not polluting the cache.

So the underlying rte_memcpy_nt() function needs a fallback if the architecture 
doesn't support non-temporal memory copy, now that the pdump and pcapng 
libraries depend on it.

Alternatively, if rte_memcpy_nt() has no fallback to standard memcpy(), and the 
build instead fails wherever rte_memcpy_nt() is used on an architecture without 
non-temporal load/store, we would have to modify e.g. pdump_copy() like this:

+ #ifdef RTE_CPUFLAG_xxx
  p = rte_pktmbuf_copy_nt(pkts[i], mp, 0, cbs->snaplen);
+ #else
  p = rte_pktmbuf_copy(pkts[i], mp, 0, cbs->snaplen);
+ #endif

Personally, I prefer the fallback inside rte_memcpy_nt(), rather than having to 
check for it everywhere.

The developer using the pdump library will not know if the fallback is inside 
rte_memcpy_nt() or outside using #ifdef. It is still hidden inside pdump_copy().
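
To make the "fallback inside the function" idea concrete, here is a minimal 
sketch. It is not the proposed rte_memcpy_nt() implementation: the name 
memcpy_nt_sketch() is hypothetical, it assumes an x86 build where the compiler 
defines __SSE2__, and it only uses non-temporal stores (the loads are ordinary 
ones). On any other architecture, or when the destination is not 16-byte 
aligned or the length is not a multiple of 16, it simply calls memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical helper for illustration only; not a DPDK function. */
static void *memcpy_nt_sketch(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);
#if defined(__SSE2__)
    /* NT stores need a 16-byte aligned destination; copy whole blocks only. */
    if (((uintptr_t)dst & 15) == 0 && (len & 15) == 0) {
        const uint8_t *s = src;
        uint8_t *d = dst;

        for (size_t i = 0; i < len; i += 16) {
            /* Ordinary (unaligned-capable) load, non-temporal store. */
            __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
            _mm_stream_si128((__m128i *)(d + i), v);
        }
        _mm_sfence(); /* Order the NT stores before later stores. */
        return dst;
    }
#endif
    /* Silent fallback: no NT support compiled in, or unsuitable arguments. */
    return memcpy(dst, src, len);
}
```

A caller such as pdump_copy() would then need no #ifdef at all; it gets the 
non-temporal path where the build supports it and a plain copy elsewhere.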

> 
> > 
> > I don't oppose to your idea, I just don't have any idea how to reasonably
> > implement it. So I'm trying to defend why it is not important.
> I am suggesting that the applications could implement #ifdef depending on the 
> architecture.
> I assume that it would be a pre-processor flag defined (or not) on DPDK side 
> and application doing #ifdef based on it?
> 
> Another way to achieve this would be to use #warning directive (see [1]) 
> inside DPDK when the generic fallback is taken.
> 
> Also isn't the argument on memcpy_nt capability query not a more general one, 
> that is how would/should application query DPDK's capabilities when run or 
> compiled?

Good point! You just solved this part of the puzzle, Stanisław:

The ability to perform non-temporal memory load/store is a CPU feature.

Applications that need to know if non-temporal memory access is available 
should check for the appropriate CPU feature flag, e.g. RTE_CPUFLAG_SSE4_1 on 
x86 architecture. This works both at runtime and at compile time.
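
As a rough illustration of the two kinds of check: inside DPDK the runtime 
query would go through rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1). The sketch 
below avoids the DPDK headers so it can be compiled on its own, and uses the 
GCC/Clang builtin __builtin_cpu_supports() and the compiler's __SSE4_1__ macro 
as stand-ins. Those stand-ins and the nt_load_* names are assumptions made for 
the example, not part of the proposal.

```c
#include <assert.h>
#include <stdbool.h>

/* Compile-time check: was this translation unit built for SSE4.1? */
#if defined(__SSE4_1__)
#define NT_LOAD_COMPILED_IN 1
#else
#define NT_LOAD_COMPILED_IN 0
#endif

/* Runtime check; hypothetical helper, reports false on non-x86 builds. */
static bool nt_load_available_at_runtime(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse4.1") != 0;
#else
    return false;
#endif
}

/* Combine both checks; hypothetical helper. */
static bool nt_load_usable(void)
{
    bool rt = nt_load_available_at_runtime();

    /* A build that targets SSE4.1 is expected to run on a CPU that has it. */
    assert(!NT_LOAD_COMPILED_IN || rt);
    return rt;
}
```

An application that must know could branch on nt_load_usable() once at 
startup, or on NT_LOAD_COMPILED_IN in an #if, and otherwise rely on the 
fallback.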

> 
> [1] https://gcc.gnu.org/onlinedocs/cpp/Diagnostics.html
