From: Stanisław Kardach [mailto:k...@semihalf.com]
Sent: Thursday, 28 July 2022 00.02

> On Wed, 27 Jul 2022, 21:53 Honnappa Nagarahalli,
> <honnappa.nagaraha...@arm.com> wrote:
>
> > > > > > Yes, x86 needs 16B alignment for NT load/stores But that's
> > > > > > supposed to be arch specific limitation, that we probably
> > > > > > want to hide, no?
> > > >
> > > > Correct. However, optional hints for optimization purposes will
> > > > be available. And it is up to the architecture specific
> > > > implementation to make the best use of these hints, or just
> > > > ignore them.
> > > >
> > > > > > Inside the function can check alignment of both src and dst
> > > > > > and decide should it use NT load/store instructions or just
> > > > > > do normal copy.
> > > > > IMO, the normal copy should not be done by this API under any
> > > > > conditions. Why not let the application call memcpy/rte_memcpy
> > > > > when the NT copy is not applicable? It helps the programmer to
> > > > > understand and debug the issues much easier.
> > > >
> > > > Yes, the programmer must choose between normal memcpy() and
> > > > non-temporal rte_memcpy_nt(). I am offering new functions, not
> > > > modifying memcpy() or rte_memcpy().
> > > >
> > > > And rte_memcpy_nt() will silently fall back to normal memcpy()
> > > > if non-temporal copying is unavailable, e.g. on POWER and RISC-V
> > > > architectures, which don't have NT load/store instructions.
> > > I am talking about a scenario where the application is being ported
> > > between architectures. Not everyone knows about the capabilities of
> > > the architecture. It is better to indicate upfront (ex: compilation
> > > failures) that a certain feature is not supported on the target
> > > architecture rather than the user having to discover through painful
> > > debugging.
> >
> > I'm considering rte_memcpy_nt() a performance optimized variant of
> > memcpy(), where the performance gain is less cache pollution. Thus,
> > silent fallback to memcpy() should suffice.
> >
> > Other architecture differences also affect DPDK performance; the
> > inability to perform non-temporal load/store just one more to the
> > (undocumented) list.
> >
> > Failing at build time if NT load/store is unavailable by the
> > architecture would prevent the function from being used by other DPDK
> > libraries, e.g. by the rte_pktmbuf_copy() function used by the pdump
> > library.
> The other libraries in DPDK need to provide NT versions as the libraries
> need to cater for not-NT use cases as well. i.e. we cannot hide a NT copy
> under rte_pktmbuf_copy() API, we need to have rte_pktmbuf_copy_nt()

Yes, it was my intention to provide rte_pktmbuf_copy_nt() as a new
function. Some uses of rte_pktmbuf_copy() may benefit from having the
copied data in cache.

But there is a ripple effect: It is also my intention to improve the pdump
and pcapng libraries by using rte_pktmbuf_copy_nt() instead of
rte_pktmbuf_copy(). These would normally benefit from not polluting the
cache. So the underlying rte_memcpy_nt() function needs a fallback if the
architecture doesn't support non-temporal memory copy, now that the pdump
and pcapng libraries depend on it.

Alternatively, if rte_memcpy_nt() has no fallback to standard memcpy(),
and the build instead fails whenever rte_memcpy_nt() is used on an
architecture without non-temporal support, we would have to modify e.g.
pdump_copy() like this:

+ #ifdef RTE_CPUFLAG_xxx
  p = rte_pktmbuf_copy_nt(pkts[i], mp, 0, cbs->snaplen);
+ #else
  p = rte_pktmbuf_copy(pkts[i], mp, 0, cbs->snaplen);
+ #endif

Personally, I prefer the fallback inside rte_memcpy_nt(), rather than
having to check for it everywhere.

The developer using the pdump library will not know if the fallback is
inside rte_memcpy_nt() or outside using #ifdef. It is still hidden inside
pdump_copy().

> > I don't oppose to your idea, I just don't have any idea how to
> > reasonably implement it. So I'm trying to defend why it is not
> > important.
> I am suggesting that the applications could implement #ifdef depending on
> the architecture.
>
> I assume that it would be a pre-processor flag defined (or not) on DPDK
> side and application doing #ifdef based on it?
>
> Another way to achieve this would be to use #warning directive (see [1])
> inside DPDK when the generic fallback is taken.
>
> Also isn't the argument on memcpy_nt capability query not a more general
> one, that is how would/should application query DPDK's capabilities when
> run or compiled?

Good point! You just solved this part of the puzzle, Stanislaw:

The ability to perform non-temporal memory load/store is a CPU feature.
Applications that need to know if non-temporal memory access is available
should check for the appropriate CPU feature flag, e.g.
RTE_CPUFLAG_SSE4_1 on x86 architecture. This works both at runtime and at
compile time.

> [1] https://gcc.gnu.org/onlinedocs/cpp/Diagnostics.html
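
To illustrate the "fallback inside the function" option argued for above, here is a minimal sketch in plain C. It is not the proposed rte_memcpy_nt() and does not use any DPDK API: the name copy_nt_or_fallback() and the SKETCH_HAVE_NT macro are made up for this example, and the compiler's __SSE4_1__ macro stands in for the CPU feature flag. When SSE4.1 is enabled at build time and both pointers are 16-byte aligned, it uses non-temporal loads and stores; in every other case it silently falls back to memcpy(), so callers need no #ifdef of their own.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: the compiler-provided __SSE4_1__ macro stands in for the
 * CPU feature flag mentioned in the mail. */
#if defined(__SSE4_1__)
#include <smmintrin.h> /* _mm_stream_load_si128(), _mm_stream_si128() */
#define SKETCH_HAVE_NT 1
#else
#define SKETCH_HAVE_NT 0
#endif

/* Hypothetical helper for illustration only; not a DPDK function. */
static void *copy_nt_or_fallback(void *dst, const void *src, size_t len)
{
#if SKETCH_HAVE_NT
	/* The x86 NT instructions used here need 16-byte aligned addresses. */
	if ((((uintptr_t)dst | (uintptr_t)src) & 15) == 0) {
		__m128i *d = (__m128i *)dst;
		__m128i *s = (__m128i *)(uintptr_t)src;
		size_t blocks = len / 16;
		size_t i;

		for (i = 0; i < blocks; i++)
			_mm_stream_si128(&d[i], _mm_stream_load_si128(&s[i]));
		/* Order the NT stores before any later stores. */
		_mm_sfence();
		/* Copy the tail (fewer than 16 bytes) the ordinary way. */
		memcpy((unsigned char *)dst + blocks * 16,
		       (const unsigned char *)src + blocks * 16, len % 16);
		return dst;
	}
#endif
	/* No NT support at build time, or unaligned buffers: normal copy. */
	return memcpy(dst, src, len);
}
```

A caller such as pdump_copy() would then call the helper unconditionally, whichever path is taken underneath. An application that really needs to know whether NT access is available would query the CPU feature flag separately, as described above.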