From: Aman Kumar <aman.ku...@vvdntech.in>
Sent: Wednesday, October 27, 2021 2:35 PM
To: Ananyev, Konstantin <konstantin.anan...@intel.com>
Cc: Van Haaren, Harry <harry.van.haa...@intel.com>; mattias.ronnblom <mattias.ronnb...@ericsson.com>; Thomas Monjalon <tho...@monjalon.net>; dev@dpdk.org; viachesl...@nvidia.com; Burakov, Anatoly <anatoly.bura...@intel.com>; Song, Keesang <keesang.s...@amd.com>; jerinjac...@gmail.com; Richardson, Bruce <bruce.richard...@intel.com>; honnappa.nagaraha...@arm.com; Ruifeng Wang <ruifeng.w...@arm.com>; David Christensen <d...@linux.vnet.ibm.com>; david.march...@redhat.com; step...@networkplumber.org
Subject: Re: [dpdk-dev] [PATCH v4 2/2] lib/eal: add temporal store memcpy support for AMD platform

> > > > Hi Mattias,
> > > >
> > > > 6) What is the use-case for this? When would a user *want* to use this
> > > > instead of rte_memcpy()?
> > > > If the data being loaded is relevant to datapath/packets, presumably
> > > > other packets might require the loaded data, so temporal (normal) loads
> > > > should be used to cache the source data?
> > >
> > > I'm not sure if your first question is rhetorical or not, but a memcpy()
> > > in an NT variant is certainly useful. One use case for a memcpy() with
> > > temporal loads and non-temporal stores is if you need to archive packet
> > > payload for (distant, potential) future use, and want to avoid causing
> > > unnecessary LLC evictions while doing so.
> >
> > Yes, I agree that there are certainly benefits in using cache-locality hints.
> > There is an open question around whether the src or dst or both are non-temporal.
> >
> > In the implementation of this patch, the NT/T type of store is reversed from
> > your use-case:
> > 1) Loads are NT (so loaded data is not cached for future packets)
> > 2) Stores are T (so copied/dst data is now resident in L1/L2)
> >
> > In theory there might even be valid uses for this type of memcpy, where loaded
> > data is not needed again soon and stored data is referenced again soon,
> > although I cannot think of any while typing this mail..
> >
> > I think some use-case examples, and clear documentation on when/how to choose
> > between rte_memcpy() or any (potential future) rte_memcpy_nt() variants, are
> > required to progress this patch.
> >
> > Assuming a strong use-case exists, and it can be clearly indicated to users of
> > DPDK APIs which rte_memcpy() to use, we can look at technical details around
> > enabling the implementation.
> >
>
> +1 here.
> Function behaviour and restrictions (src parameter needs to be 16/32 B aligned,
> etc.), along with expected usage scenarios, have to be documented properly.
> Again, as Harry pointed out, I don't see any AMD-specific instructions in this
> function, so presumably such a function can go into an __AVX2__ code block and
> no new defines will be required.
>
> Agreed that the APIs are generic, but we've kept them under an AMD flag for the
> simple reason that they are NOT tested on any other platform.
> A use-case showing how to use this was planned earlier for the mlx5 PMD, but it
> was dropped in this version of the patch, as the data path of mlx5 is going to
> be refactored soon and may not be useful for future versions of mlx5 (>22.02).
> Ref link:
> https://patchwork.dpdk.org/project/dpdk/patch/20211019104724.19416-2-aman.ku...@vvdntech.in/
> (we have a plan to adapt this in a future version)
> The patch in the link basically enhances the mlx5 MPRQ implementation for our
> specific use-case, and with a 128B packet size we achieve ~60% better perf.
> We understand that the use of this copy function should be documented, which we
> plan to do along with a few other platform-specific optimizations in future
> versions of DPDK. As this does not conflict with other platforms, can we still
> keep it under the AMD flag for now, as suggested by Thomas?

From what I read above, the patch is in a sort of half-ready stage.
Why rush here and try to push into DPDK things that don't fulfil DPDK policy?
Probably better to do all the missing parts first (docs, tests, etc.) and then
come up with an updated version.
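
For reference, a minimal sketch of the two copy flavours being discussed, written
with plain AVX2 intrinsics only (which is also why, as noted above, such a function
could live under an __AVX2__ block rather than behind a new platform define). This
is an illustration, not the patch code: the names nt_load_copy()/nt_store_copy()
are hypothetical, and the sketch assumes 32-byte-aligned pointers, a length that is
a multiple of 32 bytes, and a build with -mavx2.

#include <stddef.h>
#include <immintrin.h>

/*
 * NT loads + temporal stores: the direction taken by this patch.
 * The source gets a "do not cache" hint; the copy stays cache-resident.
 */
static void
nt_load_copy(void *dst, const void *src, size_t len)
{
	const __m256i *s = (const __m256i *)src;   /* assumed 32B-aligned */
	__m256i *d = (__m256i *)dst;

	for (size_t i = 0; i < len / 32; i++) {
		/* VMOVNTDQA: non-temporal hint on the load */
		__m256i v = _mm256_stream_load_si256(s + i);
		/* regular store: destination lands in L1/L2 */
		_mm256_storeu_si256(d + i, v);
	}
}

/*
 * Temporal loads + NT stores: the "archive packet payload without
 * polluting the LLC" use case Mattias describes above.
 */
static void
nt_store_copy(void *dst, const void *src, size_t len)
{
	const __m256i *s = (const __m256i *)src;
	__m256i *d = (__m256i *)dst;               /* assumed 32B-aligned */

	for (size_t i = 0; i < len / 32; i++) {
		__m256i v = _mm256_loadu_si256(s + i);
		/* VMOVNTDQ: write around the cache hierarchy */
		_mm256_stream_si256(d + i, v);
	}
	/* NT stores are weakly ordered; fence before the data is consumed */
	_mm_sfence();
}

Either flavour runs on any x86 CPU with AVX2; the trade-off is only about which
side of the copy is expected to be touched again soon.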