On 2021-10-27 13:03, Van Haaren, Harry wrote: >> -----Original Message----- >> From: dev <dev-boun...@dpdk.org> On Behalf Of Thomas Monjalon >> Sent: Wednesday, October 27, 2021 9:13 AM >> To: Aman Kumar <aman.ku...@vvdntech.in> >> Cc: dev@dpdk.org; viachesl...@nvidia.com; Burakov, Anatoly >> <anatoly.bura...@intel.com>; keesang.s...@amd.com; >> aman.ku...@vvdntech.in; jerinjac...@gmail.com; Ananyev, Konstantin >> <konstantin.anan...@intel.com>; Richardson, Bruce >> <bruce.richard...@intel.com>; honnappa.nagaraha...@arm.com; Ruifeng Wang >> <ruifeng.w...@arm.com>; David Christensen <d...@linux.vnet.ibm.com>; >> david.march...@redhat.com; step...@networkplumber.org >> Subject: Re: [dpdk-dev] [PATCH v4 2/2] lib/eal: add temporal store memcpy >> support for AMD platform >> >> 27/10/2021 09:28, Aman Kumar: >>> This patch provides a rte_memcpy* call with temporal stores. >>> Use -Dcpu_instruction_set=znverX with build to enable this API. >>> >>> Signed-off-by: Aman Kumar <aman.ku...@vvdntech.in> >> For the series, Acked-by: Thomas Monjalon <tho...@monjalon.net> >> With the hope that such optimization will go in libc in a near future. >> >> If there is no objection, I will merge this AMD-specific series in 21.11-rc2. >> It should not affect other platforms. > Hi Folks, > > This patchset was brought to my attention, and I have a few concerns. > I'll add short snippets of context from the patch here so I can refer to it > below; > > +/** > + * Copy 16 bytes from one location to another, > + * with temporal stores > + */ > +static __rte_always_inline void > +rte_copy16_ts(uint8_t *dst, uint8_t *src) > +{ > + __m128i var128; > + > + var128 = _mm_stream_load_si128((__m128i *)src); > + _mm_storeu_si128((__m128i *)dst, var128); > +} > > 1) What is fundamentally specific to the znverX CPU? Is there any reason this > can not just be enabled for x86-64 generic with SSE4.1 ISA requirements? > _mm_stream_load_si128() is part of SSE4.1 > _mm_storeu_si128() is SSE2. > Using the intrinsics guide for lookup of intrinsics to ISA level: > https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html?wapkw=intrinsics%20guide#text=_mm_stream_load&ig_expand=6884 > > 2) Are -D options allowed to change/break API/ABI? > By allowing -Dcpu_instruction_set= to change available functions, any > application using it is no longer source-code (API) compatible with "DPDK" > proper. > This patch essentially splits a "DPDK" app to depend on "DPDK + CPU version > -D flag", in an incompatible way (no fallback?). > > 3) The stream load instruction used here *requires* 16-byte alignment for its > operand. > This is not documented, and worse, a uint8_t* is accepted, which is cast to > (__m128i *). > This cast hides the compiler warning for expanding type-alignments. > And the code itself is broken - passing a "src" parameter that is not 16-byte > aligned will segfault. > > 4) Temporal and Non-temporal are not logically presented here. > Temporal loads/stores are normal loads/stores. They use the L1/L2 caches. > Non-temporal loads/stores indicate that the data will *not* be used again in > a short space of time. > Non-temporal means "having no relation to time" according to my internet > search. > > 5) The *store* here uses a normal store (temporal, targets cache). The *load* > however is a streaming (non-temporal, no cache) load. > It is not clearly documented that A) stream load will be used. > The inverse is documented "copy with ts" aka, copy with temporal store. > Is documenting the store as temporal meant to imply that the load is > non-temporal? > > 6) What is the use-case for this? When would a user *want* to use this > instead of rte_memcpy()? > If the data being loaded is relevant to datapath/packets, presumably other > packets might require the > loaded data, so temporal (normal) loads should be used to cache the source > data?
I'm not sure if your first question is rhetorical or not, but a memcpy() in a NT variant is certainly useful. One use case for a memcpy() with temporal loads and non-temporal stores is if you need to archive packet payload for (distant, potential) future use, and want to avoid causing unnecessary LLC evictions while doing so. > 7) Why is streaming (non-temporal) loads & stores not used? I guess maybe > this is regarding the use-case, > but its not clear to me right now why loads are NT, and stores are T. > > All in all, I do not think merging this patch is a good idea. I would like to > understand the motivation for adding > this type of function, and then see it being done in a way that is clearly > documented regarding temporal loads/stores, > and not changing/adding APIs for specific CPUs. > > So apologies for late feedback, but this is not of high enough quality to be > merged to DPDK right now, NACK.