> -----Original Message-----
> From: dev <dev-boun...@dpdk.org> On Behalf Of Thomas Monjalon
> Sent: Wednesday, October 27, 2021 9:13 AM
> To: Aman Kumar <aman.ku...@vvdntech.in>
> Cc: dev@dpdk.org; viachesl...@nvidia.com; Burakov, Anatoly
> <anatoly.bura...@intel.com>; keesang.s...@amd.com;
> aman.ku...@vvdntech.in; jerinjac...@gmail.com; Ananyev, Konstantin
> <konstantin.anan...@intel.com>; Richardson, Bruce
> <bruce.richard...@intel.com>; honnappa.nagaraha...@arm.com; Ruifeng Wang
> <ruifeng.w...@arm.com>; David Christensen <d...@linux.vnet.ibm.com>;
> david.march...@redhat.com; step...@networkplumber.org
> Subject: Re: [dpdk-dev] [PATCH v4 2/2] lib/eal: add temporal store memcpy
> support for AMD platform
> 
> 27/10/2021 09:28, Aman Kumar:
> > This patch provides a rte_memcpy* call with temporal stores.
> > Use -Dcpu_instruction_set=znverX with build to enable this API.
> >
> > Signed-off-by: Aman Kumar <aman.ku...@vvdntech.in>
> 
> For the series, Acked-by: Thomas Monjalon <tho...@monjalon.net>
> With the hope that such optimization will go in libc in a near future.
> 
> If there is no objection, I will merge this AMD-specific series in 21.11-rc2.
> It should not affect other platforms.

Hi Folks,

This patchset was brought to my attention, and I have a few concerns.
I'll add short snippets of context from the patch here so I can refer to it 
below;

+/**
+ * Copy 16 bytes from one location to another,
+ * with temporal stores
+ */
+static __rte_always_inline void
+rte_copy16_ts(uint8_t *dst, uint8_t *src)
+{
+       __m128i var128;
+
+       var128 = _mm_stream_load_si128((__m128i *)src);
+       _mm_storeu_si128((__m128i *)dst, var128);
+}

1) What is fundamentally specific to the znverX CPU? Is there any reason this 
can not just be enabled for x86-64 generic with SSE4.1 ISA requirements?
_mm_stream_load_si128() is part of SSE4.1
_mm_storeu_si128() is SSE2. 
Using the intrinsics guide for lookup of intrinsics to ISA level: 
https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html?wapkw=intrinsics%20guide#text=_mm_stream_load&ig_expand=6884

2) Are -D options allowed to change/break API/ABI?
By allowing -Dcpu_instruction_set= to change available functions, any 
application using it is no longer source-code (API) compatible with "DPDK" 
proper.
This patch essentially splits a "DPDK" app to depend on "DPDK + CPU version -D 
flag", in an incompatible way (no fallback?).

3) The stream load instruction used here *requires* 16-byte alignment for its 
operand.
This is not documented, and worse, a uint8_t* is accepted, which is cast to 
(__m128i *).
This cast hides the compiler warning for expanding type-alignments.
And the code itself is broken - passing a "src" parameter that is not 16-byte 
aligned will segfault.

4) Temporal and Non-temporal are not logically presented here.
Temporal loads/stores are normal loads/stores. They use the L1/L2 caches.
Non-temporal loads/stores indicate that the data will *not* be used again in a 
short space of time.
Non-temporal means "having no relation to time" according to my internet search.

5) The *store* here uses a normal store (temporal, targets cache). The *load* 
however is a streaming (non-temporal, no cache) load.
It is not clearly documented that A) stream load will be used.
The inverse is documented "copy with ts" aka, copy with temporal store.
Is documenting the store as temporal meant to imply that the load is 
non-temporal?

6) What is the use-case for this? When would a user *want* to use this instead 
of rte_memcpy()?
If the data being loaded is relevant to datapath/packets, presumably other 
packets might require the
loaded data, so temporal (normal) loads should be used to cache the source data?

7) Why is streaming (non-temporal) loads & stores not used? I guess maybe this 
is regarding the use-case,
but its not clear to me right now why loads are NT, and stores are T.

All in all, I do not think merging this patch is a good idea. I would like to 
understand the motivation for adding
this type of function, and then see it being done in a way that is clearly 
documented regarding temporal loads/stores,
and not changing/adding APIs for specific CPUs.

So apologies for late feedback, but this is not of high enough quality to be 
merged to DPDK right now, NACK.

Reply via email to