On 2022-07-29 18:05, Stephen Hemminger wrote:
On Fri, 29 Jul 2022 12:13:52 +0000
Konstantin Ananyev <konstantin.anan...@huawei.com> wrote:
Sorry, missed that part.
Another question - who will do 'sfence' after the copying?
Would it be inside memcpy_nt (seems quite costly), or would
it be another API function for that: memcpy_nt_flush() or so?
Outside. Only the developer knows when it is required, so it wouldn't make any
sense to add the cost inside memcpy_nt().
I don't think we should add a flush function; it would just be another name for
an already existing function. Referring to the required
operation in the memcpy_nt() function documentation should suffice.
Ok, but again wouldn't it be arch specific?
AFAIK for x86 it needs to boil down to sfence; for other architectures, I
don't know.
If you think there already is some generic one (rte_wmb?) that would always
produce correct instructions, sure, let's use it.
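
To make the "fence outside the copy" arrangement concrete, here is a minimal sketch in C. The names sketch_memcpy_nt() and sketch_nt_store_fence() are hypothetical and are not DPDK APIs. The x86 path assumes SSE2 intrinsics, a 16-byte aligned destination, and a length that is a multiple of 16. The non-x86 fallback (plain memcpy plus a release fence) is only a placeholder so the example builds elsewhere; it does not settle which instruction is correct on other architectures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2: _mm_stream_si128, _mm_loadu_si128, _mm_sfence */
#define SKETCH_HAVE_X86_NT 1
#else
#define SKETCH_HAVE_X86_NT 0
#endif

/* Hypothetical helper, not a DPDK API: copy using non-temporal stores.
 * Assumes dst is 16-byte aligned and len is a multiple of 16.
 * It deliberately issues no fence; ordering is left to the caller. */
static void sketch_memcpy_nt(void *dst, const void *src, size_t len)
{
    assert(((uintptr_t)dst & 15) == 0 && (len & 15) == 0);
#if SKETCH_HAVE_X86_NT
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < len / 16; i++)
        _mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));
#else
    memcpy(dst, src, len); /* placeholder: no NT stores on this path */
#endif
}

/* Hypothetical helper: the fence a caller issues once, after a batch of
 * NT copies, before making the data visible to another core or device. */
static void sketch_nt_store_fence(void)
{
#if SKETCH_HAVE_X86_NT
    _mm_sfence();
#else
    __atomic_thread_fence(__ATOMIC_RELEASE); /* assumption for non-x86 */
#endif
}
```

A caller would run several sketch_memcpy_nt() calls and then a single sketch_nt_store_fence() before publishing the buffers, which is why folding the fence into every copy would add avoidable cost.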
It makes sense in a few select places to use non-temporal copy.
But it would add unnecessary complexity to DPDK if every function in DPDK that
could cause a copy had a non-temporal variant.
An NT load and an NT store variant, plus an NT load+store variant. :)
Maybe just have rte_memcpy use a threshold (config value?) so that a copy
larger than a certain size automatically becomes non-temporal. Small copies
wouldn't matter; the optimization is more about avoiding cache-size issues
with large streams of data.
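
The threshold idea could be sketched roughly as follows. Everything here is illustrative: sketch_copy_auto() and SKETCH_NT_THRESHOLD are hypothetical names, the 1 MiB value is an arbitrary assumption, and none of it reflects how rte_memcpy() is actually implemented. The x86 path assumes SSE2; other targets simply fall back to memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2: _mm_stream_si128, _mm_loadu_si128, _mm_sfence */
#define SKETCH_HAVE_X86_NT 1
#else
#define SKETCH_HAVE_X86_NT 0
#endif

/* Assumed cut-over size; a real value would have to come from measurement. */
#define SKETCH_NT_THRESHOLD ((size_t)1024 * 1024)

/* Hypothetical size-dispatching copy, not a DPDK API.
 * Returns 1 if the non-temporal path was taken, 0 otherwise. */
static int sketch_copy_auto(void *dst, const void *src, size_t len)
{
#if SKETCH_HAVE_X86_NT
    if (len >= SKETCH_NT_THRESHOLD) {
        unsigned char *d = dst;
        const unsigned char *s = src;

        /* Copy up to 15 head bytes so the NT stores are 16-byte aligned. */
        size_t head = (size_t)((-(uintptr_t)d) & 15);
        memcpy(d, s, head);
        d += head; s += head; len -= head;

        for (; len >= 16; d += 16, s += 16, len -= 16)
            _mm_stream_si128((__m128i *)d,
                             _mm_loadu_si128((const __m128i *)s));

        /* The caller cannot tell which path ran, so fence here. */
        _mm_sfence();

        memcpy(d, s, len); /* remaining tail bytes, fewer than 16 */
        return 1;
    }
#endif
    memcpy(dst, src, len);
    return 0;
}
```

One consequence of hiding the choice inside the copy function is that the fence has to move inside as well, since the caller no longer knows whether non-temporal stores were used.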
I don't think there's any way for rte_memcpy() to know if the
application plans to use the source, the destination, both, or neither of
the buffers in the immediate future. For huge copies (MBs or more) the
size heuristic makes sense, but for medium-sized copies (say a packet's
worth of data), I'm not so sure.
What is unclear to me is whether there is a benefit (or drawback) of using
the imaginary rte_memcpy_nt(), compared to doing rte_memcpy() +
clflushopt or cldemote, in the typical use case (if there is one).