> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> Sent: Tuesday, 9 August 2022 13.53
>
> On 2022-08-09 11:24, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> >> Sent: Sunday, 7 August 2022 22.41
> >>
> >> On 2022-07-29 18:05, Stephen Hemminger wrote:
> >>>
> >>> It makes sense in a few select places to use non-temporal copy.
> >>> But it would add unnecessary complexity to DPDK if every function
> >>> in DPDK that could cause a copy had a non-temporal variant.
> >>
> >> A NT load and NT store variant, plus a NT load+store variant. :)
> >
> > I considered this, but it adds complexity, and our use case only
> > needs the NT load+store. So I decided to only provide that variant.
> >
> > I can prepare the API for all four combinations. The extended
> > function would be renamed from rte_memcpy_nt_ex() to just
> > rte_memcpy_ex(). And the rte_memcpy_nt() would be omitted, rather
> > than just perform rte_memcpy_ex(dst,src,len,F_DST_NT|F_SRC_NT).
> >
> > What does the community prefer in this regard?
> >
>
> I would suggest just having a single function, with a flags or an enum
> to signify, if load, store or both should be non-temporal. If all
> platforms honor all combinations is a different matter.

Good input, thank you! I have finally released a patch, and am iterating through versions to fix minor bugs detected by the CI system. The public API is now a single rte_memcpy_ex(dst, src, len, flags) function, where the flags are also used to request non-temporal load and/or store.

>
> Is there something that suggests that this particular use case will be
> more common than others? When I've used non-temporal memcpy(), only the
> store side was NT, since the application would go on an use the source
> data.

OK. For completeness, all three variants are now implemented: NT destination, NT source, and NT source and destination.

>
> >>
> >>>
> >>> Maybe just having rte_memcpy have a threshold (config value?) that
> >>> if copy is larger than a certain size, then it would automatically
> >>> be non-temporal. Small copies wouldn't matter, the optimization is
> >>> more about not stopping cache size issues with large streams of
> >>> data.
> >>
> >> I don't think there's any way for rte_memcpy() to know if the
> >> application plan to use the source, the destination, both, or
> >> neither of the buffers in the immediate future.
> >
> > Agree. Which is why explicit NT function variants should be offered.
> >
> >> For huge copies (MBs or more) the
> >> size heuristic makes sense, but for medium sized copies (say a
> >> packet worth of data), I'm not so sure.
> >
> > This is the behavior of glibc memcpy().
> >
>
> Yes, but, from what I can tell, glibc issues a sfence at the end of the
> copy.
>
> Have a non-temporal memcpy() with a different memory model than the
> compiler intrinsic memcpy(), the glibc memcpy() and the DPDK
> rte_memcpy() implementations seems like asking for trouble.
>
> >>
> >> What is unclear to me is if there is a benefit (or drawback) of
> >> using the imaginary rte_memcpy_nt(), compared to doing rte_memcpy() +
> >> clflushopt or cldemote, in the typical use case (if there is such).
> >>
> >
> > Our use case is packet capture (copying) to memory, where the copies
> > will be read much later, so there is no need to pollute the cache
> > with the copies.
> >
>
> If you flush/demote the cache line you've used more or less
> immediately, there won't be much pollution. Especially if you include
> the clflushopt/cldemote into the copying routine, as opposed to a large
> flush at the end.

The source data may already be in cache, and some applications might continue using it after the non-temporal memcpy; in this case, flushing the source data cache would be counterproductive. However, flushing the destination cache might be simpler than using the non-temporal store instructions. Unfortunately, I didn't have time to explore this alternative.

>
> I haven't tried this in practice, but it seems to me it's an option
> worth exploring. It could be a way to implement a portable NT memcpy(),
> if nothing else.
>
> > Our application also doesn't look deep inside the original packets
> > after copying them, there is also no need to pollute the cache with
> > the originals.
> >
>
> See above.
>
> > And even though the application looked partially into the packets
> > before copying them (and thus they are partially in cache) using NT
> > load (instead of normal load) has no additional cost.
> >
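
For reference, a minimal sketch of the two destination-side options discussed in this thread: non-temporal stores followed by an sfence, and an ordinary copy followed by flushing the destination cache lines. This is not the code from the patch. memcpy_ex_sketch() and the SKETCH_F_* flags are hypothetical names, the 64-byte cache line size is an assumption, and _mm_clflush is used as a stand-in for clflushopt/cldemote only because it needs nothing beyond SSE2. On non-x86 builds the sketch simply falls back to memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_loadu_si128, _mm_stream_si128, _mm_clflush, _mm_sfence */
#endif

/* Hypothetical flag names, for illustration only. */
#define SKETCH_F_DST_NT    (1u << 0) /* store with a non-temporal hint */
#define SKETCH_F_DST_FLUSH (1u << 1) /* ordinary copy, then flush destination lines */

#define SKETCH_CACHE_LINE 64 /* assumed cache line size */

static void *
memcpy_ex_sketch(void *dst, const void *src, size_t len, unsigned int flags)
{
	assert(len == 0 || (dst != NULL && src != NULL));
#if defined(__SSE2__)
	uint8_t *d = dst;
	const uint8_t *s = src;

	/* _mm_stream_si128 requires a 16-byte aligned destination. */
	if ((flags & SKETCH_F_DST_NT) && ((uintptr_t)d & 15) == 0) {
		size_t i = 0;

		for (; i + 16 <= len; i += 16) {
			__m128i v = _mm_loadu_si128((const __m128i *)(s + i));
			_mm_stream_si128((__m128i *)(d + i), v);
		}
		/* Copy the tail (fewer than 16 bytes) with ordinary stores. */
		memcpy(d + i, s + i, len - i);
		/* Order the streaming stores before any later stores. */
		_mm_sfence();
		return dst;
	}

	if (flags & SKETCH_F_DST_FLUSH) {
		memcpy(dst, src, len);
		if (len > 0) {
			uintptr_t p = (uintptr_t)d & ~(uintptr_t)(SKETCH_CACHE_LINE - 1);

			/* Flush every cache line touched by the destination. */
			for (; p < (uintptr_t)d + len; p += SKETCH_CACHE_LINE)
				_mm_clflush((const void *)p);
		}
		return dst;
	}
#else
	(void)flags;
#endif
	/* No flags, unaligned destination, or no SSE2: plain copy. */
	return memcpy(dst, src, len);
}
```

A capture path would call it as memcpy_ex_sketch(dst, src, len, SKETCH_F_DST_NT). Whether the flush variant actually pollutes the cache less than the streaming stores is exactly the open question above; the sketch only shows that both fit behind the same flags argument.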