> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> Sent: Tuesday, 9 August 2022 13.53
> 
> On 2022-08-09 11:24, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> >> Sent: Sunday, 7 August 2022 22.41
> >>
> >> On 2022-07-29 18:05, Stephen Hemminger wrote:
> >>>
> >>> It makes sense in a few select places to use non-temporal copy.
> >>> But it would add unnecessary complexity to DPDK if every function
> in
> >> DPDK that could
> >>> cause a copy had a non-temporal variant.
> >>
> >> A NT load and NT store variant, plus a NT load+store variant. :)
> >
> > I considered this, but it adds complexity, and our use case only
> needs the NT load+store. So I decided to only provide that variant.
> >
> > I can prepare the API for all four combinations. The extended
> function would be renamed from rte_memcpy_nt_ex() to just
> rte_memcpy_ex(). And the rte_memcpy_nt() would be omitted, rather than
> just perform rte_memcpy_ex(dst,src,len,F_DST_NT|F_SRC_NT).
> >
> > What does the community prefer in this regard?
> >
> 
> I would suggest just having a single function, with a flags or an enum
> to signify, if load, store or both should be non-temporal. If all
> platforms honor all combinations is a different matter.

Good input, thank you!

I have finally released a patch, and am iterating through versions to fix minor 
bugs detected by the CI system.

The public API is now a single rte_memcpy_ex(dst, src, len, flags) function, 
where the flags are also used to request non-temporal load and/or store.
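
To illustrate the shape of such a flags-based API, here is a minimal sketch. It
is not the code from the patch: the names memcpy_ex_sketch, SKETCH_F_SRC_NT and
SKETCH_F_DST_NT are hypothetical, and only the NT store path is implemented
(x86 with SSE2, 16-byte aligned destination). Everything else falls back to a
plain memcpy(), and the NT load flag is accepted but ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical flag names, used only for this sketch. */
#define SKETCH_F_SRC_NT (1u << 0)  /* request non-temporal loads */
#define SKETCH_F_DST_NT (1u << 1)  /* request non-temporal stores */

static void *
memcpy_ex_sketch(void *dst, const void *src, size_t len, unsigned int flags)
{
    /* Reject flag bits this sketch does not define. */
    assert((flags & ~(SKETCH_F_SRC_NT | SKETCH_F_DST_NT)) == 0);

#if defined(__SSE2__)
    /* NT store path: needs a 16-byte aligned destination. */
    if ((flags & SKETCH_F_DST_NT) && ((uintptr_t)dst & 15) == 0) {
        uint8_t *d = dst;
        const uint8_t *s = src;
        size_t n = len & ~(size_t)15;  /* bytes copied in 16-byte chunks */
        size_t i;

        for (i = 0; i < n; i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
            _mm_stream_si128((__m128i *)(d + i), v);
        }
        /* NT stores are weakly ordered; fence before any later stores. */
        _mm_sfence();
        memcpy(d + n, s + n, len - n);  /* remaining tail, normal stores */
        return dst;
    }
#endif
    /* Fallback: the flags are treated as hints and ignored. */
    memcpy(dst, src, len);
    return dst;
}
```

A packet capture path would call memcpy_ex_sketch(dst, src, len,
SKETCH_F_DST_NT) when the application keeps using the source data, and add
SKETCH_F_SRC_NT when it does not.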

> 
> Is there something that suggests that this particular use case will be
> more common than others? When I've used non-temporal memcpy(), only the
> store side was NT, since the application would go on an use the source
> data.

OK. For completeness, all three variants are now implemented: NT destination, 
NT source, and NT source and destination.

> 
> >>
> >>>
> >>> Maybe just having rte_memcpy have a threshold (config value?) that
> if
> >> copy is larger than
> >>> a certain size, then it would automatically be non-temporal.  Small
> >> copies wouldn't matter,
> >>> the optimization is more about not stopping cache size issues with
> >> large streams of data.
> >>
> >> I don't think there's any way for rte_memcpy() to know if the
> >> application plan to use the source, the destination, both, or
> neither
> >> of
> >> the buffers in the immediate future.
> >
> > Agree. Which is why explicit NT function variants should be offered.
> >
> >> For huge copies (MBs or more) the
> >> size heuristic makes sense, but for medium sized copies (say a
> packet
> >> worth of data), I'm not so sure.
> >
> > This is the behavior of glibc memcpy().
> >
> 
> Yes, but, from what I can tell, glibc issues a sfence at the end of the
> copy.
> 
> Have a non-temporal memcpy() with a different memory model than the
> compiler intrinsic memcpy(), the glibc memcpy() and the DPDK
> rte_memcpy() implementations seems like asking for trouble.
> 
> >>
> >> What is unclear to me is if there is a benefit (or drawback) of
> using
> >> the imaginary rte_memcpy_nt(), compared to doing rte_memcpy() +
> >> clflushopt or cldemote, in the typical use case (if there is such).
> >>
> >
> > Our use case is packet capture (copying) to memory, where the copies
> will be read much later, so there is no need to pollute the cache with
> the copies.
> >
> 
> If you flush/demote the cache line you've used more or less
> immediately,
> there won't be much pollution. Especially if you include the
> clflushopt/cldemote into the copying routine, as opposed to a large
> flush at the end.

The source data may already be in cache, and some applications might continue
using it after the non-temporal memcpy; in this case, flushing the source data
from the cache would be counterproductive.

However, flushing the destination from the cache after a normal copy might be
simpler than using the non-temporal store instructions. Unfortunately, I didn't
have time to explore this alternative.
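
For reference, a rough and unbenchmarked sketch of that alternative could look
like the following. The function name and the 64-byte cache line size are
assumptions. It uses clflush because that instruction is part of SSE2;
clflushopt and cldemote need newer ISA extensions and matching compiler flags.
On targets without SSE2 it degrades to a plain memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_clflush */
#endif

#define SKETCH_CACHE_LINE 64u  /* assumed cache line size */

/* Hypothetical helper: normal copy, then evict the destination lines. */
static void *
memcpy_flush_dst_sketch(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);

    memcpy(dst, src, len);
#if defined(__SSE2__)
    if (len > 0) {
        /* Start at the cache line that contains the first byte. */
        uintptr_t line = (uintptr_t)dst & ~(uintptr_t)(SKETCH_CACHE_LINE - 1);
        uintptr_t end = (uintptr_t)dst + len;

        /* Flush every line touched by the copy. Partially covered lines
         * at either end are flushed too, which also evicts their
         * neighbouring data. */
        for (; line < end; line += SKETCH_CACHE_LINE)
            _mm_clflush((const void *)line);
    }
#endif
    return dst;
}
```

Flushing inside the copy routine, line by line, rather than in one pass after a
large copy, would need the loop to be interleaved with the copy itself; the
sketch above does not attempt that.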

> 
> I haven't tried this in practice, but it seems to me it's an option
> worth exploring. It could be a way to implement a portable NT memcpy(),
> if nothing else.
> 
> > Our application also doesn't look deep inside the original packets
> after copying them, there is also no need to pollute the cache with the
> originals.
> >
> 
> See above.
> 
> > And even though the application looked partially into the packets
> before copying them (and thus they are partially in cache) using NT
> load (instead of normal load) has no additional cost.
> >
