RE: [RFC v2] non-temporal memcpy

Morten Brørup Tue, 09 Aug 2022 08:00:55 -0700

> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> Sent: Tuesday, 9 August 2022 14.05
> 
> On 2022-08-09 11:46, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> >> Sent: Sunday, 7 August 2022 22.25
> >>
> >> On 2022-07-19 17:26, Morten Brørup wrote:
> >>> This RFC proposes a set of functions optimized for non-temporal
> >>> memory copy.
> >>>
> >>> At this stage, I am asking for feedback on the concept.
> >>>
> >>> Applications sometimes copy data to another memory location, which is
> >>> only used much later.
> >>> In this case, it is inefficient to pollute the data cache with the
> >>> copied data.
> >>>
> >>> An example use case (originating from a real life application):
> >>> Copying filtered packets, or the first part of them, into a capture
> >>> buffer for offline analysis.
> >>>
> >>> The purpose of these functions is to achieve a performance gain by not
> >>> polluting the cache when copying data.
> >>> Although the throughput may be improved by further optimization, I do
> >>> not consider throughput optimization relevant initially.
> >>>
> >>> The x86 non-temporal load instructions have 16 byte alignment
> >>> requirements [1], while ARM non-temporal load instructions are
> >>> available with 4 byte alignment requirements [2].
> >>> Both platforms offer non-temporal store instructions with 4 byte
> >>> alignment requirements.
> >>>
> >>
> >> I don't think memcpy() functions should have alignment requirements.
> >> That's not very practical, and violates the principle of least
> >> surprise.
> >
> > I didn't make the CPUs with these alignment requirements.
> >
> > However, I will offer optimized performance in a generic NT memcpy()
> > function in the cases where the individual alignment requirements of
> > various CPUs happen to be met.
> >
> >>
> >> Use normal memcpy() for the unaligned parts, and for the whole thing
> >> for small sizes (at least on x86).
> >>
> >
> > I'm not going to plunge into some advanced vector programming, so I'm
> > working on an implementation where misalignment is handled by using a
> > bounce buffer (allocated on the stack, which is probably cache hot
> > anyway).
> >
> >
> 
> I don't know about the NT load + NT store case, but for regular load + NT
> store, this is trivial. The implementation I've used is 36
> straightforward lines of code.


Is that implementation available for inspiration anywhere?
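
To make the "regular load + NT store" idea from the quoted suggestion concrete, here is a minimal sketch of how it could look on x86 with SSE2. It is not the implementation referred to above, which is not shown in this thread. The function name nt_store_memcpy and the 64-byte small-copy threshold are assumptions made for illustration only. The sketch uses normal memcpy() for the unaligned head, the tail, and small sizes, and falls back to plain memcpy() when SSE2 is not available.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Hypothetical helper (not an existing DPDK function): copy len bytes
 * using regular loads and non-temporal stores where possible.
 * No alignment is required of the caller.
 */
static void
nt_store_memcpy(void *dst, const void *src, size_t len)
{
	uint8_t *d = dst;
	const uint8_t *s = src;

#if defined(__SSE2__)
	/* Small copies go straight to memcpy(); the threshold is a guess. */
	if (len >= 64) {
		/* Head: regular copy up to the next 16-byte boundary of dst. */
		size_t head = (size_t)(-(uintptr_t)d & 15);

		memcpy(d, s, head);
		d += head;
		s += head;
		len -= head;

		/* Body: unaligned regular loads, 16-byte aligned NT stores. */
		while (len >= 16) {
			__m128i v = _mm_loadu_si128((const __m128i *)s);

			_mm_stream_si128((__m128i *)d, v);
			d += 16;
			s += 16;
			len -= 16;
		}

		/* Order the NT stores before any subsequent stores. */
		_mm_sfence();
	}
#endif

	/* Tail (or the whole copy on the fallback path): regular memcpy(). */
	memcpy(d, s, len);
}
```

Only the destination is aligned here, because _mm_stream_si128() requires a 16-byte aligned address while _mm_loadu_si128() accepts any source address. The NT load + NT store case discussed above is not covered by this sketch.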
