> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> Sent: Sunday, 7 August 2022 22.25
> 
> On 2022-07-19 17:26, Morten Brørup wrote:
> > This RFC proposes a set of functions optimized for non-temporal
> memory copy.
> >
> > At this stage, I am asking for feedback on the concept.
> >
> > Applications sometimes copy data to another memory location, which is
> > only used much later.
> > In this case, it is inefficient to pollute the data cache with the
> copied
> > data.
> >
> > An example use case (originating from a real life application):
> > Copying filtered packets, or the first part of them, into a capture
> buffer
> > for offline analysis.
> >
> > The purpose of these functions is to achieve a performance gain by
> not
> > polluting the cache when copying data.
> > Although the throughput may be improved by further optimization, I do
> > not consider throughput optimization relevant initially.
> >
> > The x86 non-temporal load instructions have 16 byte alignment
> > requirements [1], while ARM non-temporal load instructions are
> available with
> > 4 byte alignment requirements [2].
> > Both platforms offer non-temporal store instructions with 4 byte
> alignment
> > requirements.
> >
> 
> I don't think memcpy() functions should have alignment requirements.
> That's not very practical, and violates the principle of least
> surprise.

I didn't make the CPUs with these alignment requirements.

However, I will offer optimized performance in a generic NT memcpy() function 
in the cases where the individual alignment requirements of various CPUs happen 
to be met.
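
As a rough illustration of "happen to be met": the check below is only a 
sketch, not the RFC's API. The helper name nt_fast_path_ok() and the idea of 
passing the alignment as a parameter are assumptions made for this example. A 
generic function could use such a test to pick the non-temporal path and fall 
back to a slower path otherwise.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the RFC): returns true when dst, src
 * and len are all multiples of `align`, which must be a power of two,
 * e.g. 16 for the x86 NT load case or 4 for the NT store case.
 */
static bool nt_fast_path_ok(const void *dst, const void *src,
                            size_t len, size_t align)
{
    /* OR the three values together; any low bit set means misalignment. */
    uintptr_t bits = (uintptr_t)dst | (uintptr_t)src | (uintptr_t)len;

    return (bits & (align - 1)) == 0;
}
```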

> 
> Use normal memcpy() for the unaligned parts, and for the whole thing
> for
> small sizes (at least on x86).
> 

I'm not going to plunge into some advanced vector programming, so I'm working 
on an implementation where misalignment is handled by using a bounce buffer 
(allocated on the stack, which is probably cache hot anyway).
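
To make the bounce buffer idea concrete, here is a minimal sketch. It is not 
the implementation I am working on: the names nt_memcpy_sketch() and 
nt_store_16(), the 256 byte buffer size, and the choice to handle the 
unaligned head and tail of the destination with plain memcpy() are all 
assumptions for illustration. It only uses the SSE2 non-temporal store 
(falling back to memcpy() on other targets) and does not attempt 
non-temporal loads.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define NT_ALIGN    16u   /* alignment needed by the NT store used here */
#define BOUNCE_SIZE 256u  /* assumed size; must be a multiple of NT_ALIGN */

/* Hypothetical helper: store 16 bytes; dst and src must be 16-byte aligned. */
static void nt_store_16(void *dst, const void *src)
{
#if defined(__SSE2__)
    _mm_stream_si128((__m128i *)dst, _mm_load_si128((const __m128i *)src));
#else
    memcpy(dst, src, NT_ALIGN); /* portable fallback, not non-temporal */
#endif
}

static void nt_memcpy_sketch(void *dst, const void *src, size_t len)
{
    uint8_t *d = dst;
    const uint8_t *s = src;
    _Alignas(16) uint8_t bounce[BOUNCE_SIZE]; /* on the stack, likely cache hot */

    /* Head: plain copy until the destination is 16-byte aligned. */
    size_t head = (size_t)(-(uintptr_t)d & (NT_ALIGN - 1));
    if (head > len)
        head = len;
    memcpy(d, s, head);
    d += head; s += head; len -= head;

    /* Body: the (possibly misaligned) source goes through the aligned
     * bounce buffer, then out to the aligned destination with NT stores. */
    while (len >= NT_ALIGN) {
        size_t chunk = len < BOUNCE_SIZE
                     ? (len & ~(size_t)(NT_ALIGN - 1)) : BOUNCE_SIZE;

        memcpy(bounce, s, chunk);
        for (size_t i = 0; i < chunk; i += NT_ALIGN)
            nt_store_16(d + i, bounce + i);
        d += chunk; s += chunk; len -= chunk;
    }

    /* Tail: fewer than 16 bytes remain; plain copy. */
    memcpy(d, s, len);

#if defined(__SSE2__)
    _mm_sfence(); /* order the NT stores before later stores */
#endif
}
```

The extra pass through the bounce buffer costs throughput, which fits the 
stated priority of avoiding cache pollution over raw copy speed.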

