On 2022-08-09 11:46, Morten Brørup wrote:
From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
Sent: Sunday, 7 August 2022 22.25

On 2022-07-19 17:26, Morten Brørup wrote:
This RFC proposes a set of functions optimized for non-temporal
memory copy.

At this stage, I am asking for feedback on the concept.

Applications sometimes copy data to another memory location, which is
only used much later.
In this case, it is inefficient to pollute the data cache with the
copied data.

An example use case (originating from a real life application):
Copying filtered packets, or the first part of them, into a capture
buffer for offline analysis.

The purpose of these functions is to achieve a performance gain by not
polluting the cache when copying data.
Although the throughput may be improved by further optimization, I do
not consider throughput optimization relevant initially.

The x86 non-temporal load instructions have 16 byte alignment
requirements [1], while ARM non-temporal load instructions are
available with 4 byte alignment requirements [2].
Both platforms offer non-temporal store instructions with 4 byte
alignment requirements.


I don't think memcpy() functions should have alignment requirements.
That's not very practical, and violates the principle of least
surprise.

I didn't make the CPUs with these alignment requirements.

However, I will offer optimized performance in a generic NT memcpy() function 
in the cases where the individual alignment requirements of various CPUs happen 
to be met.


Use normal memcpy() for the unaligned parts, and for the whole thing
for small sizes (at least on x86).


I'm not going to plunge into some advanced vector programming, so I'm working 
on an implementation where misalignment is handled by using a bounce buffer 
(allocated on the stack, which is probably cache hot anyway).
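To make the bounce buffer idea concrete, here is a rough sketch of one way it could look on x86. It is not the implementation being worked on; the function name memcpy_nt_bounce is hypothetical, and the sketch assumes a 16 byte aligned source, a 4 byte aligned destination, and SSE4.1 at compile time. Anything else falls back to plain memcpy().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE4_1__)
#include <smmintrin.h>
#endif

/*
 * Hypothetical helper, for illustration only.
 * NT loads need a 16 byte aligned source; the 4 byte NT stores need a
 * 4 byte aligned destination. Each 16 byte block passes through a small
 * aligned bounce buffer on the stack.
 */
static void *memcpy_nt_bounce(void *dst, const void *src, size_t len)
{
#if defined(__SSE4_1__)
    uint8_t *d = dst;
    const uint8_t *s = src;

    /* Alignment requirements not met: use the regular memcpy(). */
    if (((uintptr_t)s & 15) != 0 || ((uintptr_t)d & 3) != 0)
        return memcpy(dst, src, len);

    for (; len >= 16; d += 16, s += 16, len -= 16) {
        _Alignas(16) int32_t bounce[4];

        /* Non-temporal 16 byte load into the bounce buffer. */
        __m128i v = _mm_stream_load_si128((__m128i *)s);
        _mm_store_si128((__m128i *)bounce, v);

        /* Four non-temporal 4 byte stores to the destination. */
        for (int i = 0; i < 4; i++)
            _mm_stream_si32((int *)(d + 4 * i), bounce[i]);
    }

    /* Remaining 0..15 bytes: regular memcpy(). */
    memcpy(d, s, len);

    /* Order the NT stores before any later stores. */
    _mm_sfence();
    return dst;
#else
    /* No SSE4.1 at compile time: plain memcpy(). */
    return memcpy(dst, src, len);
#endif
}
```

The fallback keeps the function usable with any pointers, so callers only get the non-temporal path when the alignment happens to be right.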



I don't know about the NT load + NT store case, but for regular load + NT store, this is trivial. The implementation I've used is 36 straightforward lines of code.
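
For readers who want to see the shape of the regular load + NT store approach, here is a minimal sketch, assuming x86 with SSE2. It is not the 36-line implementation mentioned above; the name memcpy_nt_store and the 64 byte small-size threshold are made up for illustration. It uses normal memcpy() for small sizes and for the unaligned head and tail, and 16 byte non-temporal stores for the aligned middle.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical threshold; a real value would need benchmarking. */
#define NT_SMALL_THRESHOLD 64

/* Hypothetical helper: regular loads, non-temporal stores. */
static void *memcpy_nt_store(void *dst, const void *src, size_t len)
{
#if defined(__SSE2__)
    uint8_t *d = dst;
    const uint8_t *s = src;

    /* Small copies: not worth it, use the regular memcpy(). */
    if (len < NT_SMALL_THRESHOLD)
        return memcpy(dst, src, len);

    /* Head: regular memcpy() until the destination is 16 byte aligned. */
    size_t head = (size_t)(-(uintptr_t)d & 15);
    memcpy(d, s, head);
    d += head;
    s += head;
    len -= head;

    /* Body: unaligned regular loads, aligned non-temporal stores. */
    for (; len >= 16; d += 16, s += 16, len -= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v);
    }

    /* Tail: remaining 0..15 bytes with the regular memcpy(). */
    memcpy(d, s, len);

    /* Order the NT stores before any later stores. */
    _mm_sfence();
    return dst;
#else
    /* No SSE2 at compile time: plain memcpy(). */
    return memcpy(dst, src, len);
#endif
}
```

Only the destination alignment matters here, because the loads are ordinary unaligned loads; the caller sees the same contract as memcpy().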
