> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> Sent: Sunday, 7 August 2022 22.25
>
> On 2022-07-19 17:26, Morten Brørup wrote:
> > This RFC proposes a set of functions optimized for non-temporal
> > memory copy.
> >
> > At this stage, I am asking for feedback on the concept.
> >
> > Applications sometimes copy data to another memory location, which is
> > only used much later.
> > In this case, it is inefficient to pollute the data cache with the
> > copied data.
> >
> > An example use case (originating from a real life application):
> > Copying filtered packets, or the first part of them, into a capture
> > buffer for offline analysis.
> >
> > The purpose of these functions is to achieve a performance gain by not
> > polluting the cache when copying data.
> > Although the throughput may be improved by further optimization, I do
> > not consider throughput optimization relevant initially.
> >
> > The x86 non-temporal load instructions have 16 byte alignment
> > requirements [1], while ARM non-temporal load instructions are
> > available with 4 byte alignment requirements [2].
> > Both platforms offer non-temporal store instructions with 4 byte
> > alignment requirements.
> >
>
> I don't think memcpy() functions should have alignment requirements.
> That's not very practical, and violates the principle of least
> surprise.

I didn't make the CPUs with these alignment requirements. However, I will offer optimized performance in a generic NT memcpy() function in the cases where the individual alignment requirements of various CPUs happen to be met.

>
> Use normal memcpy() for the unaligned parts, and for the whole thing for
> small sizes (at least on x86).
>

I'm not going to plunge into some advanced vector programming, so I'm working on an implementation where misalignment is handled by using a bounce buffer (allocated on the stack, which is probably cache hot anyway).
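
To make the bounce buffer idea concrete, here is a minimal sketch of one way it could look. It is not the implementation under development. The names sketch_memcpy_nt() and nt_store_block(), the 64 byte bounce buffer size, and the 16 byte alignment for the fast path are all assumptions made for illustration. On x86 with SSE2 it uses the _mm_stream_si128() store intrinsic; elsewhere it falls back to plain memcpy(). Source loads are ordinary loads here, since a non-temporal load would need SSE4.1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define NT_ALIGN 16       /* assumed alignment for the NT fast path */
#define NT_BOUNCE_SIZE 64 /* assumed bounce buffer size, multiple of NT_ALIGN */

/* Hypothetical helper: dst and src are 16 byte aligned and len is a
 * multiple of 16. Uses non-temporal stores where SSE2 is available. */
static void nt_store_block(void *dst, const void *src, size_t len)
{
    assert(((uintptr_t)dst & (NT_ALIGN - 1)) == 0);
    assert(((uintptr_t)src & (NT_ALIGN - 1)) == 0);
    assert((len & (NT_ALIGN - 1)) == 0);
#if defined(__SSE2__)
    for (size_t i = 0; i < len; i += 16) {
        /* Regular aligned load, then a store that bypasses the cache. */
        __m128i v = _mm_load_si128((const __m128i *)((const char *)src + i));
        _mm_stream_si128((__m128i *)((char *)dst + i), v);
    }
    _mm_sfence(); /* order the NT stores before later stores */
#else
    memcpy(dst, src, len); /* no NT instructions in this sketch */
#endif
}

/* Hypothetical NT memcpy without alignment requirements on the caller. */
static void *sketch_memcpy_nt(void *dst, const void *src, size_t len)
{
    char *d = dst;
    const char *s = src;
    _Alignas(NT_ALIGN) char bounce[NT_BOUNCE_SIZE];

    /* Head: plain copy until dst reaches a 16 byte boundary. */
    size_t head = (size_t)(-(uintptr_t)d & (NT_ALIGN - 1));
    if (head > len)
        head = len;
    memcpy(d, s, head);
    d += head; s += head; len -= head;

    /* Body: dst is aligned now. If src is not, stage each chunk in the
     * aligned bounce buffer on the stack before storing it. */
    while (len >= NT_ALIGN) {
        size_t chunk = len < NT_BOUNCE_SIZE
                     ? (len & ~(size_t)(NT_ALIGN - 1)) : NT_BOUNCE_SIZE;
        if (((uintptr_t)s & (NT_ALIGN - 1)) == 0) {
            nt_store_block(d, s, chunk);
        } else {
            memcpy(bounce, s, chunk);
            nt_store_block(d, bounce, chunk);
        }
        d += chunk; s += chunk; len -= chunk;
    }

    /* Tail: fewer than 16 bytes left, plain copy. */
    memcpy(d, s, len);
    return dst;
}
```

A caller would use it like memcpy(), e.g. sketch_memcpy_nt(capture_buf, pkt_data, pkt_len), where those argument names are placeholders. The unaligned head and tail go through normal memcpy(), so only the aligned middle of the destination bypasses the cache.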