> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> Sent: Tuesday, 9 August 2022 14.05
>
> On 2022-08-09 11:46, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> >> Sent: Sunday, 7 August 2022 22.25
> >>
> >> On 2022-07-19 17:26, Morten Brørup wrote:
> >>> This RFC proposes a set of functions optimized for non-temporal
> >>> memory copy.
> >>>
> >>> At this stage, I am asking for feedback on the concept.
> >>>
> >>> Applications sometimes copy data to another memory location, which
> >>> is only used much later.
> >>> In this case, it is inefficient to pollute the data cache with the
> >>> copied data.
> >>>
> >>> An example use case (originating from a real life application):
> >>> Copying filtered packets, or the first part of them, into a capture
> >>> buffer for offline analysis.
> >>>
> >>> The purpose of these functions is to achieve a performance gain by
> >>> not polluting the cache when copying data.
> >>> Although the throughput may be improved by further optimization, I
> >>> do not consider throughput optimization relevant initially.
> >>>
> >>> The x86 non-temporal load instructions have 16 byte alignment
> >>> requirements [1], while ARM non-temporal load instructions are
> >>> available with 4 byte alignment requirements [2].
> >>> Both platforms offer non-temporal store instructions with 4 byte
> >>> alignment requirements.
> >>>
> >>
> >> I don't think memcpy() functions should have alignment requirements.
> >> That's not very practical, and violates the principle of least
> >> surprise.
> >
> > I didn't make the CPUs with these alignment requirements.
> >
> > However, I will offer optimized performance in a generic NT memcpy()
> > function in the cases where the individual alignment requirements of
> > various CPUs happen to be met.
> >
> >>
> >> Use normal memcpy() for the unaligned parts, and for the whole thing
> >> for small sizes (at least on x86).
> >>
> >
> > I'm not going to plunge into some advanced vector programming, so I'm
> > working on an implementation where misalignment is handled by using a
> > bounce buffer (allocated on the stack, which is probably cache hot
> > anyway).
> >
>
> I don't know for the NT load + NT store case, but for regular load + NT
> store, this is trivial. The implementation I've used is 36
> straight-forward lines of code.

Is that implementation available for inspiration anywhere?
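
To make the question concrete, this is roughly the shape I would expect
for the regular load + NT store case. It is only a minimal sketch under
my own assumptions, not your implementation and not an existing DPDK
function: the name nt_store_memcpy() and the 64 byte small-copy threshold
are placeholders, it assumes x86 with SSE2 (anything else falls back to
plain memcpy()), and it uses 16 byte NT stores, so it aligns the
destination to 16 bytes first. The unaligned head and tail are copied
with normal memcpy(), as you suggested.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical helper: regular loads, non-temporal stores. */
static inline void
nt_store_memcpy(void *dst, const void *src, size_t len)
{
#if defined(__SSE2__)
	uint8_t *d = dst;
	const uint8_t *s = src;
	size_t head;

	/* Small copies use normal memcpy(); the threshold is a guess. */
	if (len < 64) {
		memcpy(d, s, len);
		return;
	}

	/* Head: normal copy up to the next 16 byte boundary of dst. */
	head = (16 - ((uintptr_t)d & 15)) & 15;
	memcpy(d, s, head);
	d += head;
	s += head;
	len -= head;

	/* Body: unaligned regular loads, 16 byte aligned NT stores. */
	while (len >= 16) {
		__m128i v = _mm_loadu_si128((const __m128i *)s);

		_mm_stream_si128((__m128i *)d, v);
		d += 16;
		s += 16;
		len -= 16;
	}

	/* Tail: normal copy of the remaining 0..15 bytes. */
	memcpy(d, s, len);

	/* Order the NT stores before any later stores. */
	_mm_sfence();
#else
	/* No SSE2 available: plain copy, no cache bypass. */
	memcpy(dst, src, len);
#endif
}
```

The source side has no alignment requirement here, because the loads are
ordinary unaligned loads; only the destination is aligned for the NT
stores.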