On 2022-08-09 17:00, Morten Brørup wrote:
From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
Sent: Tuesday, 9 August 2022 14.05

On 2022-08-09 11:46, Morten Brørup wrote:
From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
Sent: Sunday, 7 August 2022 22.25

On 2022-07-19 17:26, Morten Brørup wrote:
This RFC proposes a set of functions optimized for non-temporal
memory copy.

At this stage, I am asking for feedback on the concept.

Applications sometimes copy data to another memory location, which is
only used much later.
In this case, it is inefficient to pollute the data cache with the
copied data.

An example use case (originating from a real-life application):
Copying filtered packets, or the first part of them, into a capture
buffer for offline analysis.

The purpose of these functions is to achieve a performance gain by
not polluting the cache when copying data.
Although the throughput may be improved by further optimization, I do
not consider throughput optimization relevant initially.

The x86 non-temporal load instructions have 16 byte alignment
requirements [1], while ARM non-temporal load instructions are
available with 4 byte alignment requirements [2].
Both platforms offer non-temporal store instructions with 4 byte
alignment requirements.


I don't think memcpy() functions should have alignment requirements.
That's not very practical, and violates the principle of least
surprise.

I didn't make the CPUs with these alignment requirements.

However, I will offer optimized performance in a generic NT memcpy()
function in the cases where the individual alignment requirements of
various CPUs happen to be met.


Use normal memcpy() for the unaligned parts, and for the whole thing
for small sizes (at least on x86).


I'm not going to plunge into some advanced vector programming, so I'm
working on an implementation where misalignment is handled by using a
bounce buffer (allocated on the stack, which is probably cache hot
anyway).



I don't know for the NT load + NT store case, but for regular load + NT
store, this is trivial. The implementation I've used is 36
straightforward lines of code.

Is that implementation available for inspiration anywhere?

#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

#define CACHE_LINE_SIZE 64
#define NT_THRESHOLD (2 * CACHE_LINE_SIZE)

void nt_memcpy(void *__restrict dst, const void *__restrict src, size_t n)
{
        char *d = dst;
        const char *s = src;

        if (n < NT_THRESHOLD) {
                memcpy(d, s, n);
                return;
        }

        /* Copy the head with a regular memcpy() until dst is
         * cache-line aligned. */
        size_t n_unaligned = CACHE_LINE_SIZE - (uintptr_t)d % CACHE_LINE_SIZE;

        if (n_unaligned > n)
                n_unaligned = n;

        memcpy(d, s, n_unaligned);
        d += n_unaligned;
        s += n_unaligned;
        n -= n_unaligned;

        size_t num_lines = n / CACHE_LINE_SIZE;

        size_t i;
        for (i = 0; i < num_lines; i++) {
                size_t j;
                for (j = 0; j < CACHE_LINE_SIZE / sizeof(__m128i); j++) {
                        __m128i blk = _mm_loadu_si128((const __m128i *)s);
                        /* non-temporal store */
                        _mm_stream_si128((__m128i *)d, blk);
                        s += sizeof(__m128i);
                        d += sizeof(__m128i);
                }
                n -= CACHE_LINE_SIZE;
        }

        if (num_lines > 0)
                _mm_sfence();

        /* Copy the tail with a regular memcpy(). */
        memcpy(d, s, n);
}

(This was written as a part of a benchmark exercise, and hasn't been properly tested.)

Use this for inspiration, or I can DPDK-ify it and make it a proper patch/RFC. I would try to add support for NT load as well, and make both NT load and NT store depend on a flags parameter.

The above threshold is completely arbitrary. What you should keep in mind when thinking about the threshold is that it may well be worth suffering slightly lower performance from NT store + sfence (compared to a regular store), since you benefit from not trashing the cache.

For example, back-to-back copying of 1500-byte buffers with this routine is much slower than a regular memcpy() (measured in the core cycles spent in the copy), but nevertheless, in a real-world application it may still improve overall performance, since the packet copies don't evict useful data from the various caches. I know for sure that certain applications do benefit.
