On 22/07/2022 11:44, Morten Brørup wrote:
From: Konstantin Ananyev [mailto:konstantin.v.anan...@yandex.ru]
Sent: Friday, 22 July 2022 01.20

Hi Morten,

This RFC proposes a set of functions optimized for non-temporal
memory copy.

At this stage, I am asking for feedback on the concept.

Applications sometimes copy data to another memory location, which is only
used much later.
In this case, it is inefficient to pollute the data cache with the copied
data.

An example use case (originating from a real-life application):
Copying filtered packets, or the first part of them, into a capture buffer
for offline analysis.

The purpose of these functions is to achieve a performance gain by not
polluting the cache when copying data.
Although the throughput may be improved by further optimization, I do not
consider throughput optimization relevant initially.

The x86 non-temporal load instructions have 16 byte alignment requirements
[1], while ARM non-temporal load instructions are available with 4 byte
alignment requirements [2].
Both platforms offer non-temporal store instructions with 4 byte alignment
requirements.
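
For reference, the corresponding x86 intrinsics are sketched below
(illustrative only, not part of the proposed patch; the function name is
made up):

#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_stream_si32 */
#include <smmintrin.h>  /* SSE4.1: _mm_stream_load_si128 */

/* Illustrative sketch: one 16 byte non-temporal load/store pair on x86.
 * movntdqa and movntdq both require 16 byte aligned addresses;
 * the 4 byte non-temporal store is movnti (_mm_stream_si32). */
static inline void
nt_copy16(void *dst, void *src)
{
	__m128i xmm = _mm_stream_load_si128((__m128i *)src); /* movntdqa */
	_mm_stream_si128((__m128i *)dst, xmm);               /* movntdq */
}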

In addition to the primary function without any alignment requirements, we
also provide functions for respectively 16 and 4 byte aligned access, for
performance purposes.

The function names resemble standard C library function names, but their
signatures are intentionally different. No need to drag legacy into it.

NB: Don't comment on spaces for indentation; a patch will follow DPDK coding
style and use TAB.


I think there were discussions in the other direction - remove rte_memcpy()
completely and use memcpy() instead...

Yes, the highly optimized rte_memcpy() implementation of memcpy() has become 
obsolete, now that modern compilers provide an efficient memcpy() 
implementation.

It's an excellent reference, because we should learn from it, and avoid 
introducing similar mistakes with non-temporal memcpy.

But if we have a good use case for that, then I am positive in principle.

The standard C library doesn't offer non-temporal memcpy(), so we need to 
implement it ourselves.

Though I think we need a clear use case within DPDK for it to demonstrate a
performance gain.

The performance gain comes from not polluting the data cache. DPDK example
applications, like l3fwd, are probably too primitive to measure any benefit
in this regard.

Probably copying packets within the pdump lib, or examples/dma, or ...

Good point - the new functions should be used somewhere within DPDK. For this 
purpose, I will look into modifying rte_pktmbuf_copy(), which is used by 
pdump_copy(), to use non-temporal copying of the packet data.

Another thought - do we really need a separate inline function for each
flavour?
Might be just one non-inline rte_memcpy_nt(dst, src, size, flags), where
flags could be a combination of NT_SRC and NT_DST, and keep the alignment
detection/decisions to the particular implementation?
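
Something like this, perhaps (a rough sketch; the flag names and the
run-time dispatch are illustrative only):

#include <string.h>
#include <stdint.h>

#define NT_SRC 1u /* hint: source data is non-temporal */
#define NT_DST 2u /* hint: destination is non-temporal */

/* Rough sketch of a single, non-inline entry point; the alignment
 * detection stays inside the implementation, hidden from the application. */
void
rte_memcpy_nt(void *dst, const void *src, size_t size, unsigned int flags)
{
	if ((flags & (NT_SRC | NT_DST)) != 0 &&
			((uintptr_t)dst & 15) == 0 &&
			((uintptr_t)src & 15) == 0 &&
			(size & 15) == 0)
		rte_memcpy_nt16a(dst, src, size); /* aligned variant, see below */
	else
		memcpy(dst, src, size); /* fall back to a normal copy */
}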

Thank you for the feedback, Konstantin.

My answer to this suggestion gets a little long-winded...

Looking at the DPDK pcapng library, it copies a 4 byte aligned metadata
structure of 28 bytes. So it can make do with the 4 byte aligned functions.

Our application can capture packets starting at the IP header, which is
offset by 14 bytes (the Ethernet header size) from the packet buffer, so it
requires 2 byte alignment. Thus, requiring 4 byte alignment is not
acceptable.

Our application uses 16 byte alignment in the capture buffer area, and can 
benefit from 16 byte aligned functions. Furthermore, x86 processors require 16 
byte alignment for non-temporal load instructions, so I think a 16 byte aligned 
non-temporal memcpy function should be offered.


Yes, x86 needs 16B alignment for NT loads/stores.
But that's supposed to be an arch-specific limitation that we probably want
to hide, no?
Inside, the function can check the alignment of both src and dst and decide
whether it should use NT load/store instructions or just do a normal copy.


While working on these functions, I experimented with an rte_memcpy_nt()
taking flags, which is also my personal preference, but I haven't succeeded
yet. Especially when copying a 16 byte aligned structure of only 16 bytes,
the overhead of the function call + comparing the flags + the copy loop is
significant, compared to inline code consisting of only one pair of
"movntdqa (%rsi),%xmm0; movntdq %xmm0,(%rdi)" instructions.
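
For illustration, copying such a 16 byte aligned, 16 byte structure inlines
to just that instruction pair (a sketch; the struct and function names are
made up, and the DPDK headers from this patch are assumed):

#include <stdint.h>

struct meta16 {
	uint64_t a, b; /* 16 bytes */
} __rte_aligned(16);

static inline void
capture_meta16(struct meta16 *dst, const struct meta16 *src)
{
	/* Inlines to: movntdqa (%rsi),%xmm0; movntdq %xmm0,(%rdi) */
	rte_memcpy_nt16a(dst, src, sizeof(*dst));
}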

Remember that a non-inlined rte_memcpy_nt() will be called with highly
varying sizes, due to the typical mix of small and big packets, so branch
prediction will not help.

This RFC does not yet show the rte_memcpy_nt() function handling unaligned 
load/store, but it is more complex than the aligned functions. So I think the 
aligned variants are warranted - for performance reasons.

Some of the need for exposing individual functions for different alignments
stems from the compiler being unable to determine the alignment of the
source and destination pointers at build time. So we need to help the
compiler with this at build time, and thus the need for inlining the
function. Whether we expose a bunch of small inline functions or one big
inline function with flags seems to be a matter of taste.

Thinking about it, you are probably right that exposing a single function with 
flags is better for documentation purposes and easier for other architectures 
to implement. But it still needs to be inline, for the reasons described above.
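
For example, something like this could work (a sketch only; the flag names
and the function name are invented here):

#define RTE_MEMCPY_NT_F_16A 0x1u /* hypothetical flag: 16 byte aligned */
#define RTE_MEMCPY_NT_F_4A  0x2u /* hypothetical flag: 4 byte aligned */

/* With __rte_always_inline and compile-time constant flags, the compiler
 * prunes the untaken branches, so the flags cost nothing at run time. */
static __rte_always_inline void
rte_memcpy_nt_ex(void * __rte_restrict dst, const void * __rte_restrict src,
		size_t len, const unsigned int flags)
{
	if (flags & RTE_MEMCPY_NT_F_16A)
		rte_memcpy_nt16a(dst, src, len);
	else if (flags & RTE_MEMCPY_NT_F_4A)
		rte_memcpy_nt4a(dst, src, len);
	else
		rte_memcpy_nt(dst, src, len); /* no alignment guarantees */
}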


Ok, my initial thought was that the main use case for it would be copying
big chunks of data, but from your description it might not be the case.
Yes, for just a 16/32B copy, the function call overhead might be way too
high...
As another alternative - would memcpy_nt_bulk() help somehow?
It could do the copying for several src/dst pairs at once, and that might
help to amortize the cost of the function call.
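
Maybe with a signature along these lines (purely hypothetical):

/* Hypothetical: copy several src/dst pairs in one call, amortizing the
 * cost of the function call over the whole burst. */
struct rte_memcpy_nt_op {
	void *dst;
	const void *src;
	size_t len;
};

void rte_memcpy_nt_bulk(const struct rte_memcpy_nt_op *ops, unsigned int n);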





[1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_stream_load
[2] https://developer.arm.com/documentation/100076/0100/A64-Instruction-Set-Reference/A64-Floating-point-Instructions/LDNP--SIMD-and-FP-

V2:
- Only copy from non-temporal source to non-temporal destination.
  I.e. remove the two variants with only the source and/or destination being
  non-temporal.
- Do not require alignment.
  Instead, offer additional 4 and 16 byte aligned functions, for performance
  purposes.
- Implemented two of the functions for x86.
- Removed the memset function.

Signed-off-by: Morten Brørup <m...@smartsharesystems.com>
---

/**
 * @warning
 * @b EXPERIMENTAL: this API may change without prior notice.
 *
 * Copy data from non-temporal source to non-temporal destination.
 *
 * @param dst
 *   Pointer to the non-temporal destination of the data.
 *   Should be 4 byte aligned, for optimal performance.
 * @param src
 *   Pointer to the non-temporal source data.
 *   No alignment requirements.
 * @param len
 *   Number of bytes to copy.
 *   Should be divisible by 4, for optimal performance.
 */
__rte_experimental
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
	__access__(read_only, 2, 3)))
void rte_memcpy_nt(void * __rte_restrict dst, const void * __rte_restrict src,
	size_t len);
/* Implementation T.B.D. */

/**
 * @warning
 * @b EXPERIMENTAL: this API may change without prior notice.
 *
 * Copy data in blocks of 16 byte from aligned non-temporal source
 * to aligned non-temporal destination.
 *
 * @param dst
 *   Pointer to the non-temporal destination of the data.
 *   Must be 16 byte aligned.
 * @param src
 *   Pointer to the non-temporal source data.
 *   Must be 16 byte aligned.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 16.
 */
__rte_experimental
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
	__access__(read_only, 2, 3)))
void rte_memcpy_nt16a(void * __rte_restrict dst, const void * __rte_restrict src,
	size_t len)
{
	const void * const	end = RTE_PTR_ADD(src, len);

	RTE_ASSERT(rte_is_aligned(dst, sizeof(__m128i)));
	RTE_ASSERT(rte_is_aligned(src, sizeof(__m128i)));
	RTE_ASSERT((len & (sizeof(__m128i) - 1)) == 0);

	/* Copy large portion of data. */
	while (RTE_PTR_DIFF(end, src) >= 4 * sizeof(__m128i)) {
		register __m128i	xmm0, xmm1, xmm2, xmm3;

/* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
		xmm0 = _mm_stream_load_si128(RTE_PTR_ADD(src, 0 * sizeof(__m128i)));
		xmm1 = _mm_stream_load_si128(RTE_PTR_ADD(src, 1 * sizeof(__m128i)));
		xmm2 = _mm_stream_load_si128(RTE_PTR_ADD(src, 2 * sizeof(__m128i)));
		xmm3 = _mm_stream_load_si128(RTE_PTR_ADD(src, 3 * sizeof(__m128i)));
#pragma GCC diagnostic pop
		_mm_stream_si128(RTE_PTR_ADD(dst, 0 * sizeof(__m128i)), xmm0);
		_mm_stream_si128(RTE_PTR_ADD(dst, 1 * sizeof(__m128i)), xmm1);
		_mm_stream_si128(RTE_PTR_ADD(dst, 2 * sizeof(__m128i)), xmm2);
		_mm_stream_si128(RTE_PTR_ADD(dst, 3 * sizeof(__m128i)), xmm3);
		src = RTE_PTR_ADD(src, 4 * sizeof(__m128i));
		dst = RTE_PTR_ADD(dst, 4 * sizeof(__m128i));
	}

	/* Copy remaining data. */
	while (src != end) {
		register __m128i	xmm;

/* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
		xmm = _mm_stream_load_si128(src);
#pragma GCC diagnostic pop
		_mm_stream_si128(dst, xmm);
		src = RTE_PTR_ADD(src, sizeof(__m128i));
		dst = RTE_PTR_ADD(dst, sizeof(__m128i));
	}
}

/**
 * @warning
 * @b EXPERIMENTAL: this API may change without prior notice.
 *
 * Copy data in blocks of 4 byte from aligned non-temporal source
 * to aligned non-temporal destination.
 *
 * @param dst
 *   Pointer to the non-temporal destination of the data.
 *   Must be 4 byte aligned.
 * @param src
 *   Pointer to the non-temporal source data.
 *   Must be 4 byte aligned.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 4.
 */
__rte_experimental
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
	__access__(read_only, 2, 3)))
void rte_memcpy_nt4a(void * __rte_restrict dst, const void * __rte_restrict src,
	size_t len)
{
	int32_t			buf[sizeof(__m128i) / sizeof(int32_t)]
				__rte_aligned(sizeof(__m128i));
	/** Address of source data, rounded down to achieve alignment. */
	const void *		srca = RTE_PTR_ALIGN_FLOOR(src, sizeof(__m128i));
	/** Address of end of source data, rounded down to achieve alignment. */
	const void * const	srcenda = RTE_PTR_ALIGN_FLOOR(RTE_PTR_ADD(src, len),
				sizeof(__m128i));
	const int		offset = RTE_PTR_DIFF(src, srca) / sizeof(int32_t);
	register __m128i	xmm0;

	RTE_ASSERT(rte_is_aligned(dst, sizeof(int32_t)));
	RTE_ASSERT(rte_is_aligned(src, sizeof(int32_t)));
	RTE_ASSERT((len & (sizeof(int32_t) - 1)) == 0);

      if (unlikely(len == 0)) return;

	/* Copy first, non-__m128i aligned, part of source data. */
	if (offset) {
/* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
		xmm0 = _mm_stream_load_si128(srca);
#pragma GCC diagnostic pop
		_mm_store_si128((void *)buf, xmm0);
		switch (offset) {
		case 1:
			_mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[1]);
			if (unlikely(len == 1 * sizeof(int32_t))) return;
			_mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[2]);
			if (unlikely(len == 2 * sizeof(int32_t))) return;
			_mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)), buf[3]);
			break;
		case 2:
			_mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[2]);
			if (unlikely(len == 1 * sizeof(int32_t))) return;
			_mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[3]);
			break;
		case 3:
			_mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[3]);
			break;
		}
		/* Advance to the next 16 byte aligned source chunk. */
		srca = RTE_PTR_ADD(srca, sizeof(__m128i));
		dst = RTE_PTR_ADD(dst, (4 - offset) * sizeof(int32_t));
	}

	/* Copy middle, __m128i aligned, part of source data. */
	while (srca != srcenda) {
/* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
		xmm0 = _mm_stream_load_si128(srca);
#pragma GCC diagnostic pop
		_mm_store_si128((void *)buf, xmm0);
		_mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[0]);
		_mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[1]);
		_mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)), buf[2]);
		_mm_stream_si32(RTE_PTR_ADD(dst, 3 * sizeof(int32_t)), buf[3]);
		srca = RTE_PTR_ADD(srca, sizeof(__m128i));
		dst = RTE_PTR_ADD(dst, 4 * sizeof(int32_t));
	}

	/* Copy last, non-__m128i aligned, part of source data. */
	if ((size_t)RTE_PTR_DIFF(srca, src) != len) {
		/* Remaining length is 1, 2 or 3 times sizeof(int32_t). */
		const size_t	rem = len - RTE_PTR_DIFF(srca, src);

/* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
		xmm0 = _mm_stream_load_si128(srca);
#pragma GCC diagnostic pop
		_mm_store_si128((void *)buf, xmm0);
		_mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[0]);
		if (rem >= 2 * sizeof(int32_t))
			_mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[1]);
		if (rem == 3 * sizeof(int32_t))
			_mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)), buf[2]);
	}
}
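
A usage sketch for the functions above, based on the capture use case
discussed earlier (buffer and struct names are illustrative; the 28 byte
metadata header matches the pcapng example above):

#include <stdint.h>

struct cap_meta {
	uint32_t words[7]; /* 28 byte, 4 byte aligned metadata header */
};

/* capture_buf is assumed to be 16 byte aligned. */
static void
capture_packet(void *capture_buf, const struct cap_meta *meta,
		const void *pkt_data, size_t pkt_len)
{
	/* 4 byte aligned source and destination, length divisible by 4. */
	rte_memcpy_nt4a(capture_buf, meta, sizeof(*meta));
	/* Packet data with no alignment guarantees: use the generic variant. */
	rte_memcpy_nt(RTE_PTR_ADD(capture_buf, sizeof(*meta)), pkt_data, pkt_len);
}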



