[dpdk-dev] rte_memcpy - fence and stream

Manish Sharma Mon, 24 May 2021 11:13:45 -0700

I am looking at the source for rte_memcpy (this question is about x86-64
only).


In one of the cases, when the buffers are suitably aligned, it uses:

/**
 * Copy 64 bytes from one location to another,
 * locations should not overlap.
 */
static __rte_always_inline void
rte_mov64(uint8_t *dst, const uint8_t *src)
{
        __m512i zmm0;

        zmm0 = _mm512_loadu_si512((const void *)src);
        _mm512_storeu_si512((void *)dst, zmm0);
}

I had some questions about this:

1. I don't see any use of x86 fence (rmb, wmb) instructions. Are they not
required in this case, and if not, why not?

2. Are _mm512_loadu_si512 and _mm512_storeu_si512 non-temporal?

3. Why isn't the code using the stream variants, _mm512_stream_load_si512 and
friends? They would not pollute the cache, so they should be better, unless
the required fence instructions cause a drop in performance?

4. Does _mm512_stream_load_si512 need fence instructions? Based on my
reading of the spec, the answer is yes, but I wanted to confirm.
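
To make questions 3 and 4 concrete, here is a minimal sketch of the kind of
streaming copy I have in mind. It is not DPDK code: stream_copy16 is a
hypothetical helper, and it uses the 128-bit SSE2 intrinsics
(_mm_stream_si128, _mm_sfence) rather than the AVX-512 ones so that it builds
on any x86-64 machine. It assumes dst is 16-byte aligned and n is a multiple
of 16. The trailing sfence reflects my understanding that non-temporal stores
are weakly ordered and must be fenced before later stores that publish the
data.

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of DPDK: copy n bytes with non-temporal
 * (streaming) stores.
 * Assumptions: dst is 16-byte aligned, n is a multiple of 16,
 * and the two buffers do not overlap.
 */
static inline void
stream_copy16(uint8_t *dst, const uint8_t *src, size_t n)
{
        size_t i;

        for (i = 0; i < n; i += 16) {
                /* ordinary (cached) unaligned load */
                __m128i v = _mm_loadu_si128((const __m128i *)(src + i));

                /* non-temporal store: written with a cache-bypass hint */
                _mm_stream_si128((__m128i *)(dst + i), v);
        }

        /*
         * Streaming stores are weakly ordered; fence them so they become
         * globally visible before any store issued after this function.
         */
        _mm_sfence();
}
```

The fence is issued once per call rather than once per 16 bytes, so whether
it costs anything noticeable would depend on how small the copies are.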

TIA,
Manish
