I am looking at the source for rte_memcpy (this discussion is about x86-64 only).
For one of the cases, when aligned correctly, it uses:

    /**
     * Copy 64 bytes from one location to another,
     * locations should not overlap.
     */
    static __rte_always_inline void
    rte_mov64(uint8_t *dst, const uint8_t *src)
    {
        __m512i zmm0;

        zmm0 = _mm512_loadu_si512((const void *)src);
        _mm512_storeu_si512((void *)dst, zmm0);
    }

I have some questions about this:

1. I don't see any use of x86 fence instructions (rmb, wmb). Are they not required in this case, and if not, why not?
2. Are _mm512_loadu_si512 and _mm512_storeu_si512 non-temporal?
3. Why isn't the code using the stream variants, _mm512_stream_load_si512 and friends? They would not pollute the cache, so they should be better, unless the required fence instructions cause a drop in performance?
4. Does _mm512_stream_load_si512 need fence instructions? Based on my reading of the spec, the answer is yes, but I wanted to confirm.

TIA,
Manish