On 2024-05-28 10:27, Bruce Richardson wrote:
On Tue, May 28, 2024 at 10:19:15AM +0200, Mattias Rönnblom wrote:
On 2024-05-28 09:43, Mattias Rönnblom wrote:
Provide build option to have functions in <rte_memcpy.h> delegate to
the standard compiler/libc memcpy(), instead of using the various
traditional, handcrafted, per-architecture rte_memcpy()
implementations.
A new meson build option 'use_cc_memcpy' is added. The default is
true. It's not obvious what should be the default, but compiler
memcpy() is enabled by default in this RFC so any tests run with this
patch use the new approach.
One purpose of this RFC is to make it easy to evaluate the costs and
benefits of a switch.
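For concreteness, here is a minimal sketch (not the actual patch) of what delegating rte_memcpy() to the compiler/libc memcpy() could look like. RTE_USE_CC_MEMCPY is a hypothetical macro name standing in for whatever the 'use_cc_memcpy' meson option would define; it is defined inline here only so the snippet is self-contained.

```c
#include <string.h>

/* Normally this would come from the build system when the
 * 'use_cc_memcpy' meson option is enabled; defined here only
 * to make the sketch self-contained. */
#define RTE_USE_CC_MEMCPY 1

#ifdef RTE_USE_CC_MEMCPY
static inline void *
rte_memcpy(void *dst, const void *src, size_t n)
{
	/* Let the compiler decide: for constant sizes it typically
	 * inlines the copy; otherwise it calls libc memcpy(). */
	return memcpy(dst, src, n);
}
#endif
```

The point of routing through a macro rather than deleting the header is that the traditional per-architecture implementations can remain selectable while the two approaches are being compared.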
I've tested this patch some with DSW micro benchmarks, and the result is a
2.5% reduction of the DSW+testapp overhead with cc/libc memcpy. GCC 11.4.
We've also run the characteristic test suite of a large, real-world app. Here,
we saw no effect. GCC 10.5.
x86_64 in both cases (Skylake and Raptor Lake).
Last time we did the same, there was a noticeable performance degradation
in both of the above cases.
This is not a lot of data points, but I think we should consider making
the custom rte_memcpy() implementations optional in the next release and, if
no-one complains, removing the implementations in the release after that.
(Whether or not [or how long] to keep the wrapper API is another question.)
<snip>
The other instance I've heard mention of in the past is virtio/vhost, which
used to have a speedup from the custom memcpy.
My own thinking on these cases is that, for targeted settings like these,
we should look to have local memcpy functions written, taking account of
the specifics of each use case. For virtio/vhost, for example, we can make
assumptions about host buffer alignment, and we can also be pretty
confident we are copying to another CPU. For DSW, or other eventdev cases,
we would only be looking at copies of multiples of 16, with guaranteed
8-byte alignment on both source and destination. Writing efficient copy fns
for specific scenarios can be faster and more effective than trying to
write a general, optimized-in-all-cases, memcpy. It also discourages the
use of non-libc memcpy except where really necessary.

In such cases, you should first try to tell the compiler that it's safe
to assume that the pointers have a certain alignment.

void copy256(void *dst, const void *src)
{
	memcpy(dst, src, 256);
}

void copy256_a(void *dst, const void *src)
{
	void *dst_a = __builtin_assume_aligned(dst, 32);
	const void *src_a = __builtin_assume_aligned(src, 32);

	memcpy(dst_a, src_a, 256);
}

The first will generate loads/stores without alignment restrictions,
while the latter will use things like vmovdqa or vmovaps.
(I doubt there's much of a performance difference, though, if any at all.)
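As a sketch of the kind of specialized copy function this would mean for the eventdev-style case (length a multiple of 16, both pointers 8-byte aligned), here is one possible shape; the function name is illustrative, not part of DPDK, and this is only one way to encode the constraints:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical specialized copy: caller guarantees n is a multiple
 * of 16 and that both dst and src are 8-byte aligned. */
static inline void
copy_multiple16_a8(void *dst, const void *src, size_t n)
{
	/* Encode the alignment guarantee so the compiler may use
	 * aligned loads/stores when vectorizing. */
	unsigned char *d = __builtin_assume_aligned(dst, 8);
	const unsigned char *s = __builtin_assume_aligned(src, 8);
	size_t i;

	/* Copy in fixed 16-byte chunks; each memcpy has a constant
	 * size, which the compiler expands inline. */
	for (i = 0; i < n; i += 16)
		memcpy(d + i, s + i, 16);
}
```

Whether something like this actually beats a plain memcpy(dst, src, n) call is exactly the kind of question the benchmarking discussed above would need to answer.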
Naturally, if we find there are a lot of cases where use of libc memcpy
slows us down, we will want to keep a general rte_memcpy. However, I'd hope
the slowdown cases are very few.
/Bruce