Hi,
I must say: great work.
I have some small comments:
> +/**
> + * Macro for copying unaligned block from one location to another,
> + * 47 bytes leftover maximum,
> + * locations should not overlap.
> + * Requirements:
> + * - Store is aligned
> + * - Load offset is <offset>, which must be immediate value within [1, 15]
> + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
> + * - <dst>, <src>, <len> must be variables
> + * - __m128i <xmm0> ~ <xmm8> must be pre-defined
> + */
> +#define MOVEUNALIGNED_LEFT47(dst, src, len, offset) \
> +{ \
...
> +}
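Why not wrap the body in do { ... } while (0) or ({ ... })? A bare { ... }
block in a macro could have unpredictable side effects, e.g. the ';' written
after the macro call breaks an if/else around it.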
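
To illustrate what I mean, here is a minimal sketch. COPY_BARE, COPY_SAFE and
copy_if_nonempty() are made-up names for this example only, and the memcpy()
body is just a stand-in, since the real macro body is elided above:

```c
#include <stddef.h>
#include <string.h>

/* Bare-brace form, as in the patch (hypothetical stand-in body). */
#define COPY_BARE(dst, src, len)        \
{                                       \
    memcpy((dst), (src), (len));        \
}

/* do/while(0) form: expands to exactly one statement. */
#define COPY_SAFE(dst, src, len)        \
do {                                    \
    memcpy((dst), (src), (len));        \
} while (0)

/* Returns 1 if the copy was done, 0 if it was skipped. */
static int copy_if_nonempty(char *dst, const char *src, size_t len)
{
    if (len > 0)
        COPY_SAFE(dst, src, len);
    else
        return 0;
    /*
     * With COPY_BARE in place of COPY_SAFE, the ';' after the closing
     * '}' would terminate the if statement, and the 'else' above would
     * no longer compile.
     */
    return 1;
}
```

The ({ ... }) form avoids the same problem, but it is a GNU extension, so
do { ... } while (0) is the more portable of the two.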
Second:
Why do you completely replace
#define rte_memcpy(dst, src, n) \
({ (__builtin_constant_p(n)) ? \
memcpy((dst), (src), (n)) : \
rte_memcpy_func((dst), (src), (n)); })
with an inline rte_memcpy()? This construction can help the compiler decide
which version to use: the (static?) inline implementation or a call to the
external function.
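
A small sketch of how that dispatch behaves, assuming GCC or Clang
(__builtin_constant_p is a compiler builtin). demo_memcpy, demo_memcpy_func
and demo_func_calls are hypothetical names for illustration, not the DPDK
ones, and the counter exists only to make the chosen path visible:

```c
#include <stddef.h>
#include <string.h>

/* Counts how often the run-time path is taken (illustration only). */
static int demo_func_calls;

/* Hypothetical stand-in for the external, run-time-sized copy routine. */
static void *demo_memcpy_func(void *dst, const void *src, size_t n)
{
    demo_func_calls++;
    return memcpy(dst, src, n);
}

/*
 * If n is a compile-time constant, use plain memcpy() so the compiler
 * can expand it itself; otherwise call the out-of-line function.
 * __builtin_constant_p() does not evaluate its argument.
 */
#define demo_memcpy(dst, src, n)                \
    (__builtin_constant_p(n) ?                  \
        memcpy((dst), (src), (n)) :             \
        demo_memcpy_func((dst), (src), (n)))
```

With a literal size such as demo_memcpy(a, b, 4) the first branch is chosen
at compile time; with a size that is only known at run time the call goes to
demo_memcpy_func().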
Did you try declaring it 'extern inline'? It could help reduce compilation time.