On 2024-07-29 13:00, Morten Brørup wrote:
From: Mattias Rönnblom [mailto:mattias.ronnb...@ericsson.com]
Sent: Wednesday, 24 July 2024 09.54

Which packet mix was used for your tests? Synthetic IMIX, or some live data?


I used the same test as was being done when the performance regression was demonstrated (i.e., 2x testpmd with fixed packet size).

+/* The code generated by GCC (and to a lesser extent, clang) with just
+ * a straight memcpy() to copy packets is less than optimal on Intel
+ * P-cores, for small packets. Thus the need of this specialized
+ * memcpy() in builds where use_cc_memcpy is set to true.
+ */
+#if defined(RTE_USE_CC_MEMCPY) && defined(RTE_ARCH_X86_64)
+static __rte_always_inline void
+pktcpy(void *restrict in_dst, const void *restrict in_src, size_t len)
+{
+       void *dst = __builtin_assume_aligned(in_dst, 16);
+       const void *src = __builtin_assume_aligned(in_src, 16);
+
+       if (len <= 256) {
+               size_t left;
+
+               for (left = len; left >= 32; left -= 32) {
+                       memcpy(dst, src, 32);
+                       dst = RTE_PTR_ADD(dst, 32);
+                       src = RTE_PTR_ADD(src, 32);
+               }
+
+               memcpy(dst, src, left);
+       } else

Although the packets within a burst often have similar size, I'm not sure you 
can rely on the dynamic branch predictor here.


I agree that the pktcpy() routine will likely often suffer a size-related branch mispredict with real packet size mix. A benchmark with a real packet mix would be much better than the tests I've run.

This needs to be compared, of course, with the overhead imposed by conditionals included in other pktcpy() implementations.

Looking at the ethdev packet size counters at an ISP (at the core of their 
Layer 3 network), 71 % are 256 byte or larger [1].

For static branch prediction, I would consider > 256 more likely and swap the two 
branches, i.e. compare (len > 256) instead of (len <= 256).


OK, I'll add likely() instead, to make it more explicit.

But again: I don't know how the dynamic branch predictor behaves here. Perhaps 
my suggested change makes no difference.


I think it will, but it will be tiny. From what I understand, even when the branch prediction guessed correctly, one receive a slight benefit if the branch is not taken.

+               memcpy(dst, src, len);
+}

With or without suggested change,
Acked-by: Morten Brørup <m...@smartsharesystems.com>


[1]: Details (incl. one VLAN tag)
tx_size_64_packets            1,1 %
tx_size_65_to_127_packets    25,7 %
tx_size_128_to_255_packets    2,6 %
tx_size_256_to_511_packets    1,4 %
tx_size_512_to_1023_packets   1,7 %
tx_size_1024_to_1522_packets 67,6 %

Reply via email to