Interleaving the SSE SIMD loads, shuffles, and stores improves the overall
MAC-swap Mpps for both RX and TX.

Test Result:
 * Platform: AMD EPYC 9554 @3.1GHz, no boost
 * Test scenarios: TEST-PMD 64B IO vs MAC-SWAP
 * NIC: broadcom P2100: loopback 2*100Gbps

 <mode : Mpps Ingress: Mpps Egress>
 ------------------------------------------------
 - MAC-SWAP original: 45.75 : 43.8
 - MAC-SWAP register mod: 45.73 : 44.83
 - MAC-SWAP register+ofl mod: 46.36 : 44.79
 - MAC-SWAP register+ofl+interleave mod: 46.0 : 45.1

Signed-off-by: Vipin Varghese <vipin.vargh...@amd.com>
---
 app/test-pmd/macswap_sse.h | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/app/test-pmd/macswap_sse.h b/app/test-pmd/macswap_sse.h
index 67ff7fdfbb..1f547388b7 100644
--- a/app/test-pmd/macswap_sse.h
+++ b/app/test-pmd/macswap_sse.h
@@ -52,23 +52,25 @@ do_macswap(struct rte_mbuf *pkts[], uint16_t nb,
 		addr1 = _mm_loadu_si128((__m128i *)eth_hdr[1]);
 		mbuf_field_set(mb[1], ol_flags);
 
+		addr0 = _mm_shuffle_epi8(addr0, shfl_msk);
+
 		mb[2] = pkts[i++];
 		eth_hdr[2] = rte_pktmbuf_mtod(mb[2], struct rte_ether_hdr *);
 		addr2 = _mm_loadu_si128((__m128i *)eth_hdr[2]);
 		mbuf_field_set(mb[2], ol_flags);
 
+		addr1 = _mm_shuffle_epi8(addr1, shfl_msk);
+		_mm_storeu_si128((__m128i *)eth_hdr[0], addr0);
+
 		mb[3] = pkts[i++];
 		eth_hdr[3] = rte_pktmbuf_mtod(mb[3], struct rte_ether_hdr *);
 		addr3 = _mm_loadu_si128((__m128i *)eth_hdr[3]);
 		mbuf_field_set(mb[3], ol_flags);
 
-		addr0 = _mm_shuffle_epi8(addr0, shfl_msk);
-		addr1 = _mm_shuffle_epi8(addr1, shfl_msk);
 		addr2 = _mm_shuffle_epi8(addr2, shfl_msk);
-		addr3 = _mm_shuffle_epi8(addr3, shfl_msk);
-
-		_mm_storeu_si128((__m128i *)eth_hdr[0], addr0);
 		_mm_storeu_si128((__m128i *)eth_hdr[1], addr1);
+
+		addr3 = _mm_shuffle_epi8(addr3, shfl_msk);
 		_mm_storeu_si128((__m128i *)eth_hdr[2], addr2);
 		_mm_storeu_si128((__m128i *)eth_hdr[3], addr3);
 
-- 
2.34.1
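
For readers unfamiliar with the shuffle step, here is a minimal scalar sketch of what each load/shuffle/store triple in the patch does to one packet: the first 16 bytes of the Ethernet header are read, the two 6-byte MAC addresses are exchanged, and the remaining 4 bytes are written back unchanged. The mask layout is an assumption (the patch does not show the definition of shfl_msk), and the helpers shuffle16() and macswap_model() are hypothetical names used only for illustration. The model covers only mask indices 0..15, not the zeroing behaviour of _mm_shuffle_epi8 when bit 7 of a mask byte is set.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Assumed mask layout: byte i of the result is taken from byte
 * swap_mask[i] of the input. Bytes 0..5 (destination MAC) and 6..11
 * (source MAC) trade places; bytes 12..15 stay where they are.
 */
static const uint8_t swap_mask[16] = {
	6, 7, 8, 9, 10, 11,	/* source MAC moves to the front */
	0, 1, 2, 3, 4, 5,	/* destination MAC moves behind it */
	12, 13, 14, 15		/* EtherType and next 2 bytes untouched */
};

/* Hypothetical scalar model of a 16-byte shuffle, indices 0..15 only. */
static void
shuffle16(uint8_t out[16], const uint8_t in[16], const uint8_t mask[16])
{
	uint8_t tmp[16];	/* temporary so that out may alias in */
	int i;

	for (i = 0; i < 16; i++) {
		assert(mask[i] < 16);
		tmp[i] = in[mask[i]];
	}
	memcpy(out, tmp, sizeof(tmp));
}

/* Load, shuffle, and store for one packet header, done in place. */
static void
macswap_model(uint8_t *eth_hdr)
{
	shuffle16(eth_hdr, eth_hdr, swap_mask);
}
```

Each packet's triple touches only its own header and register, so the four triples are independent of one another. That independence is what lets the patch move the shuffle and store of one packet between the loads of the following packets without changing the result.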