On Mon, 2019-03-11 at 16:14 +0800, Ruifeng Wang wrote:
> ----------------------------------------------------------------------
> Improved MAC swap performance for ARM platform.
> The improvement was achieved by using neon intrinsics
> to save CPU cycles and doing swap for four packets
> at a time.
> The optimization had 15% - 20% throughput boost
> in testpmd MAC swap mode.
>
> Signed-off-by: Ruifeng Wang <ruifeng.w...@arm.com>
> Reviewed-by: Gavin Hu <gavin...@arm.com>
> Reviewed-by: Phil Yang <phil.y...@arm.com>
> ---
>  app/test-pmd/macswap.c      |  4 +-
>  app/test-pmd/macswap_neon.h | 93 +++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 96 insertions(+), 1 deletion(-)
>  create mode 100644 app/test-pmd/macswap_neon.h
>
> diff --git a/app/test-pmd/macswap.c b/app/test-pmd/macswap.c
>
> +static inline void
> +do_macswap(struct rte_mbuf *pkts[], uint16_t nb,
> +		struct rte_port *txp)
> +{
> +	struct ether_hdr *eth_hdr[4];
> +	struct rte_mbuf *mb[4];
> +	uint64_t ol_flags;
> +	int i;
> +	int r;
> +	uint8x16_t v0, v1, v2, v3;
> +	/**
> +	 * Index map be used to shuffle the 16 bytes.
> +	 * byte 0-5 will be swapped with byte 6-11.
> +	 * byte 12-15 will keep unchanged.
> +	 */
> +	uint8x16_t idx_map = {6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5,
> +				12, 13, 14, 15};

Nit: I think we can make it "const uint8x16_t idx_map".
Other than that, it looks good to me.

Regarding the performance, I have tested with two SoCs:

octeontx: +13% improvement
octeontx2: +46% improvement

Acked-by: Jerin Jacob <jer...@marvell.com>
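
For anyone following the thread without an ARM box at hand, here is a rough scalar sketch of what the quoted index map does to the first 16 bytes of a frame. It is only an illustration, not the code from the patch: the helper name macswap_shuffle16 is made up, and the loop stands in for the single 16-byte table lookup that the NEON version presumably performs on each packet (the quoted hunk does not show that part). The map is declared const, as suggested in the nit above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Same values as idx_map in the quoted patch, declared const.
 * Output byte i is taken from input byte idx_map[i].
 */
static const uint8_t idx_map[16] = {6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5,
				    12, 13, 14, 15};

/*
 * Hypothetical helper, not part of the patch: apply the index map to
 * the first 16 bytes of an Ethernet frame, in place.
 * Bytes 0-5 (destination MAC) and bytes 6-11 (source MAC) trade
 * places; bytes 12-15 are copied through unchanged.
 */
static inline void
macswap_shuffle16(uint8_t *hdr)
{
	uint8_t out[16];
	int i;

	assert(hdr != NULL);
	/* Scalar stand-in for a 16-byte vector table lookup. */
	for (i = 0; i < 16; i++)
		out[i] = hdr[idx_map[i]];
	memcpy(hdr, out, sizeof(out));
}
```

Calling macswap_shuffle16() on a buffer that starts with the destination MAC followed by the source MAC leaves the two addresses exchanged and the following four bytes intact, which matches the comment in the quoted hunk.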