On Tue, Dec 06, 2016 at 08:31:06PM -0500, Zhihong Wang wrote:
> This patch optimizes rte_memcpy for well-aligned cases, where both
> the dst and src addresses are aligned to the maximum MOV width. It
> introduces a dedicated function called rte_memcpy_aligned to handle
> the aligned cases with a simplified instruction stream. The existing
> rte_memcpy is renamed to rte_memcpy_generic, and the selection
> between the two is done at the entry of rte_memcpy.
>
> The existing rte_memcpy covers the generic cases: it handles
> unaligned copies and makes stores aligned, and it even makes loads
> aligned for microarchitectures like Ivy Bridge. However, this
> alignment handling comes at a price: it adds extra load/store
> instructions, which can cause complications sometimes.
>
> Take DPDK vhost memcpy with the Mergeable Rx Buffer feature as an
> example: the copy is aligned and remote, and it is accompanied by a
> header write which is also remote. In this case the memcpy
> instruction stream should be simplified to cut the extra loads and
> stores, thereby reducing the chance of a pipeline stall from a full
> load/store buffer, so that the actual memcpy instructions are issued
> and the H/W prefetcher gets to work as early as possible.
>
> This patch was tested on Ivy Bridge, Haswell and Skylake. It provides
> up to 20% gain for Virtio/Vhost PVP traffic, with packet sizes
> ranging from 64 to 1500 bytes.
>
> The test can also be conducted without a NIC, by setting up loopback
> traffic between Virtio and Vhost. For example, modify the macro
> TXONLY_DEF_PACKET_LEN in testpmd.h to the requested packet size,
> rebuild and start testpmd in both host and guest, then run "start" on
> one side and "start tx_first 32" on the other.
>
> Signed-off-by: Zhihong Wang <zhihong.w...@intel.com>
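
For readers skimming the thread, here is a minimal sketch of the
entry-point dispatch described above. The mask value and the stub
declarations are illustrative assumptions, not necessarily the exact
code in the patch:

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative: 0x1F assumes a 32-byte (AVX) maximum MOV width;
     * an AVX512 build would use 0x3F, an SSE-only build 0x0F. */
    #define ALIGNMENT_MASK 0x1F

    /* The two copy paths the patch introduces/renames. */
    void *rte_memcpy_aligned(void *dst, const void *src, size_t n);
    void *rte_memcpy_generic(void *dst, const void *src, size_t n);

    static inline void *
    rte_memcpy(void *dst, const void *src, size_t n)
    {
        /* One branch at the entry: if both addresses are aligned to
         * the maximum MOV width, take the simplified stream. */
        if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
            return rte_memcpy_aligned(dst, src, n);

        return rte_memcpy_generic(dst, src, n);
    }

OR-ing the two pointers lets a single AND test both alignments at
once.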
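
And a toy illustration of the store-alignment work the generic path
has to pay for (plain byte loops stand in for the real vector MOVs,
and the 32-byte boundary is an assumption):

    #include <stddef.h>
    #include <stdint.h>

    static inline void *
    copy_with_aligned_stores(void *dst, const void *src, size_t n)
    {
        uint8_t *d = dst;
        const uint8_t *s = src;
        /* Bytes until dst reaches the next 32-byte boundary. */
        size_t head = (size_t)(-(uintptr_t)d & 0x1F);
        size_t i;

        if (head > n)
            head = n;
        /* Unaligned head: these are the extra loads/stores the
         * cover letter refers to. */
        for (i = 0; i < head; i++)
            d[i] = s[i];
        /* From here every store is 32-byte aligned; the real code
         * would use vector MOVs for this bulk loop. */
        for (; i < n; i++)
            d[i] = s[i];
        return dst;
    }

The aligned path can skip the head fix-up entirely, which is exactly
the saving the patch is after.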
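
For the loopback test, the macro change would look like the line
below (1500 is only an example size, and the exact location of
testpmd.h may vary across DPDK versions):

    /* app/test-pmd/testpmd.h -- packet size used by the txonly
     * forwarding mode; set it to the size under test, e.g.: */
    #define TXONLY_DEF_PACKET_LEN 1500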

Reviewed-by: Yuanhan Liu <yuanhan....@linux.intel.com>

	--yliu