On 08/24/2017 04:19 AM, Tiwei Bie wrote:
This patch adaptively batches the small guest memory copies. By batching the small copies, the efficiency of executing the memory LOAD instructions can be improved greatly, because the memory LOAD latency can be effectively hidden by the CPU pipeline. We saw significant performance boosts in PVP tests with small packets.

This patch improves the performance for small packets and distinguishes packets by size, so although the performance for big packets doesn't change, it also makes it relatively easy to apply special optimizations to big packets later.
Do you mean that if we batched unconditionally, whatever the size, we would see a performance drop for larger (>256 B) packets?

My other question is about indirect descriptors. My understanding of the patch is that the number of batched copies is limited to the queue size. In theory, we could have more than that with indirect descriptors (first indirect desc for the vnet header, second one for the packet). So in the worst case, the first small copies would be batched, but not the last ones if there are more than queue-size copies. I think it still works, but I'd like your confirmation.
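For readers following the discussion, here is a minimal sketch (not the actual patch code; all names are hypothetical) of the deferred-copy idea being reviewed: small copies are recorded in a fixed-size array bounded by the queue size and replayed in one tight loop, while large copies are performed immediately:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical illustration of adaptive copy batching: instead of issuing
 * each small memcpy() immediately, record it in a fixed-size array (bounded
 * by the virtqueue size, as discussed above) and replay all copies in one
 * loop, so the CPU pipeline can overlap the memory loads. */

#define VQ_SIZE 256        /* batch bound: one entry per queue slot */
#define SMALL_COPY_MAX 256 /* assumed threshold; larger copies go direct */

struct batch_copy_elem {
    void *dst;
    const void *src;
    size_t len;
};

struct batch_copy_queue {
    struct batch_copy_elem elems[VQ_SIZE];
    unsigned int count;
};

/* Replay all deferred small copies in one tight loop. */
static void batch_copy_flush(struct batch_copy_queue *q)
{
    for (unsigned int i = 0; i < q->count; i++)
        memcpy(q->elems[i].dst, q->elems[i].src, q->elems[i].len);
    q->count = 0;
}

/* Defer a copy if it is small; do big copies immediately. */
static void copy_adaptive(struct batch_copy_queue *q,
                          void *dst, const void *src, size_t len)
{
    if (len > SMALL_COPY_MAX) {
        memcpy(dst, src, len);   /* big packet: copy right away */
        return;
    }
    if (q->count == VQ_SIZE)     /* batch full: flush before queuing more */
        batch_copy_flush(q);
    q->elems[q->count++] = (struct batch_copy_elem){ dst, src, len };
}
```

This also makes the reviewer's point concrete: with indirect descriptors producing more than VQ_SIZE small copies per burst, the early copies would be flushed mid-burst while the tail is still batched, which is functionally correct but loses some of the batching benefit.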
Signed-off-by: Tiwei Bie <tiwei....@intel.com>
Signed-off-by: Zhihong Wang <zhihong.w...@intel.com>
Signed-off-by: Zhiyong Yang <zhiyong.y...@intel.com>
---
This optimization depends on the CPU's internal pipeline design, so further tests (e.g. on ARM) from the community are appreciated.
Agreed. I think it is important to have this tested at least on ARM platforms, to ensure it doesn't introduce a regression there. Adding Santosh, Jerin & Hemant in Cc, who might know who could run the test.
 lib/librte_vhost/vhost.c      |   2 +-
 lib/librte_vhost/vhost.h      |  13 +++
 lib/librte_vhost/vhost_user.c |  12 +++
 lib/librte_vhost/virtio_net.c | 240 ++++++++++++++++++++++++++++++++----------
 4 files changed, 209 insertions(+), 58 deletions(-)