On 05/31/2018 11:55 AM, Wang, Zhihong wrote:


-----Original Message-----
From: Maxime Coquelin [mailto:maxime.coque...@redhat.com]
Sent: Tuesday, May 29, 2018 5:45 PM
To: dev@dpdk.org; Bie, Tiwei <tiwei....@intel.com>; Wang, Zhihong
<zhihong.w...@intel.com>
Cc: Maxime Coquelin <maxime.coque...@redhat.com>
Subject: [PATCH v2 0/2] Vhost: unitfy receive paths

Hi,

This second version fixes the feature bit check in
rxvq_is_mergeable(), and remove "mergeable" from rx funcs
names. No difference is seen in the benchmarks

This series is preliminary work to ease the integration of
packed ring layout support. But even without packed ring
layout, the result is positive.

First patch unify both paths, and second one is a small
optimization to avoid copying batch_copy_nb_elems VQ field
to/from the stack.

With the series applied, I get modest performance gain for
both mergeable and non-mergeable casesi (, and the gain of
about 300 LoC is non negligible maintenance-wise.

Rx-mrg=off benchmarks:

+------------+-------+-------------+-------------+----------+
|    Run     |  PVP  | Guest->Host | Host->Guest | Loopback |
+------------+-------+-------------+-------------+----------+
| v18.05-rc5 | 14.47 |       16.64 |       17.57 |    13.15 |
| + series   | 14.87 |       16.86 |       17.70 |    13.30 |
+------------+-------+-------------+-------------+----------+

Rx-mrg=on benchmarks:

+------------+------+-------------+-------------+----------+
|    Run     | PVP  | Guest->Host | Host->Guest | Loopback |
+------------+------+-------------+-------------+----------+
| v18.05-rc5 | 9.38 |       13.78 |       16.70 |    12.79 |
| + series   | 9.38 |       13.80 |       17.49 |    13.36 |
+------------+------+-------------+-------------+----------+

Note: Even without my series, the guest->host benchmark with
mergeable buffers enabled looks suspicious as it should in
theory be alsmost identical as when Rx mergeable buffers are
disabled. To be investigated...

Maxime Coquelin (2):
   vhost: unify Rx mergeable and non-mergeable paths
   vhost: improve batched copies performance

  lib/librte_vhost/virtio_net.c | 376 +++++-------------------------------------
  1 file changed, 37 insertions(+), 339 deletions(-)


Acked-by: Zhihong Wang <zhihong.w...@intel.com>

Thanks Maxime! This is really great to see. ;) We probably need the
same improvement for Virtio-pmd.

Yes, probably. I'll have a look at it, or if you have time to look at
it, won't blame you! :)

One comment on Virtio/Vhost performance analysis: No matter what type
of traffic is used (PVP, or Txonly-Rxonly, Loopback...), we need to
be clear on who we're testing, and give the other part excessive CPU
resources, otherwise we'll be testing whoever the slowest.

Since this patch is for Vhost, I suggest to run N (e.g. N = 4) Virtio
threads on N cores, and the corresponding N Vhost threads on a single
core, to do performance comparison. Do you think this makes sense?

That's a valid point. I'll try this to get the bottleneck.
I'm in the process of setting up an automated test bench, it will help
running more and more test cases.

For Guest -> Host, in my test I see Rx-mrg=on has negative impact on
Virtio side, probably because Virtio touches something that's not
touched when Rx-mrg=off.

I get it now.
When mrg=off, we use simple_tx version whereas we use the full one when
mrg is off:

static int
virtio_dev_configure(struct rte_eth_dev *dev)
{
...
        hw->use_simple_rx = 1;
        hw->use_simple_tx = 1;

#if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM
        if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
                hw->use_simple_rx = 0;
                hw->use_simple_tx = 0;
        }
#endif
        if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
                hw->use_simple_rx = 0;
                hw->use_simple_tx = 0;
        }

        if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM |
                           DEV_RX_OFFLOAD_TCP_CKSUM))
                hw->use_simple_rx = 0;

        return 0;
}

I see two problems here:
1. There should be no reasons not to use simple_tx if mrg is on.
2. We should add test on whether rx and tx offloads have been
negotiated to not use simple versions if it has been.

Do you agree with that proposed changes?
I'll post a RFC for this.

Thanks,
Maxime

Thanks
-Zhihong

Reply via email to