[PATCH] net/vhost: report TX errors in port stats

2023-09-29 Thread Andrey Ignatov
vhost device doesn't report TX errors what complicates debugging of
dropped packets. Add oerrors to port stats.

- before (testpmd `show port stats`):
  TX-packets: 18328512   TX-errors: 0  TX-bytes:  1173024768

- after:
  TX-packets: 1737728TX-errors: 131367616  TX-bytes:  111214592

Signed-off-by: Andrey Ignatov 
---
 drivers/net/vhost/rte_eth_vhost.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/vhost/rte_eth_vhost.c 
b/drivers/net/vhost/rte_eth_vhost.c
index 8d37ec9775..48a9a79efe 100644
--- a/drivers/net/vhost/rte_eth_vhost.c
+++ b/drivers/net/vhost/rte_eth_vhost.c
@@ -1299,6 +1299,7 @@ eth_stats_get(struct rte_eth_dev *dev, struct 
rte_eth_stats *stats)
unsigned i;
unsigned long rx_total = 0, tx_total = 0;
unsigned long rx_total_bytes = 0, tx_total_bytes = 0;
+   unsigned long tx_total_errors = 0;
struct vhost_queue *vq;
 
for (i = 0; i < RTE_ETHDEV_QUEUE_STAT_CNTRS &&
@@ -1323,12 +1324,15 @@ eth_stats_get(struct rte_eth_dev *dev, struct 
rte_eth_stats *stats)
 
stats->q_obytes[i] = vq->stats.bytes;
tx_total_bytes += stats->q_obytes[i];
+
+   tx_total_errors += vq->stats.missed_pkts;
}
 
stats->ipackets = rx_total;
stats->opackets = tx_total;
stats->ibytes = rx_total_bytes;
stats->obytes = tx_total_bytes;
+   stats->oerrors = tx_total_errors;
 
return 0;
 }
-- 
2.37.3



[PATCH] vhost: optimize mbuf allocation in virtio Tx packed path

2024-03-28 Thread Andrey Ignatov
Currently virtio_dev_tx_packed() always allocates requested @count of
packets, no matter how many packets are really available on the virtio
Tx ring. Later it has to free all packets it didn't use and if, for
example, there were zero available packets on the ring, then all @count
mbufs would be allocated just to be freed afterwards.

This wastes CPU cycles since rte_pktmbuf_alloc_bulk() /
rte_pktmbuf_free_bulk() do quite a lot of work.

Optimize it by using the same idea as the virtio_dev_tx_split() uses on
the Tx split path: estimate the number of available entries on the ring
and allocate only that number of mbufs.

On the split path it's pretty easy to estimate.

On the packed path it's more work since it requires checking flags for
up to @count of descriptors. Still it's much less expensive than the
alloc/free pair.

The new get_nb_avail_entries_packed() function doesn't change how
virtio_dev_tx_packed() works with regard to memory barriers since the
barrier between checking flags and other descriptor fields is still in
place later in virtio_dev_tx_batch_packed() and
virtio_dev_tx_single_packed().

The difference for a guest transmitting ~17Gbps with MTU 1500 on a `perf
record` / `perf report` (on lower pps the savings will be bigger):

* Before the change:

Samples: 18K of event 'cycles:P', Event count (approx.): 19206831288
  Children  Self  Pid:Command
-  100.00%   100.00%   798808:dpdk-worker1
<... skip ...>
- 99.09% pkt_burst_io_forward
   - 90.26% common_fwd_stream_receive
  - 90.04% rte_eth_rx_burst
 - 75.53% eth_vhost_rx
- 74.29% rte_vhost_dequeue_burst
   - 71.48% virtio_dev_tx_packed_compliant
  + 17.11% rte_pktmbuf_alloc_bulk
  + 11.80% rte_pktmbuf_free_bulk
  + 2.11% vhost_user_inject_irq
0.75% rte_pktmbuf_reset
0.53% __rte_pktmbuf_free_seg_via_array
 0.88% vhost_queue_stats_update
 + 13.66% mlx5_rx_burst_vec
   + 8.69% common_fwd_stream_transmit

* After:

Samples: 18K of event 'cycles:P', Event count (approx.): 19225310840
  Children  Self  Pid:Command
-  100.00%   100.00%   859754:dpdk-worker1
<... skip ...>
- 98.61% pkt_burst_io_forward
   - 86.29% common_fwd_stream_receive
  - 85.84% rte_eth_rx_burst
 - 61.94% eth_vhost_rx
- 60.05% rte_vhost_dequeue_burst
   - 55.98% virtio_dev_tx_packed_compliant
  + 3.43% rte_pktmbuf_alloc_bulk
  + 2.50% vhost_user_inject_irq
 1.17% vhost_queue_stats_update
 0.76% rte_rwlock_read_unlock
 0.54% rte_rwlock_read_trylock
 + 22.21% mlx5_rx_burst_vec
   + 12.00% common_fwd_stream_transmit

It can be seen that virtio_dev_tx_packed_compliant() goes from 71.48% to
55.98% with rte_pktmbuf_alloc_bulk() going from 17.11% to 3.43% and
rte_pktmbuf_free_bulk() going away completely.

Signed-off-by: Andrey Ignatov 
---
 lib/vhost/virtio_net.c | 33 +
 1 file changed, 33 insertions(+)

diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c
index 1359c5fb1f..b406b5d7d9 100644
--- a/lib/vhost/virtio_net.c
+++ b/lib/vhost/virtio_net.c
@@ -3484,6 +3484,35 @@ virtio_dev_tx_single_packed(struct virtio_net *dev,
return ret;
 }
 
+static __rte_always_inline uint16_t
+get_nb_avail_entries_packed(const struct vhost_virtqueue *__rte_restrict vq,
+   uint16_t max_nb_avail_entries)
+{
+   const struct vring_packed_desc *descs = vq->desc_packed;
+   bool avail_wrap = vq->avail_wrap_counter;
+   uint16_t avail_idx = vq->last_avail_idx;
+   uint16_t nb_avail_entries = 0;
+   uint16_t flags;
+
+   while (nb_avail_entries < max_nb_avail_entries) {
+   flags = descs[avail_idx].flags;
+
+   if ((avail_wrap != !!(flags & VRING_DESC_F_AVAIL)) ||
+   (avail_wrap == !!(flags & VRING_DESC_F_USED)))
+   return nb_avail_entries;
+
+   if (!(flags & VRING_DESC_F_NEXT))
+   ++nb_avail_entries;
+
+   if (unlikely(++avail_idx >= vq->size)) {
+   avail_idx -= vq->size;
+   avail_wrap = !avail_wrap;
+   }
+   }
+
+   return nb_avail_entries;
+}
+
 __rte_always_inline
 static uint16_t
 virti

Re: [PATCH] vhost: optimize mbuf allocation in virtio Tx packed path

2024-03-28 Thread Andrey Ignatov
On Thu, Mar 28, 2024 at 04:44:26PM -0700, Stephen Hemminger wrote:
> On Thu, 28 Mar 2024 16:33:38 -0700
> Andrey Ignatov  wrote:
> 
> >  
> > +static __rte_always_inline uint16_t
> > +get_nb_avail_entries_packed(const struct vhost_virtqueue *__rte_restrict 
> > vq,
> > +   uint16_t max_nb_avail_entries)
> > +{
> 
> You don't need always inline, the compiler will do it anyway.

I can remove it in v2, but it's not completely obvious to me how is it
decided when to specify it explicitly and when not?

I see plenty of __rte_always_inline in this file:

% git grep -c '^static __rte_always_inline' lib/vhost/virtio_net.c
lib/vhost/virtio_net.c:66