On Wed, Oct 31, 2018 at 4:30 AM Tariq Toukan <tar...@mellanox.com> wrote:
>
> On 30/10/2018 1:25 AM, Eric Dumazet wrote:
> > When qdisc_run() tries to use the BQL budget to bulk-dequeue a batch
> > of packets, GSO can later transform this list into another list
> > of skbs, and each skb is sent to the device ndo_start_xmit(),
> > one at a time, with skb->xmit_more set on all but the last skb.
> >
> > The problem is that, very often, the BQL limit is hit in the middle of
> > the packet train, forcing dev_hard_start_xmit() to stop the
> > bulk send and requeue the tail of the list.
> >
> > BQL's role is to avoid head-of-line blocking, making sure
> > a qdisc can deliver high-priority packets before low-priority ones.
> >
> > But requeued packets can never be bypassed by fresh
> > packets in the qdisc.
> >
> > Aborting the bulk send increases TX softirqs, and hot cache
> > lines (after skb_segment()) are wasted.
> >
> > Note that for TSO packets, we never split a packet in the middle
> > because the BQL limit was hit.
> >
> > Drivers should be able to update BQL counters without
> > flipping/caring about BQL status, if the current skb
> > has xmit_more set.
> >
> > Upper layers are ultimately responsible for not sending another
> > packet train when the BQL limit is hit.
> >
> > A code template in a driver might look like the following:
> >
> >         if (skb->xmit_more) {
> >                 netdev_tx_sent_queue_more(tx_queue, nr_bytes);
> >                 send_doorbell = netif_tx_queue_stopped(tx_queue);
> >         } else {
> >                 netdev_tx_sent_queue(tx_queue, nr_bytes);
> >                 send_doorbell = true;
> >         }
>
> Hi Eric,
> Nice optimization.
>
> I thought of another way of implementing it: just extend the
> existing netdev_tx_sent_queue() function with a new xmit_more parameter
> that the driver passes.
> Something like this:
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index d837dad24b4c..feb9cbcb5759 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3129,12 +3129,12 @@ static inline void netdev_txq_bql_complete_prefetchw(struct netdev_queue *dev_qu
>  }
>
>  static inline void netdev_tx_sent_queue(struct netdev_queue *dev_queue,
> -                                        unsigned int bytes)
> +                                        unsigned int bytes, bool more)
>  {
>  #ifdef CONFIG_BQL
>         dql_queued(&dev_queue->dql, bytes);
>
> -       if (likely(dql_avail(&dev_queue->dql) >= 0))
> +       if (more || likely(dql_avail(&dev_queue->dql) >= 0))
>                 return;
>
>         set_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state);
>
> This unifies and simplifies both the stack and driver code, as the new
> suggested function netdev_tx_sent_queue_more() becomes a special case
> of the existing one.
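For concreteness, a driver call site under this extended signature would
collapse the two-branch template above to roughly the following sketch
(assuming the same driver-local placeholder names -- tx_queue, nr_bytes,
send_doorbell -- as in the template):

        /* Account bytes with BQL; with the extra parameter, XOFF is
         * never set while the skb has xmit_more set.
         */
        netdev_tx_sent_queue(tx_queue, nr_bytes, skb->xmit_more);
        /* Ring the doorbell on the last skb of the train, or earlier
         * if the queue has already been stopped.
         */
        send_doorbell = !skb->xmit_more || netif_tx_queue_stopped(tx_queue);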
Yes, I thought of this, but it is too risky/invasive, and I prefer adding
an opt-in mechanism so that patch authors can test their patches.

Then, when/if all drivers are updated, we can remove the legacy stuff
or rename everything.

mlx4 was the only NIC I could reasonably test myself.

Another suggestion from Willem is to add the following helper,
returning a boolean (doorbell needed):

+static inline bool __netdev_tx_sent_queue(struct netdev_queue *dev_queue,
+                                          unsigned int bytes, bool xmit_more)
+{
+        if (xmit_more) {
+                netdev_tx_sent_queue_more(dev_queue, bytes);
+                return netif_tx_queue_stopped(dev_queue);
+        } else {
+                netdev_tx_sent_queue(dev_queue, bytes);
+                return true;
+        }
+}
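A minimal sketch of how a driver's xmit path could consume this helper
(netdev_get_tx_queue() and skb->xmit_more are the existing kernel
interfaces; nr_bytes, queue_index, ring and driver_ring_doorbell() are
hypothetical driver-local names used only for illustration):

+        /* Account the bytes with BQL; the helper reports whether this
+         * skb must be followed by a doorbell to the NIC.
+         */
+        if (__netdev_tx_sent_queue(netdev_get_tx_queue(dev, queue_index),
+                                   nr_bytes, skb->xmit_more))
+                driver_ring_doorbell(ring); /* hypothetical doorbell write */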