> -----Original Message-----
> From: David Miller <da...@davemloft.net>
> Sent: Thursday, November 8, 2018 8:08 AM
> To: Ioana Ciocoi Radulescu <ruxandra.radule...@nxp.com>
> Cc: netdev@vger.kernel.org; Ioana Ciornei <ioana.cior...@nxp.com>
> Subject: Re: [PATCH net-next] dpaa2-eth: Introduce TX congestion management
>
> From: Ioana Ciocoi Radulescu <ruxandra.radule...@nxp.com>
> Date: Wed, 7 Nov 2018 10:31:16 +0000
>
> > We chose this mechanism over BQL (to which it is conceptually
> > very similar) because a) we can take advantage of the hardware
> > offloading and b) BQL doesn't match well with our driver fastpath
> > (we process ingress (Rx or Tx conf) frames in batches of up to 16,
> > which in certain scenarios confuses the BQL adaptive algorithm,
> > resulting in too low values of the limit and low performance).
>
> First, this kind of explanation belongs in the commit message.
>
> Second, you'll have to describe better what BQL, which is the
> ultimate standard mechanism for every single driver in the
> kernel to deal with this issue.
>
> Are you saying that if 15 TX frames are pending, not TX interrupt
> will arrive at all?
I meant up to 16, not exactly 16.

>
> There absolutely must be some timeout or similar interrupt that gets
> sent in that kind of situation. You cannot leave stale TX packets
> on your ring unprocessed just because a non-multiple of 16 packets
> were queued up and then TX activity stopped.

I'll try to detail a bit how our hardware works, since it doesn't use the standard ring-based architecture.

In our driver implementation, we have multiple ingress queues (for Rx frames and Tx confirmation frames) grouped into core-affine channels. Each channel may contain one or more queues of each type. When queues inside a channel have available frames, the hardware triggers a notification for us; we then issue a dequeue command for that channel, in response to which hardware will write a number of frames (between 1 and 16) at a memory location of our choice. Frames obtained as a result of one dequeue operation are all from the same queue, but consecutive dequeues on the same channel may service different queues.

So my initial attempt at implementing BQL called netdev_tx_completed_queue() for each batch of Tx confirmation frames obtained from a single dequeue operation. I don't fully understand the BQL algorithm yet, but I think it had a problem with the fact that we never reported more than 16 completed frames at a time.

Consider a scenario where a DPAA2 platform does IP forwarding from port1 to port2; to keep things simple, let's say we're single core and have only one Rx and one Tx confirmation queue per interface. Without BQL, the egress interface can transmit up to 64 frames at a time (one for each Rx frame processed by the ingress interface in its NAPI poll) and then process all 64 confirmations (grouped in 4 dequeues) during its own NAPI cycle. With BQL support added, the number of consecutive transmitted frames is never higher than 33; after that, BQL stops the netdev tx queue until we mark some of those transmits as completed. This results in a performance drop of over 30%.

Today I tried to further coalesce the confirmation frames such that I call netdev_tx_completed_queue() only at the end of the NAPI poll, once for each confirmation queue that was serviced during that NAPI (a rough sketch of what I mean is below my signature). I need to do more testing, but so far it performs *almost* on par with the non-BQL driver version. But it does complicate the fastpath code and feels somewhat unnatural.

So I would still prefer using the hardware mechanism, which I think fits more smoothly with both our hardware and driver implementation, and adds less overhead. But if you feel strongly about it I can drop this patch and polish the BQL support until it looks decent enough (and hopefully doesn't mess too much with our performance benchmarks).

Thanks,
Ioana
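
P.S. In case it helps clarify the coalescing idea, here is a rough sketch of it. The structure and field names below are simplified placeholders rather than the actual driver code; only netdev_get_tx_queue() and netdev_tx_completed_queue() are real kernel APIs, and the Tx path is assumed to keep calling netdev_tx_sent_queue() per frame as usual.

#include <linux/netdevice.h>

/*
 * Placeholder per-queue / per-channel state, just enough to show the idea;
 * not the real dpaa2-eth structures.
 */
struct conf_fq {
	u16 flowid;			/* index of the matching netdev Tx queue */
	unsigned int cleaned_pkts;	/* Tx conf frames seen this NAPI poll */
	unsigned int cleaned_bytes;
};

struct conf_channel {
	struct conf_fq *serviced[8];	/* conf queues hit during this poll */
	int num_serviced;
};

/* Called once per dequeue batch (1..16 Tx conf frames): only accumulate */
static void tx_conf_count(struct conf_fq *fq, unsigned int pkts,
			  unsigned int bytes)
{
	fq->cleaned_pkts += pkts;
	fq->cleaned_bytes += bytes;
}

/* Called once at the end of the NAPI poll: one BQL update per conf queue */
static void tx_conf_flush_bql(struct net_device *net_dev,
			      struct conf_channel *ch)
{
	int i;

	for (i = 0; i < ch->num_serviced; i++) {
		struct conf_fq *fq = ch->serviced[i];
		struct netdev_queue *nq;

		if (!fq->cleaned_pkts)
			continue;

		nq = netdev_get_tx_queue(net_dev, fq->flowid);
		netdev_tx_completed_queue(nq, fq->cleaned_pkts,
					  fq->cleaned_bytes);
		fq->cleaned_pkts = 0;
		fq->cleaned_bytes = 0;
	}
	ch->num_serviced = 0;
}

This way the BQL state is touched once per serviced confirmation queue per NAPI poll instead of once per 16-frame dequeue batch, which seems to be what keeps the limit from collapsing.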