On 04/18/2018 12:28 AM, Paolo Abeni wrote:
> Hi,
>
> let me revive this old thread...
>
> On Mon, 2018-03-26 at 11:16 -0700, John Fastabend wrote:
>> On 03/26/2018 10:30 AM, Cong Wang wrote:
>>> On Sat, Mar 24, 2018 at 10:25 PM, John Fastabend
>>> <john.fastab...@gmail.com> wrote:
>>>> After the qdisc lock was dropped in pfifo_fast we allow multiple
>>>> enqueue threads and dequeue threads to run in parallel. On the
>>>> enqueue side the skb bit ooo_okay is used to ensure all related
>>>> skbs are enqueued in order. On the dequeue side, though, there is
>>>> no similar logic. What we observe is that with fewer queues than
>>>> CPUs it is possible to re-order packets when two instances of
>>>> __qdisc_run() are running in parallel. Each thread will dequeue
>>>> an skb, and whichever thread calls the ndo op first gets its skb
>>>> sent on the wire first. This doesn't typically happen because
>>>> qdisc_run() is usually triggered by the same core that did the
>>>> enqueue. However, drivers will trigger __netif_schedule() when
>>>> queues are transitioning from stopped to awake using the
>>>> netif_tx_wake_* APIs. When this happens, netif_schedule() calls
>>>> qdisc_run() on the same CPU that did the netif_tx_wake_*, which
>>>> is usually the interrupt completion context. That CPU is selected
>>>> by the IRQ affinity, which is unrelated to the enqueue operations.
>>>
>>> Interesting. Why is this unique to pfifo_fast? For me it could
>>> happen to other qdiscs too: when we release the qdisc root
>>> lock in sch_direct_xmit(), another CPU could dequeue from
>>> the same qdisc and transmit the skb in parallel too?
>>>
>>
>> Agreed, my guess is it never happens because the timing is
>> tighter in the locked case. Or, if it is happening, it is
>> infrequent enough that no one has noticed the OOO packets.
>
> I think the above could not happen due to the qdisc seqlock - which is
> not acquired by NOLOCK qdiscs.
>

Yep, seems to be the case.

>> For net-next we probably could clean this up. I was just
>> going for something simple in net that didn't penalize all
>> qdiscs, as Eric noted. This patch doesn't make it any worse,
>> at least, and we have been living with the above race for
>> years.
>
> I've benchmarked this patch in some different scenarios, and in my
> testing it introduces a measurable regression in uncontended/lightly
> contended scenarios. The measured peak negative delta is with a pktgen
> thread using "xmit_mode queue_xmit":
>
> before: 27674032 pps
> after:  23809052 pps

Yeah, more atomic ops :/

> I spent some time searching for a way to improve this, without
> success.
>
> John, did you have any chance to look at this again?

If we have multiple cores pulling from the same skb list and feeding
the same txq, this happens. One problem is that even if the normal
dev_queue_xmit path is aligned, drivers call netif_schedule from
interrupt context, and that happens on an arbitrary CPU. When the
arbitrary CPU runs the netif_schedule logic, it will dequeue from the
skb list on whatever CPU it was scheduled on. The lockless case is not
_really_ lockless; what this patch does is pull the enqueue and the
dequeue serialization apart.
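
To convince myself the window really is just dequeue->xmit, I find a
toy userspace repro of the same shape useful. This is my harness, not
kernel code - the pop is serialized like the skb list, but nothing
orders the pop against hitting the "wire":

/* Two "dequeue" threads race from a serialized source to a single
 * "wire".  Build: gcc -O2 -pthread -o ooo ooo.c
 */
#include <pthread.h>
#include <stdio.h>

#define NPKTS 1000000

static int next_seq;                    /* ordered source, the "qdisc" */
static pthread_mutex_t pop_lock = PTHREAD_MUTEX_INITIALIZER;

static int wire[NPKTS];                 /* observed transmit order */
static int wire_idx;
static pthread_mutex_t wire_lock = PTHREAD_MUTEX_INITIALIZER;

static void *consumer(void *arg)
{
        (void)arg;
        for (;;) {
                int seq;

                /* the dequeue itself is atomic, like the skb list pop ... */
                pthread_mutex_lock(&pop_lock);
                seq = next_seq < NPKTS ? next_seq++ : -1;
                pthread_mutex_unlock(&pop_lock);
                if (seq < 0)
                        return NULL;

                /* ... but nothing orders the dequeue vs. the ndo op */
                pthread_mutex_lock(&wire_lock);
                wire[wire_idx++] = seq;
                pthread_mutex_unlock(&wire_lock);
        }
}

int main(void)
{
        pthread_t a, b;
        int i, ooo = 0;

        pthread_create(&a, NULL, consumer, NULL);
        pthread_create(&b, NULL, consumer, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);

        for (i = 1; i < NPKTS; i++)
                if (wire[i] < wire[i - 1])
                        ooo++;
        printf("%d out-of-order transmits\n", ooo);
        return 0;
}

On any SMP box this typically reports a non-zero out-of-order count,
which is exactly the reordering we see when the wake path runs the
second dequeuer on a foreign CPU.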
Thanks for bringing this up. I'll think about it for a bit; maybe
there is something we can do here. There is a set of conditions that,
if met, would let us run without the lock. Possibly ONETXQUEUE and an
aligned cpu_map is sufficient. We could detect this case and drop the
locking - the P.S. below sketches the check I have in mind. For
existing systems and high-Gbps NICs I think (feel free to correct me)
assuming a queue per core is OK. At some point, though, we probably
need to revisit this assumption.

.John

> Thanks,
>
> Paolo
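
P.S. Roughly what I mean by detecting the aligned case. Completely
untested, qdisc_may_skip_txlock() is a made-up name, and the XPS and
IRQ-affinity checks are only hinted at in the comments - the flags and
fields it reads are the real ones:

static bool qdisc_may_skip_txlock(struct Qdisc *q)
{
        struct net_device *dev = qdisc_dev(q);

        if (!(q->flags & TCQ_F_NOLOCK))
                return false;

        /* a single skb list feeding a single txq ... */
        if (!(q->flags & TCQ_F_ONETXQUEUE))
                return false;

        /* ... and enough txqs that every CPU can own one.  A real
         * version would also have to verify the XPS cpu_map is a
         * strict 1:1 cpu->queue mapping and that the driver IRQ
         * affinity lines up, otherwise the netif_tx_wake_* path can
         * still run the dequeue on a foreign CPU.
         */
        return dev->real_num_tx_queues >= nr_cpu_ids;
}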