On 12/21/2017 04:03 PM, Cong Wang wrote:
> qdisc_reset() should always be called with qdisc spinlock
> and with BH disabled, otherwise qdisc ->reset() could race
> with TX BH.
>

hmm, I don't see how this fixes the issue. pfifo_fast is no longer
using the qdisc lock, so that doesn't help. And it is only a
local_bh_disable.
> Fixes: 7bbde83b1860 ("net: sched: drop qdisc_reset from dev_graft_qdisc")
> Reported-by: Jakub Kicinski <jakub.kicin...@netronome.com>
> Cc: John Fastabend <john.fastab...@gmail.com>
> Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
> ---
>  net/sched/sch_generic.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 10aaa3b615ce..00ddb5f8f430 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -1097,8 +1097,11 @@ static void dev_qdisc_reset(struct net_device *dev,
>  {
>  	struct Qdisc *qdisc = dev_queue->qdisc_sleeping;
>
> -	if (qdisc)
> +	if (qdisc) {
> +		spin_lock_bh(qdisc_lock(qdisc));
>  		qdisc_reset(qdisc);
> +		spin_unlock_bh(qdisc_lock(qdisc));
> +	}
>  }
>
>  /**

OK, first the cases to get to qdisc_reset() that I've tracked down are,

  dev_shutdown()
    qdisc_destroy()

  dev_deactivate_many()
    dev_qdisc_reset()   <- for each txq
      qdisc_reset()

  chained calls from qdisc_reset ops

At the moment none of the lockless qdiscs care about chained calls, so
we can ignore that, but it would be nice to keep in mind.

Next, qdisc_reset() is doing a couple of things: calling the qdisc ops
reset callback, but also walking gso_skb and skb_bad_txq. The 'unlink'
operations there are not safe to be called while an enqueue/dequeue op
is in-flight. Also, pfifo_fast's reset op is not safe to be called with
enqueue/dequeue ops in-flight.

So I've made the assumption that qdisc_reset() is _only_ ever called
after a qdisc is no longer attached on the enqueue dev_xmit side and
also after any in-progress tx_action calls have completed. For what
it's worth, this has always been the assumption AFAIK.

So those are the assumptions; what did I miss? The biggest gap I see is
that dev_deactivate_many() is supposed to wait for all tx_action calls
to complete, this bit:

	/* Wait for outstanding qdisc_run calls.
	 */
	list_for_each_entry(dev, head, close_list)
		while (some_qdisc_is_busy(dev))
			yield();

Where some_qdisc_is_busy() in the lockless case only does the following,

	if (q->flags & TCQ_F_NOLOCK) {
		val = test_bit(__QDISC_STATE_SCHED, &q->state);

Oops, that only tells us something is scheduled, nothing about whether
something is running. Excerpt from the tx_action side here,

	clear_bit(__QDISC_STATE_SCHED, &q->state);
	qdisc_run(q);

For comparison, here is the check in the qdisc-locked branch of
some_qdisc_is_busy(),

	root_lock = qdisc_lock(q);
	spin_lock_bh(root_lock);
	val = (qdisc_is_running(q) ||
	       test_bit(__QDISC_STATE_SCHED, &q->state));
	spin_unlock_bh(root_lock);

Previously we had the qdisc_lock() to protect the bit clear, and then
in the some_qdisc_is_busy() case we still did a qdisc_is_running()
call.

The easiest fix I see is to add the qdisc_is_running() check to the
TCQ_F_NOLOCK case in some_qdisc_is_busy(), AND to be correct I guess
we should set the running bit before clearing the __QDISC_STATE_SCHED
bit. Untested, but perhaps as simple as,

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index ab497ef..720829e 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1071,7 +1071,8 @@ static bool some_qdisc_is_busy(struct net_device *dev)
 		q = dev_queue->qdisc_sleeping;
 
 		if (q->flags & TCQ_F_NOLOCK) {
-			val = test_bit(__QDISC_STATE_SCHED, &q->state);
+			val = (qdisc_is_running(q) ||
+			       test_bit(__QDISC_STATE_SCHED, &q->state));
 		} else {
 			root_lock = qdisc_lock(q);
 			spin_lock_bh(root_lock);

Thanks,
John