On Wed, Mar 19, 2025 at 04:08:48PM +0100, Eric Dumazet wrote: > On Wed, Mar 19, 2025 at 4:04 PM Paul E. McKenney <paul...@kernel.org> wrote: > > > On Wed, Mar 19, 2025 at 07:56:40AM -0700, Breno Leitao wrote: > > > On Wed, Mar 19, 2025 at 03:41:37PM +0100, Eric Dumazet wrote: > > > > On Wed, Mar 19, 2025 at 2:09 PM Breno Leitao <lei...@debian.org> > > wrote: > > > > > > > > > Hello, > > > > > > > > > > I am experiencing an issue with upstream kernel when compiled with > > debug > > > > > capabilities. They are CONFIG_DEBUG_NET, CONFIG_KASAN, and > > > > > CONFIG_LOCKDEP plus a few others. You can find the full > > configuration at > > > > > .... > > > > > > > > > > Basically when running a `tc replace`, it takes 13-20 seconds to > > finish: > > > > > > > > > > # time /usr/sbin/tc qdisc replace dev eth0 root handle > > 0x1234: mq > > > > > real 0m13.195s > > > > > user 0m0.001s > > > > > sys 0m2.746s > > > > > > > > > > While this is running, the machine loses network access completely. > > The > > > > > machine's network becomes inaccessible for 13 seconds above, which > > is far > > > > > from > > > > > ideal. > > > > > > > > > > Upon investigation, I found that the host is getting stuck in the > > following > > > > > call path: > > > > > > > > > > __qdisc_destroy > > > > > mq_attach > > > > > qdisc_graft > > > > > tc_modify_qdisc > > > > > rtnetlink_rcv_msg > > > > > netlink_rcv_skb > > > > > netlink_unicast > > > > > netlink_sendmsg > > > > > > > > > > The big offender here is rtnetlink_rcv_msg(), which is called with > > > > > rtnl_lock > > > > > in the follow path: > > > > > > > > > > static int tc_modify_qdisc() { > > > > > ... > > > > > netdev_lock_ops(dev); > > > > > err = __tc_modify_qdisc(skb, n, extack, dev, tca, > > tcm, > > > > > &replay); > > > > > netdev_unlock_ops(dev); > > > > > ... > > > > > } > > > > > > > > > > So, the rtnl_lock is held for 13 seconds in the case above. I also > > > > > traced that __qdisc_destroy() is called once per NIC queue, totalling > > > > > a total of 250 calls for the cards I am using. > > > > > > > > > > Ftrace output: > > > > > > > > > > # perf ftrace --graph-opts depth=100,tail,noirqs -G > > > > > rtnetlink_rcv_msg /usr/sbin/tc qdisc replace dev eth0 root handle > > 0x1: mq > > > > > | grep \\$ > > > > > 7) $ 4335849 us | } /* mq_init */ > > > > > 7) $ 4339715 us | } /* qdisc_create */ > > > > > 11) $ 15844438 us | } /* mq_attach */ > > > > > 11) $ 16129620 us | } /* qdisc_graft */ > > > > > 11) $ 20469368 us | } /* tc_modify_qdisc */ > > > > > 11) $ 20470448 us | } /* rtnetlink_rcv_msg */ > > > > > > > > > > In this case, the rtnetlink_rcv_msg() took 20 seconds, and, > > while > > > > > it > > > > > was running, the NIC was not being able to send any packet > > > > > > > > > > Going one step further, this matches what I described above: > > > > > > > > > > # perf ftrace --graph-opts depth=100,tail,noirqs -G > > > > > rtnetlink_rcv_msg /usr/sbin/tc qdisc replace dev eth0 root handle > > 0x1: mq > > > > > | grep "\\@\|\\$" > > > > > > > > > > 7) $ 4335849 us | } /* mq_init */ > > > > > 7) $ 4339715 us | } /* qdisc_create */ > > > > > 14) @ 210619.0 us | } /* schedule */ > > > > > 14) @ 210621.3 us | } /* schedule_timeout > > */ > > > > > 14) @ 210654.0 us | } /* > > > > > wait_for_completion_state */ > > > > > 14) @ 210716.7 us | } /* __wait_rcu_gp */ > > > > > 14) @ 210719.4 us | } /* synchronize_rcu_normal > > */ > > > > > 14) @ 210742.5 us | } /* synchronize_rcu */ > > > > > 14) @ 144455.7 us | } /* __qdisc_destroy */ > > > > > 14) @ 144458.6 us | } /* qdisc_put */ > > > > > <snip> > > > > > 2) @ 131083.6 us | } /* schedule */ > > > > > 2) @ 131086.5 us | } /* > > schedule_timeout */ > > > > > 2) @ 131129.6 us | } /* > > > > > wait_for_completion_state */ > > > > > 2) @ 131227.6 us | } /* __wait_rcu_gp */ > > > > > 2) @ 131231.0 us | } /* > > synchronize_rcu_normal */ > > > > > 2) @ 131242.6 us | } /* synchronize_rcu */ > > > > > 2) @ 152162.7 us | } /* __qdisc_destroy */ > > > > > 2) @ 152165.7 us | } /* qdisc_put */ > > > > > 11) $ 15844438 us | } /* mq_attach */ > > > > > 11) $ 16129620 us | } /* qdisc_graft */ > > > > > 11) $ 20469368 us | } /* tc_modify_qdisc */ > > > > > 11) $ 20470448 us | } /* rtnetlink_rcv_msg */ > > > > > > > > > > From the stack trace, it appears that most of the time is spent > > waiting > > > > > for the > > > > > RCU grace period to free the qdisc (!?): > > > > > > > > > > static void __qdisc_destroy(struct Qdisc *qdisc) > > > > > { > > > > > if (ops->destroy) > > > > > ops->destroy(qdisc); > > > > > > > > > > call_rcu(&qdisc->rcu, qdisc_free_cb); > > > > > > > > > > > > > call_rcu() is asynchronous, this is very different from > > synchronize_rcu(). > > > > > > That is a good point. The offender is synchronize_rcu() is here. > > > > Should that be synchronize_net()? > > > > I think we should redesign lockdep_unregister_key() to work on a separately > allocated piece of memory, > then use kfree_rcu() in it. > > Ie not embed a "struct lock_class_key" in the struct Qdisc, but a pointer to > > struct ... { > struct lock_class_key; > struct rcu_head rcu; > }
Works for me! Thanx, Paul > > > > > } > > > > > > > > > > So, from my newbie PoV, the issue can be summarized as follows: > > > > > > > > > > netdev_lock_ops(dev); > > > > > __tc_modify_qdisc() > > > > > qdisc_graft() > > > > > for (i = 0; i < 255; i++) > > > > > qdisc_put() > > > > > ____qdisc_destroy() > > > > > call_rcu() > > > > > } > > > > > > > > > > Questions: > > > > > > > > > > 1) I assume the egress traffic is blocked because we are modifying > > the > > > > > qdisc, which makes sense. How is this achieved? Is it related to > > > > > rtnl_lock? > > > > > > > > > > 2) Would it be beneficial to attempt qdisc_put() outside of the > > critical > > > > > section (rtnl_lock?) to prevent this freeze? > > > > > > > > > > > > > > > > > > It is unclear to me why you have syncrhonize_rcu() calls. > > > > > > This is coming from: > > > > > > __qdisc_destroy() { > > > lockdep_unregister_key(&qdisc->root_lock_key) { > > > ... > > > /* Wait until is_dynamic_key() has finished > > accessing k->hash_entry. */ > > > synchronize_rcu(); > >