On Wed, Mar 19, 2025 at 07:56:40AM -0700, Breno Leitao wrote:
> On Wed, Mar 19, 2025 at 03:41:37PM +0100, Eric Dumazet wrote:
> > On Wed, Mar 19, 2025 at 2:09 PM Breno Leitao <lei...@debian.org> wrote:
> > >
> > > Hello,
> > >
> > > I am experiencing an issue with the upstream kernel when it is compiled
> > > with debug options enabled, namely CONFIG_DEBUG_NET, CONFIG_KASAN, and
> > > CONFIG_LOCKDEP plus a few others. You can find the full configuration at
> > > ....
> > >
> > > Basically, running a `tc replace` takes 13-20 seconds to finish:
> > >
> > >     # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1234: mq
> > >     real    0m13.195s
> > >     user    0m0.001s
> > >     sys     0m2.746s
> > >
> > > While this is running, the machine loses network access completely. The
> > > machine's network becomes inaccessible for the 13 seconds above, which
> > > is far from ideal.
> > >
> > > Upon investigation, I found that the host is getting stuck in the
> > > following call path:
> > >
> > >     __qdisc_destroy
> > >     mq_attach
> > >     qdisc_graft
> > >     tc_modify_qdisc
> > >     rtnetlink_rcv_msg
> > >     netlink_rcv_skb
> > >     netlink_unicast
> > >     netlink_sendmsg
> > >
> > > The big offender here is rtnetlink_rcv_msg(), which is called with
> > > rtnl_lock held, in the following path:
> > >
> > >     static int tc_modify_qdisc() {
> > >             ...
> > >             netdev_lock_ops(dev);
> > >             err = __tc_modify_qdisc(skb, n, extack, dev, tca, tcm, &replay);
> > >             netdev_unlock_ops(dev);
> > >             ...
> > >     }
> > >
> > > So, the rtnl_lock is held for 13 seconds in the case above. I also
> > > traced that __qdisc_destroy() is called once per NIC queue, totalling
> > > 250 calls for the cards I am using.
> > >
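For context, the per-queue fan-out comes from the mq attach path: mq is a
per-TX-queue qdisc, and attaching it grafts one child qdisc onto every TX
queue, dropping the displaced qdisc each time. A simplified sketch of that
shape (illustrative only, not verbatim net/sched/sch_mq.c):

    /* Simplified sketch: each displaced qdisc goes through
     * qdisc_put() -> __qdisc_destroy(), hence one destroy per TX
     * queue (~250 on the NIC described above).
     */
    static void mq_attach(struct Qdisc *sch)
    {
            struct net_device *dev = qdisc_dev(sch);
            struct mq_sched *priv = qdisc_priv(sch);
            struct Qdisc *qdisc, *old;
            unsigned int ntx;

            for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
                    qdisc = priv->qdiscs[ntx];
                    old = dev_graft_qdisc(qdisc->dev_queue, qdisc);
                    if (old)
                            qdisc_put(old);
            }
    }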
> > > Ftrace output:
> > >
> > >     # perf ftrace --graph-opts depth=100,tail,noirqs -G rtnetlink_rcv_msg /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq | grep \\$
> > >
> > >      7) $ 4335849 us   |      } /* mq_init */
> > >      7) $ 4339715 us   |     } /* qdisc_create */
> > >     11) $ 15844438 us  |    } /* mq_attach */
> > >     11) $ 16129620 us  |   } /* qdisc_graft */
> > >     11) $ 20469368 us  |  } /* tc_modify_qdisc */
> > >     11) $ 20470448 us  | } /* rtnetlink_rcv_msg */
> > >
> > > In this case, rtnetlink_rcv_msg() took 20 seconds, and, while it was
> > > running, the NIC was not able to send any packets.
> > >
> > > Going one step further, this matches what I described above:
> > >
> > >     # perf ftrace --graph-opts depth=100,tail,noirqs -G rtnetlink_rcv_msg /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq | grep "\\@\|\\$"
> > >
> > >      7) $ 4335849 us   |      } /* mq_init */
> > >      7) $ 4339715 us   |     } /* qdisc_create */
> > >     14) @ 210619.0 us  |          } /* schedule */
> > >     14) @ 210621.3 us  |         } /* schedule_timeout */
> > >     14) @ 210654.0 us  |        } /* wait_for_completion_state */
> > >     14) @ 210716.7 us  |       } /* __wait_rcu_gp */
> > >     14) @ 210719.4 us  |      } /* synchronize_rcu_normal */
> > >     14) @ 210742.5 us  |     } /* synchronize_rcu */
> > >     14) @ 144455.7 us  |    } /* __qdisc_destroy */
> > >     14) @ 144458.6 us  |   } /* qdisc_put */
> > >     <snip>
> > >      2) @ 131083.6 us  |          } /* schedule */
> > >      2) @ 131086.5 us  |         } /* schedule_timeout */
> > >      2) @ 131129.6 us  |        } /* wait_for_completion_state */
> > >      2) @ 131227.6 us  |       } /* __wait_rcu_gp */
> > >      2) @ 131231.0 us  |      } /* synchronize_rcu_normal */
> > >      2) @ 131242.6 us  |     } /* synchronize_rcu */
> > >      2) @ 152162.7 us  |    } /* __qdisc_destroy */
> > >      2) @ 152165.7 us  |   } /* qdisc_put */
> > >     11) $ 15844438 us  |    } /* mq_attach */
> > >     11) $ 16129620 us  |   } /* qdisc_graft */
> > >     11) $ 20469368 us  |  } /* tc_modify_qdisc */
> > >     11) $ 20470448 us  | } /* rtnetlink_rcv_msg */
> > >
> > > From the stack trace, it appears that most of the time is spent waiting
> > > for the RCU grace period to free the qdisc (!?):
> > >
> > >     static void __qdisc_destroy(struct Qdisc *qdisc)
> > >     {
> > >             if (ops->destroy)
> > >                     ops->destroy(qdisc);
> > >
> > >             call_rcu(&qdisc->rcu, qdisc_free_cb);
> > >
> >
> > call_rcu() is asynchronous, this is very different from synchronize_rcu().
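To illustrate that distinction, a minimal sketch with made-up names (struct
foo and foo_free_cb() are hypothetical, not kernel source): synchronize_rcu()
blocks the caller until a full grace period has elapsed, while call_rcu()
only queues a callback and returns immediately.

    #include <linux/kernel.h>
    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct foo {
            struct rcu_head rcu;
            /* ... payload ... */
    };

    static void foo_free_cb(struct rcu_head *head)
    {
            kfree(container_of(head, struct foo, rcu));
    }

    /* Synchronous: the caller sleeps for a whole grace period (over
     * 100 ms per call in the trace above), once per object destroyed.
     */
    static void foo_destroy_sync(struct foo *f)
    {
            synchronize_rcu();
            kfree(f);
    }

    /* Asynchronous: returns at once; f is freed after the grace period
     * without blocking the caller.
     */
    static void foo_destroy_async(struct foo *f)
    {
            call_rcu(&f->rcu, foo_free_cb);
    }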
> That is a good point. The offender here is synchronize_rcu().

Should that be synchronize_net()?

                                                        Thanx, Paul

> > >     }
> > >
> > > So, from my newbie PoV, the issue can be summarized as follows:
> > >
> > >     netdev_lock_ops(dev);
> > >     __tc_modify_qdisc()
> > >       qdisc_graft()
> > >         for (i = 0; i < 255; i++)
> > >           qdisc_put()
> > >             __qdisc_destroy()
> > >               call_rcu()
> > >     }
> > >
> > > Questions:
> > >
> > > 1) I assume the egress traffic is blocked because we are modifying the
> > >    qdisc, which makes sense. How is this achieved? Is it related to
> > >    rtnl_lock?
> > >
> > > 2) Would it be beneficial to attempt qdisc_put() outside of the critical
> > >    section (rtnl_lock?) to prevent this freeze?
> > >
> >
> > It is unclear to me why you have synchronize_rcu() calls.
>
> This is coming from:
>
>     __qdisc_destroy() {
>             lockdep_unregister_key(&qdisc->root_lock_key) {
>                     ...
>                     /* Wait until is_dynamic_key() has finished
>                      * accessing k->hash_entry. */
>                     synchronize_rcu();
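That would also explain why only the debug build suffers: with CONFIG_LOCKDEP
each qdisc carries a dynamic lockdep key, and unregistering a dynamic key has
to wait out an RCU grace period, once per destroyed qdisc, all while rtnl_lock
is held. A rough sketch of the pairing (not verbatim kernel code; the
registration point is assumed here to be qdisc allocation):

    /* Rough sketch only. Without CONFIG_LOCKDEP,
     * lockdep_register_key()/lockdep_unregister_key() are empty stubs,
     * so the per-qdisc synchronize_rcu() disappears entirely.
     */
    struct Qdisc *qdisc_alloc(...)
    {
            ...
            lockdep_register_key(&sch->root_lock_key);      /* cheap */
            ...
    }

    static void __qdisc_destroy(struct Qdisc *qdisc)
    {
            ...
            /* Hides the synchronize_rcu() seen in the trace above,
             * once per destroyed qdisc, under rtnl_lock. */
            lockdep_unregister_key(&qdisc->root_lock_key);
            ...
            call_rcu(&qdisc->rcu, qdisc_free_cb);           /* asynchronous */
    }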