On Wed, Mar 19, 2025 at 8:08 AM Eric Dumazet <eduma...@google.com> wrote:
>
> On Wed, Mar 19, 2025 at 4:04 PM Paul E. McKenney <paul...@kernel.org> wrote:
>>
>> On Wed, Mar 19, 2025 at 07:56:40AM -0700, Breno Leitao wrote:
>> > On Wed, Mar 19, 2025 at 03:41:37PM +0100, Eric Dumazet wrote:
>> > > On Wed, Mar 19, 2025 at 2:09 PM Breno Leitao <lei...@debian.org> wrote:
>> > >
>> > > > Hello,
>> > > >
>> > > > I am experiencing an issue with the upstream kernel when compiled
>> > > > with debug capabilities, namely CONFIG_DEBUG_NET, CONFIG_KASAN, and
>> > > > CONFIG_LOCKDEP, plus a few others. You can find the full
>> > > > configuration at ....
>> > > >
>> > > > Basically, running a `tc replace` takes 13-20 seconds to finish:
>> > > >
>> > > >   # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1234: mq
>> > > >   real 0m13.195s
>> > > >   user 0m0.001s
>> > > >   sys  0m2.746s
>> > > >
>> > > > While this is running, the machine loses network access completely:
>> > > > it is unreachable for the 13 seconds above, which is far from ideal.
>> > > >
>> > > > Upon investigation, I found that the host is getting stuck in the
>> > > > following call path:
>> > > >
>> > > >   __qdisc_destroy
>> > > >   mq_attach
>> > > >   qdisc_graft
>> > > >   tc_modify_qdisc
>> > > >   rtnetlink_rcv_msg
>> > > >   netlink_rcv_skb
>> > > >   netlink_unicast
>> > > >   netlink_sendmsg
>> > > >
>> > > > The big offender here is rtnetlink_rcv_msg(), which is called with
>> > > > rtnl_lock held, in the following path:
>> > > >
>> > > >   static int tc_modify_qdisc() {
>> > > >           ...
>> > > >           netdev_lock_ops(dev);
>> > > >           err = __tc_modify_qdisc(skb, n, extack, dev, tca, tcm, &replay);
>> > > >           netdev_unlock_ops(dev);
>> > > >           ...
>> > > >   }
>> > > >
>> > > > So, the rtnl_lock is held for 13 seconds in the case above. I also
>> > > > traced that __qdisc_destroy() is called once per NIC queue, totaling
>> > > > 250 calls for the cards I am using.
>> > > >
>> > > > Ftrace output:
>> > > >
>> > > >   # perf ftrace --graph-opts depth=100,tail,noirqs -G rtnetlink_rcv_msg /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq | grep \\$
>> > > >    7) $ 4335849 us  |  } /* mq_init */
>> > > >    7) $ 4339715 us  |  } /* qdisc_create */
>> > > >   11) $ 15844438 us |  } /* mq_attach */
>> > > >   11) $ 16129620 us |  } /* qdisc_graft */
>> > > >   11) $ 20469368 us |  } /* tc_modify_qdisc */
>> > > >   11) $ 20470448 us |  } /* rtnetlink_rcv_msg */
>> > > >
>> > > > In this case, rtnetlink_rcv_msg() took 20 seconds, and, while it was
>> > > > running, the NIC was not able to send any packets.
>> > > >
>> > > > Going one step further, this matches what I described above:
>> > > >
>> > > >   # perf ftrace --graph-opts depth=100,tail,noirqs -G rtnetlink_rcv_msg /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq | grep "\\@\|\\$"
>> > > >    7) $ 4335849 us  |  } /* mq_init */
>> > > >    7) $ 4339715 us  |  } /* qdisc_create */
>> > > >   14) @ 210619.0 us |  } /* schedule */
>> > > >   14) @ 210621.3 us |  } /* schedule_timeout */
>> > > >   14) @ 210654.0 us |  } /* wait_for_completion_state */
>> > > >   14) @ 210716.7 us |  } /* __wait_rcu_gp */
>> > > >   14) @ 210719.4 us |  } /* synchronize_rcu_normal */
>> > > >   14) @ 210742.5 us |  } /* synchronize_rcu */
>> > > >   14) @ 144455.7 us |  } /* __qdisc_destroy */
>> > > >   14) @ 144458.6 us |  } /* qdisc_put */
>> > > >   <snip>
>> > > >    2) @ 131083.6 us |  } /* schedule */
>> > > >    2) @ 131086.5 us |  } /* schedule_timeout */
>> > > >    2) @ 131129.6 us |  } /* wait_for_completion_state */
>> > > >    2) @ 131227.6 us |  } /* __wait_rcu_gp */
>> > > >    2) @ 131231.0 us |  } /* synchronize_rcu_normal */
>> > > >    2) @ 131242.6 us |  } /* synchronize_rcu */
>> > > >    2) @ 152162.7 us |  } /* __qdisc_destroy */
>> > > >    2) @ 152165.7 us |  } /* qdisc_put */
>> > > >   11) $ 15844438 us |  } /* mq_attach */
>> > > >   11) $ 16129620 us |  } /* qdisc_graft */
>> > > >   11) $ 20469368 us |  } /* tc_modify_qdisc */
>> > > >   11) $ 20470448 us |  } /* rtnetlink_rcv_msg */
>> > > >
>> > > > From the stack trace, it appears that most of the time is spent
>> > > > waiting for the RCU grace period to free the qdisc (!?):
>> > > >
>> > > >   static void __qdisc_destroy(struct Qdisc *qdisc)
>> > > >   {
>> > > >           if (ops->destroy)
>> > > >                   ops->destroy(qdisc);
>> > > >
>> > > >           call_rcu(&qdisc->rcu, qdisc_free_cb);
>> > > >
>> > >
>> > > call_rcu() is asynchronous; this is very different from
>> > > synchronize_rcu().
>> >
>> > That is a good point. The offender here is synchronize_rcu().
>>
>> Should that be synchronize_net()?
>
> I think we should redesign lockdep_unregister_key() to work on a
> separately allocated piece of memory, then use kfree_rcu() in it.
>
> I.e. not embed a "struct lock_class_key" in the struct Qdisc, but a
> pointer to a separately allocated key.
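For context, a minimal sketch of the direction Eric describes (all names
below are hypothetical; this is an illustration, not an actual patch).
The idea is that the key lives in its own allocation, so the unregister
path can hand the memory to kfree_rcu() instead of making the caller
block in synchronize_rcu() once per queue:

	#include <linux/lockdep.h>
	#include <linux/rcupdate.h>
	#include <linux/slab.h>

	/* A lock_class_key in its own allocation, with an rcu_head so
	 * the memory can be reclaimed asynchronously. struct Qdisc
	 * would hold a pointer to this instead of embedding the key.
	 */
	struct dyn_lock_class_key {
		struct lock_class_key key;
		struct rcu_head rcu;
	};

	void lockdep_unregister_key_deferred(struct dyn_lock_class_key *dkey)
	{
		/* Unhash the key, as the current
		 * lockdep_unregister_key() does, but skip its
		 * synchronous grace-period wait ...
		 */

		/* ... and defer the reclaim instead: RCU readers that
		 * still hold a reference to the key see valid memory
		 * until a grace period has elapsed.
		 */
		kfree_rcu(dkey, rcu);
	}

The qdisc side would then be fully asynchronous: the qdisc itself is
already freed via call_rcu(&qdisc->rcu, qdisc_free_cb), and the key
would follow the same pattern rather than forcing one synchronous
grace-period wait per TX queue (~250 of them in this report).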
However, lockdep requires that the key object be static; see
static_obj() in kernel/locking/lockdep.c:

	/*
	 * Is this the address of a static object:
	 */
	#ifdef __KERNEL__
	static int static_obj(const void *obj)
	{
		unsigned long addr = (unsigned long) obj;

		if (is_kernel_core_data(addr))
			return 1;

		/*
		 * keys are allowed in the __ro_after_init section.
		 */
		if (is_kernel_rodata(addr))
			return 1;

I am afraid the best suggestion here is simply to disable LOCKDEP, which
is known to have a large overhead.

Thanks.
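P.S.: if disabling lockdep is the route taken, one way to do it on top
of the existing debug .config is sketched below (the exact set of
options that select LOCKDEP can vary between trees, so treat this as an
example rather than a recipe):

	# LOCKDEP is a selected symbol, so turn off its selectors
	# rather than LOCKDEP itself, then refresh the config:
	scripts/config --disable PROVE_LOCKING \
		       --disable DEBUG_LOCK_ALLOC \
		       --disable LOCK_STAT \
		       --disable DEBUG_LOCKDEP
	make olddefconfig

This keeps CONFIG_KASAN and CONFIG_DEBUG_NET intact while removing the
lockdep key registration/unregistration from the qdisc destroy path.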