On Wed, Mar 19, 2025 at 8:08 AM Eric Dumazet <eduma...@google.com> wrote:
>
>
>
> On Wed, Mar 19, 2025 at 4:04 PM Paul E. McKenney <paul...@kernel.org> wrote:
>>
>> On Wed, Mar 19, 2025 at 07:56:40AM -0700, Breno Leitao wrote:
>> > On Wed, Mar 19, 2025 at 03:41:37PM +0100, Eric Dumazet wrote:
>> > > On Wed, Mar 19, 2025 at 2:09 PM Breno Leitao <lei...@debian.org> wrote:
>> > >
>> > > > Hello,
>> > > >
>> > > > I am experiencing an issue with the upstream kernel when compiled with
>> > > > debug capabilities: CONFIG_DEBUG_NET, CONFIG_KASAN, and CONFIG_LOCKDEP,
>> > > > plus a few others. You can find the full configuration at ....
>> > > >
>> > > > Basically, when running a `tc replace`, it takes 13-20 seconds to finish:
>> > > >
>> > > >         # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1234: mq
>> > > >         real    0m13.195s
>> > > >         user    0m0.001s
>> > > >         sys     0m2.746s
>> > > >
>> > > > While this is running, the machine loses network access completely. The
>> > > > machine's network becomes inaccessible for the 13 seconds above, which
>> > > > is far from ideal.
>> > > >
>> > > > Upon investigation, I found that the host is getting stuck in the
>> > > > following call path:
>> > > >
>> > > >         __qdisc_destroy
>> > > >         mq_attach
>> > > >         qdisc_graft
>> > > >         tc_modify_qdisc
>> > > >         rtnetlink_rcv_msg
>> > > >         netlink_rcv_skb
>> > > >         netlink_unicast
>> > > >         netlink_sendmsg
>> > > >
>> > > > The big offender here is rtnetlink_rcv_msg(), which is called with
>> > > > rtnl_lock held in the following path:
>> > > >
>> > > >         static int tc_modify_qdisc() {
>> > > >                 ...
>> > > >                 netdev_lock_ops(dev);
>> > > >                 err = __tc_modify_qdisc(skb, n, extack, dev, tca,
>> > > >                                         tcm, &replay);
>> > > >                 netdev_unlock_ops(dev);
>> > > >                 ...
>> > > >         }
>> > > >
>> > > > So, the rtnl_lock is held for 13 seconds in the case above. I also
>> > > > traced that __qdisc_destroy() is called once per NIC queue, for a
>> > > > total of 250 calls on the cards I am using.
>> > > >
>> > > > Ftrace output:
>> > > >
>> > > >         # perf ftrace --graph-opts depth=100,tail,noirqs -G rtnetlink_rcv_msg /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq | grep \\$
>> > > >         7) $ 4335849 us  |        } /* mq_init */
>> > > >         7) $ 4339715 us  |      } /* qdisc_create */
>> > > >         11) $ 15844438 us |        } /* mq_attach */
>> > > >         11) $ 16129620 us |      } /* qdisc_graft */
>> > > >         11) $ 20469368 us |    } /* tc_modify_qdisc */
>> > > >         11) $ 20470448 us |  } /* rtnetlink_rcv_msg */
>> > > >
>> > > >         In this case, rtnetlink_rcv_msg() took 20 seconds, and while
>> > > >         it was running, the NIC was unable to send any packets.
>> > > >
>> > > > Going one step further, this matches what I described above:
>> > > >
>> > > >         # perf ftrace --graph-opts depth=100,tail,noirqs -G rtnetlink_rcv_msg /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq | grep "\\@\|\\$"
>> > > >
>> > > >         7) $ 4335849 us  |        } /* mq_init */
>> > > >         7) $ 4339715 us  |      } /* qdisc_create */
>> > > >         14) @ 210619.0 us |                      } /* schedule */
>> > > >         14) @ 210621.3 us |                    } /* schedule_timeout */
>> > > >         14) @ 210654.0 us |                  } /* wait_for_completion_state */
>> > > >         14) @ 210716.7 us |                } /* __wait_rcu_gp */
>> > > >         14) @ 210719.4 us |              } /* synchronize_rcu_normal */
>> > > >         14) @ 210742.5 us |            } /* synchronize_rcu */
>> > > >         14) @ 144455.7 us |            } /* __qdisc_destroy */
>> > > >         14) @ 144458.6 us |          } /* qdisc_put */
>> > > >         <snip>
>> > > >         2) @ 131083.6 us |                        } /* schedule */
>> > > >         2) @ 131086.5 us |                      } /* schedule_timeout */
>> > > >         2) @ 131129.6 us |                    } /* wait_for_completion_state */
>> > > >         2) @ 131227.6 us |                  } /* __wait_rcu_gp */
>> > > >         2) @ 131231.0 us |                } /* synchronize_rcu_normal */
>> > > >         2) @ 131242.6 us |              } /* synchronize_rcu */
>> > > >         2) @ 152162.7 us |            } /* __qdisc_destroy */
>> > > >         2) @ 152165.7 us |          } /* qdisc_put */
>> > > >         11) $ 15844438 us |        } /* mq_attach */
>> > > >         11) $ 16129620 us |      } /* qdisc_graft */
>> > > >         11) $ 20469368 us |    } /* tc_modify_qdisc */
>> > > >         11) $ 20470448 us |  } /* rtnetlink_rcv_msg */
>> > > >
>> > > > From the stack trace, it appears that most of the time is spent
>> > > > waiting for the RCU grace period to free the qdisc (!?):
>> > > >
>> > > >         static void __qdisc_destroy(struct Qdisc *qdisc)
>> > > >         {
>> > > >                 const struct Qdisc_ops *ops = qdisc->ops;
>> > > >                 ...
>> > > >                 if (ops->destroy)
>> > > >                         ops->destroy(qdisc);
>> > > >                 ...
>> > > >                 call_rcu(&qdisc->rcu, qdisc_free_cb);
>> > > >         }
>> > > >
>> > >
>> > > call_rcu() is asynchronous; this is very different from synchronize_rcu().
>> >
>> > That is a good point. The offender here is synchronize_rcu().
>>
>> Should that be synchronize_net()?
>
>
> I think we should redesign lockdep_unregister_key() to work on a separately
> allocated piece of memory, then use kfree_rcu() in it.
>
> I.e., not embed a "struct lock_class_key" in the struct Qdisc, but a pointer
> to a separately allocated one.
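
If I understand the suggestion correctly, the shape would be roughly the
following (a hypothetical sketch only; the struct and field names are made
up, and this is not how the code looks today):

```c
/* Hypothetical sketch of the idea above: give struct Qdisc a pointer to a
 * separately allocated key object instead of an embedded one, so that the
 * key's memory can be reclaimed with kfree_rcu() rather than by waiting
 * synchronously for a grace period. All names are illustrative.
 */
struct qdisc_key {
	struct lock_class_key key;
	struct rcu_head rcu;
};

/* In struct Qdisc, replace the embedded key with:
 *	struct qdisc_key *key;
 * and on destroy, instead of a synchronous wait:
 *	kfree_rcu(qdisc->key, rcu);
 */
```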

Lockdep requires that the key object be static (see static_obj() in kernel/locking/lockdep.c):

/*
 * Is this the address of a static object:
 */
#ifdef __KERNEL__
static int static_obj(const void *obj)
{
        unsigned long addr = (unsigned long) obj;

        if (is_kernel_core_data(addr))
                return 1;

        /*
         * keys are allowed in the __ro_after_init section.
         */
        if (is_kernel_rodata(addr))
                return 1;
        ...

I am afraid the best suggestion here would be to just disable LOCKDEP,
which is known to have a large overhead.

Thanks.
