On Mon, Dec 07, 2020 at 03:00:40AM +0200, Vladimir Oltean wrote: > There is a very obvious deadlock here which happens when we have > bond-over-bond and the upper calls dev_get_stats from the lower. > > Conceptually, the same can happen even in any number of stacking > combinations between bonding, net_failover, [ insert any other driver > that takes net->netdev_lists_lock here ]. > > There would be two approaches trying to solve this issue: > - using mutex_lock_nested where we aren't sure that we are top level > - ensuring through convention that user space always takes > net->netdev_lists_lock when calling dev_get_stats, and documenting > that, and therefore making it unnecessary to lock in bonding. > > I took neither of the two approaches (I don't really like either one too > much), hence [ one of ] the reasons for the RFC. Comments?
And there are also issues which are more subtle (or maybe just to me, at the time I wrote the patch). Like the fact that the netdev adjacency lists are not protected by net->netdev_lists_lock, but still by the RTNL mutex and RCU. I think that in order for the iteration through lower interfaces to capture a consistent state of the adjancency lists of all interfaces, the __netdev_adjacent_dev_link_lists and __netdev_adjacent_dev_unlink_lists functions would need to be run under the net->netdev_lists_lock, and not just under some lock per-netdev. But this is raising the locking domain covered by net->netdev_lists_lock to more than I initially intended. I'll try to do this and see how it works.