Quoting Saeed Mahameed (2021-03-12 21:54:18) > On Fri, 2021-03-12 at 16:04 +0100, Antoine Tenart wrote: > > netif_set_xps_queue must be called with the rtnl lock taken, and this > > is > > now enforced using ASSERT_RTNL(). mlx5e_attach_netdev was taking the > > lock conditionally, fix this by taking the rtnl lock all the time. > > > > Signed-off-by: Antoine Tenart <aten...@kernel.org> > > --- > > drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 11 +++-------- > > 1 file changed, 3 insertions(+), 8 deletions(-) > > > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c > > b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c > > index ec2fcb2a2977..96cba86b9f0d 100644 > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c > > @@ -5557,7 +5557,6 @@ static void mlx5e_update_features(struct > > net_device *netdev) > > > > int mlx5e_attach_netdev(struct mlx5e_priv *priv) > > { > > - const bool take_rtnl = priv->netdev->reg_state == > > NETREG_REGISTERED; > > const struct mlx5e_profile *profile = priv->profile; > > int max_nch; > > int err; > > @@ -5578,15 +5577,11 @@ int mlx5e_attach_netdev(struct mlx5e_priv > > *priv) > > * 2. Set our default XPS cpumask. > > * 3. Build the RQT. > > * > > - * rtnl_lock is required by netif_set_real_num_*_queues in case > > the > > - * netdev has been registered by this point (if this function > > was called > > - * in the reload or resume flow). > > + * rtnl_lock is required by netif_set_xps_queue. > > */ > > There is a reason why it is conditional: > we had a bug in the past of double locking here: > > [ 4255.283960] echo/644 is trying to acquire lock: > > [ 4255.285092] ffffffff85101f90 (rtnl_mutex){+..}, at: > mlx5e_attach_netdev0xd4/0×3d0 [mlx5_core] > > [ 4255.287264] > > [ 4255.287264] but task is already holding lock: > > [ 4255.288971] ffffffff85101f90 (rtnl_mutex){+..}, at: > ipoib_vlan_add0×7c/0×2d0 [ib_ipoib] > > ipoib_vlan_add is called under rtnl and will eventually call > mlx5e_attach_netdev, we don't have much control over this in mlx5 > driver since the rdma stack provides a per-prepared netdev to attach to > our hw. maybe it is time we had a nested rtnl lock ..
Thanks for the explanation. So as you said, we can't based the locking decision only on the driver own state / information... What about `take_rtnl = !rtnl_is_locked();`? Thanks! Antoine