On Tue, Apr 29, 2025 at 04:20:21PM +0100, Simon Horman wrote:
> + Toke
> 
> On Mon, Apr 28, 2025 at 02:54:49PM +0300, Ian Ray wrote:
> > A rare [1] race condition is observed between the igb_watchdog_task and
> > shutdown on a dual-core i.MX6 based system with two I210 controllers.
> >
> > Using printk, the igb_watchdog_task is hung in igb_read_phy_reg because
> > __igb_shutdown has already called __igb_close.
> >
> > Fix this by locking in igb_watchdog_task (in the same way as is done in
> > igb_reset_task).
> >
> > reboot             kworker
> >
> > __igb_shutdown
> >   rtnl_lock
> >   __igb_close
> >   :                igb_watchdog_task
> >   :                :
> >   :                igb_read_phy_reg (hung)
> >   rtnl_unlock
> >
> > [1] Note that this is easier to reproduce with 'initcall_debug' logging
> > and additional and printk logging in igb_main.
> >
> > Signed-off-by: Ian Ray <[email protected]>
> 
> Hi Ian,
> 
> Thanks for your patch.
> 
> While I think that the simplicity of this approach may well be appropriate
> as a fix for the problem described I do have a concern.
> 
> I am worried that taking RTNL each time the watchdog tasks will create
> unnecessary lock contention. That may manifest in weird and wonderful ways
> in future.  Maybe this patch doesn't make things materially worse in that
> regard.  But it would be nice to have a plan to move away from using RTNL,
> as is happening elsewhere.
> 
> ...

Hi Simon,

Many thanks for the review.  I've been reflecting on the patch (and
discussing internally) and I think it would be better to model the
behaviour on igb_remove instead of igb_reset_task.  Meaning that the
timer should be deleted, and the work cancelled, after setting bit
IGB_DOWN.  This would mirror igb_up.  (And has the advantage of not
using the RTNL.)

(As you can probably tell) I am not very familiar with this subsystem,
but the modified proposal, below, works well in my testing.  I will
happily send a V2 if you think this is a better direction.

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 291348505868..d4b905469cc2 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2173,10 +2173,14 @@ void igb_down(struct igb_adapter *adapter)
        u32 tctl, rctl;
        int i;

-       /* signal that we're down so the interrupt handler does not
-        * reschedule our watchdog timer
+       /* The watchdog timer may be rescheduled, so explicitly
+        * disable watchdog from being rescheduled.
         */
        set_bit(__IGB_DOWN, &adapter->state);
+       del_timer_sync(&adapter->watchdog_timer);
+       del_timer_sync(&adapter->phy_info_timer);
+
+       cancel_work_sync(&adapter->watchdog_task);

        /* disable receives in the hardware */
        rctl = rd32(E1000_RCTL);
@@ -2207,11 +2211,6 @@ void igb_down(struct igb_adapter *adapter)
                }
        }

-       del_timer_sync(&adapter->watchdog_timer);
-       del_timer_sync(&adapter->phy_info_timer);
-
-       cancel_work_sync(&adapter->watchdog_task);
-
        /* record the stats before reset*/
        spin_lock(&adapter->stats64_lock);
        igb_update_stats(adapter);

Reply via email to