On Mon, Jun 15, 2020 at 3:21 PM David Christensen <d...@linux.vnet.ibm.com> wrote: > > On 6/15/20 1:45 PM, Michael Chan wrote: > > On Mon, Jun 15, 2020 at 12:01 PM David Christensen > > <d...@linux.vnet.ibm.com> wrote: > >> > >> The driver function tg3_io_error_detected() calls napi_disable twice, > >> without an intervening napi_enable, when the number of EEH errors exceeds > >> eeh_max_freezes, resulting in an indefinite sleep while holding rtnl_lock. > >> > >> The function is called once with the PCI state pci_channel_io_frozen and > >> then called again with the state pci_channel_io_perm_failure when the > >> number of EEH failures in an hour exceeds eeh_max_freezes. > >> > >> Protecting the calls to napi_enable/napi_disable with a new state > >> variable prevents the long sleep. > > > > This works, but I think a simpler fix is to check tp->pcierr_recovery > > in tg3_io_error_detected() and skip most of the tg3 calls (including > > the one that disables NAPI) if the flag is true. > > This might be the smallest change that would work. Does it make sense > to the reader? > > diff --git a/drivers/net/ethernet/broadcom/tg3.c > b/drivers/net/ethernet/broadcom/tg3.c > index 7a3b22b35238..1f37c69d213d 100644 > --- a/drivers/net/ethernet/broadcom/tg3.c > +++ b/drivers/net/ethernet/broadcom/tg3.c > @@ -18168,8 +18168,8 @@ static pci_ers_result_t > tg3_io_error_detected(struct pci_dev *pdev, > > rtnl_lock(); > > - /* We probably don't have netdev yet */ > - if (!netdev || !netif_running(netdev)) > + /* May be second call or maybe we don't have netdev yet */ > + if (tp->pcierr_recovery || !netdev || !netif_running(netdev))
Dereferencing tp needs to be done after checking netdev. If we don't have netdev, tp won't be valid. Other than that, I think the logic looks good and is quite clear. Thanks.