netvsc: retry VF hotplug indefinitely until PCI device disappears

Long Li Tue, 05 May 2026 19:06:49 -0700

> 
> On Tue,  5 May 2026 13:44:50 -0700
> Long Li <[email protected]> wrote:
> 
> > After PCI rescan on Azure, the MANA kernel driver can take over 100
> > seconds to probe and create the /sys/bus/pci/devices/<dev>/net directory.
> > The previous fixed retry limit (NETVSC_MAX_HOTADD_RETRY=10, ~12
> > seconds) was insufficient, causing VF re-attach to fail with 'Failed
> > to parse PCI device' on systems with slow MANA driver initialization.
> >
> > Replace the fixed retry limit with an indefinite retry that only gives
> > up when the PCI device itself disappears from sysfs. This is safe because:
> >
> > - The retry uses rte_eal_alarm callbacks which are serialized on the EAL
> >   interrupt thread, preventing races with VF remove or device close paths.
> > - Device close (eth_hn_dev_uninit) explicitly cancels all pending hotplug
> >   alarms via rte_eal_alarm_cancel and frees the context.
> > - If the PCI device is removed while retrying, access() detects the
> >   missing sysfs path and stops immediately.
> >
> > A periodic NOTICE log every 30 retries (~30s) provides visibility into
> > long waits without flooding the log at DEBUG level.
> >
> > Fixes: a2a23a794b3a ("net/netvsc: support VF device hot add/remove")
> > Cc: [email protected]
> > Signed-off-by: Long Li <[email protected]>
> > ---
> 
> Since this series touches lots of stuff, decided to do a more through than 
> normal
> AI review


I have sent v2, with all the comments addressed.

Thanks,
Long


> 
> Review of patch series: net/netvsc VF hotplug improvements (7 patches)
> 
> Series targets stale interface names, slow MANA driver probe, multi-VF PCI
> devices, and adds service-reset event handling. Reviewed against upstream main
> as of clone date.
> 
> =================================================================
> ========
> [PATCH 3/7] net/netvsc: retry full probe when IB device not ready during
>                        hotplug
> =================================================================
> ========
> 
> Warning:
> 
>   drivers/net/netvsc/hn_ethdev.c (around the new ENODEV/EEXIST handling)
> 
>   The original code logged failures from rte_eal_hotplug_add at
>   PMD_DRV_LOG(ERR). The patch lowers all non-ENODEV, non-EEXIST returns
>   to PMD_DRV_LOG(NOTICE):
> 
>       if (ret && ret != -EEXIST)
>               PMD_DRV_LOG(NOTICE,
>                           "Failed to add PCI device %s (ret=%d)",
>                           d->name, ret);
> 
>   -ENOMEM, -EBUSY, -EIO and similar errors from rte_eal_hotplug_add are
>   legitimate failures that operators need to see. Recommend keeping ERR
>   for unrecognized error codes; only -ENODEV and -EEXIST are now expected
>   outcomes worth a quieter log level.
> 
> 
> =================================================================
> ========
> [PATCH 4/7] net/netvsc: add NOTICE-level debug logging for VF hotplug
>                        retry
> =================================================================
> ========
> 
> Warning:
> 
>   drivers/net/netvsc/hn_ethdev.c, multiple sites in netvsc_hotplug_retry
> 
>   The new log calls inside the readdir() loop fire at NOTICE for every
>   directory entry on every retry pass:
> 
>     PMD_DRV_LOG(NOTICE, "%s: checking interface %s in %s (retry %d)", ...)
>     PMD_DRV_LOG(NOTICE, "%s: device %s sa_family=%d not
> ARPHRD_ETHER, ...")
>     PMD_DRV_LOG(NOTICE, "%s: MAC mismatch for %s: got ... expected ...")
> 
>   Combined with patch 1 (indefinite retries) and patch 5 (up to 120
>   mac_retry passes), a multi-NIC VM can produce hundreds of NOTICE lines
>   per VF re-attach. Per-iteration trace belongs at DEBUG; reserve NOTICE
>   for one-shot state transitions (loop start, match found, give-up).
> 
> Info:
> 
>   drivers/net/netvsc/hn_ethdev.c, MAC mismatch log
> 
>   The format string and byte arguments expand the MAC inline:
> 
>       "%02x:%02x:%02x:%02x:%02x:%02x ..."
>       eth_addr.addr_bytes[0], ..., eth_addr.addr_bytes[5],
>       dev->data->mac_addrs->addr_bytes[0], ...
> 
>   rte_ether.h provides RTE_ETHER_ADDR_PRT_FMT and
> RTE_ETHER_ADDR_BYTES()
>   for exactly this; using them is shorter and consistent with the rest
>   of DPDK.
> 
> Info:
> 
>   drivers/net/netvsc/hn_ethdev.c, MAC match block
> 
>   The new "} else {" follows an if-branch that already ends in break, so
>   the else is redundant — the log can sit unindented after the closing
>   brace of the if.
> 
> 
> =================================================================
> ========
> [PATCH 5/7] net/netvsc: retry when no matching MAC found in net directory
> =================================================================
> ========
> 
> Error:
> 
>   Commit message vs. code disagree on the retry budget.
> 
>   Commit message:
>       "This uses a separate mac_retry counter (limit 30, ~30 seconds) so
>        the main retry loop remains unlimited."
> 
>   Code:
>       /* Max retries when net/ directory exists but no matching MAC found.
>        * On multi-NIC PCI devices, a second VF may register later.
>        * 120 retries = ~2 minutes.
>        */
>       #define NETVSC_MAX_MAC_RETRY 120
> 
>   Either the commit message or the macro value needs to be corrected so
>   they agree. With NETVSC_HOTADD_RETRY_INTERVAL = 1s, 120 retries is
>   ~2 min, not ~30 s.
> 
> Info:
> 
>   drivers/net/netvsc/hn_ethdev.c, in the new "no MAC match" block
> 
>       if (di) {
>               closedir(di);
>               di = NULL;
> 
>   DPDK style requires explicit pointer comparison: "if (di != NULL)".
> 
> 
> =================================================================
> ========
> [PATCH 6/7] net/netvsc: forward per-queue stats from VF device
> =================================================================
> ========
> 
> Warning:
> 
>   drivers/net/netvsc/hn_vf.c, hn_vf_stats_get
> 
>       if (vf_dev->dev_ops->stats_get)
>               ret = vf_dev->dev_ops->stats_get(vf_dev, stats, qstats);
>       else
>               ret = rte_eth_stats_get(vf_dev->data->port_id, stats);
> 
>   Reaching directly into another ethdev's dev_ops->stats_get is not a
>   pattern used anywhere else in DPDK (bonding, failsafe, etc. all go
>   through rte_eth_stats_get). The shortcut bypasses three things the
>   public wrapper does:
> 
>     - RTE_ETH_VALID_PORTID_OR_ERR_RET on the VF port_id
>     - memset(stats, 0) and stats->rx_nombuf = data->rx_mbuf_alloc_failed
>     - eth_err() return-code translation
> 
>   The memset is harmless here because the outer ethdev wrapper has
>   already zeroed stats/qstats for the netvsc port. The rx_nombuf
>   attribution, however, changes: previously stats->rx_nombuf was
>   pre-loaded with the VF's data->rx_mbuf_alloc_failed before the VF's
>   callback ran; now it's whatever was set for the netvsc port. Worth
>   documenting in the commit message if intentional.
> 
>   Also, the dev_ops pointer test uses an implicit comparison:
> 
>       if (vf_dev->dev_ops->stats_get)
> 
>   Per DPDK style this should be "!= NULL".
> 
>   A cleaner long-term fix would be to extend the ethdev API so per-queue
>   stats can be forwarded through the public path, or wrap this dispatch
>   in a small helper with a comment explaining why the public API is
>   insufficient.
> 
> Info:
> 
>   drivers/net/netvsc/hn_vf.c
> 
>   Some VF drivers may return -ENOTSUP or partial data when called with a
>   non-NULL qstats. The patch unconditionally propagates that return,
>   which could turn a previously-successful stats_get into a failure for
>   netvsc users on certain VF drivers. A graceful fallback to
>   rte_eth_stats_get on -ENOTSUP would preserve backward compatibility.
> 
> 
> =================================================================
> ========
> [PATCH 7/7] net/netvsc: handle VF recovery events for service reset
> =================================================================
> ========
> 
> Warning:
> 
>   drivers/net/netvsc/hn_vf.c, hn_eth_recovery_success_callback
> 
>       rte_rwlock_write_lock(&hv->vf_lock);
>       if (hv->vf_ctx.vf_attached && !hv->vf_ctx.vf_vsc_switched) {
>               ret = hn_nvs_set_datapath(hv, NVS_DATAPATH_VF);
>               ...
>       }
> 
>   hn_vf_add_unlocked guards the equivalent NVS_DATAPATH_VF switch with
>   "if (dev->data->dev_started)" precisely to avoid routing traffic to
>   the VF before queues are configured. The new callback omits that
>   check. If the user calls rte_eth_dev_stop() on the netvsc port during
>   the ERR_RECOVERING -> RECOVERY_SUCCESS window, this callback will
>   switch the host data path to a VF whose DPDK queues may have been
>   torn down. Recommend mirroring the dev_started check.
> 
> Warning:
> 
>   drivers/net/netvsc/hn_vf.c, hn_eth_recovering_callback and
>                               hn_eth_recovery_success_callback
> 
>   Both new callbacks acquire hv->vf_lock as a writer directly in the
>   event-callback context. The existing hn_eth_rmv_event_callback
>   intentionally does not take vf_lock — it defers via:
> 
>       rte_eal_alarm_set(1, hn_remove_delayed, hv);
> 
>   to break a possible lock-order coupling with the calling driver (the
>   PMD that fires rte_eth_dev_callback_process may itself be holding an
>   internal lock). Taking vf_lock directly inside the recovery callbacks
>   introduces a different lock-ordering invariant from the rest of the
>   file. Consider either deferring through rte_eal_alarm_set as well, or
>   adding a comment explaining why direct acquisition is safe in the
>   MANA -> netvsc invocation path.
> 
> Info:
> 
>   drivers/net/netvsc/hn_vf.c, hn_eth_recovery_failed_callback
> 
>       rte_eal_alarm_set(1, hn_remove_delayed, hv);
> 
>   The state of hv->vf_ctx.vf_attached is not checked. If RECOVERY_FAILED
>   arrives after a concurrent INTR_RMV has already detached the VF, the
>   alarm will fire hn_remove_delayed against an already-removed port,
>   generating a spurious "Start to remove port 0" log and downstream
>   unregister failures. The damage is limited because hn_remove_delayed
>   takes the lock and re-checks state, but the noise is avoidable with a
>   vf_attached guard before scheduling the alarm.
> 
> 
> =================================================================
> ========
> General notes on the series
> =================================================================
> ========
> 
> Patches 1, 2, and 5 together turn netvsc_hotplug_retry into a genuinely-
> unbounded loop in two dimensions:
> 
>   - opendir failures retry indefinitely (patch 1, capped only by PCI
>     device disappearance)
>   - SIOCGIFHWADDR failures and -ENODEV from rte_eal_hotplug_add reschedule
>     without incrementing any counter (patches 2, 3)
>   - "no matching MAC" retries are bounded by mac_retry / 120 (patch 5)
> 
> Only patch 5's path has a hard upper bound. The other paths rely on the PCI
> device eventually disappearing if the VF never comes up. On a permanently
> broken VF that stays present in sysfs but never registers a netdev, the EAL 
> alarm
> thread will spin forever. Worth confirming that this is the intended 
> behaviour, or
> adding an overall ceiling (e.g., a much higher cap than 30s but still finite) 
> for the
> opendir / ioctl / ENODEV paths.

RE: [EXTERNAL] Re: [PATCH 1/7] net/netvsc: retry VF hotplug indefinitely until PCI device disappears

Reply via email to