Hi

From: fengchengwen:
> On 2021/7/22 23:46, Thomas Monjalon wrote:
> > 22/07/2021 15:50, fengchengwen:
> >> Hi, all
> >>
> >>     I notice ethdev support dev_reset ops, which could be used to
> >> recover from errors, and only 13+ drivers support this function.
> >>     And also there is event for reset: RTE_ETH_EVENT_INTR_RESET, and
> >> only 6 drivers support it (most of them are VF).
> >>
> >>     This provides users with two ways to handle hardware errors:
> >>     a. driver report RTE_ETH_EVENT_INTR_RESET, and application do reset
> ops.
> >>     b. application detect errors (the detection method is unclear), and 
> >> call
> >>     reset ops to recover.
> >>
> >>     According to the design of this API, error handling is assigned
> >> to the application, and the driver is only responsible for reporting
> >> events. This simplifies the driver design (for example, the driver
> >> does not need to maintain mutex locks).
> >>
> >>     As we know, many modern NICs come with firmware, have PCIE
> >> interfaces, support SR-IOV, the hardware errors can have: firmware
> >> reboot/PF reset/ VF reset/FLR, but these errors(particularly
> >> firmware/PF) are not addressed in most drivers.
> >>
> >>     Question 1: what do we think of these errors(particularly
> >> firmware/PF)? Do we think that the probability is very low and that
> >> there is no need to deal with them?
> >
> > Even rare errors must be managed.
> 
> Because intel and mlx NIC are widely used, I look at i40e/mlx5 driver code,
> and found:
> 1) i40e PF driver, it only show logs when detect global reset and other error:
>         if (icr0 & I40E_PFINT_ICR0_GRST_MASK)
>                 PMD_DRV_LOG(INFO, "ICR0: global reset requested");
>         if (icr0 & I40E_PFINT_ICR0_PCI_EXCEPTION_MASK)
>                 PMD_DRV_LOG(INFO, "ICR0: PCI exception activated");
>         if (icr0 & I40E_PFINT_ICR0_STORM_DETECT_MASK)
>                 PMD_DRV_LOG(INFO, "ICR0: a change in the storm control 
> state");
>    @Beilei Why not report RESET_EVENT in these cases ? or these error are
> very rarely
>    so just report it. And also, the application may still send or recv packet,
> These
>    resets, if not handled correctly, do not cause the hardware/driver to hang 
> ?
> 
> 2) mlx5 PF driver, I notice there is a mlx5_dev_interrupt_device_fatal which
> will remove the device.
>    @Matan Why not report RESET_EVENT in these cases ? so the application
> can be recovered quickly.

RTE_ETH_EVENT_INTR_RMV is reported by the driver to notify the app that the 
device was physically plugged out from the PCI slot.
So, when the app sees this event, it should free all the SW resources of this 
device(call port close to the PMD by the way).

RTE_ETH_EVENT_INTR_RESET,       /**< reset interrupt event, sent to VF on PF 
reset */
Looks like VF-PF communication, this is not our case of plugged out which is 
more fatal.

Matan


> >
> >>     Question 2: I prefer to put error handling in the application
> >> layer, because doing it in the driver can make the driver complex,
> >> but there is no app to register the INTR_RESET event handler. I think
> >> we can build a standard handler in testpmd, What do you think?
> >
> > Absolutely. As any ethdev API, it must be tested with testpmd.
> >
> >
> > .
> >

Reply via email to