23/07/2021 14:33, Ferruh Yigit:
> On 7/22/2021 4:46 PM, Thomas Monjalon wrote:
> > 22/07/2021 15:50, fengchengwen:
> >> Hi, all
> >>
> >> I notice ethdev support dev_reset ops, which could be used to recover from
> >> errors, and only 13+ drivers support this function.
>
> 'rte_eth_dev_reset()' can be used to reset device config to defaults, not have
> to be for error recovering.
>
> >> And also there is event for reset: RTE_ETH_EVENT_INTR_RESET, and only 6
> >> drivers support it (most of them are VF).
> >>
> >> This provides users with two ways to handle hardware errors:
> >> a. driver report RTE_ETH_EVENT_INTR_RESET, and application do reset ops.
> >> b. application detect errors (the detection method is unclear), and call
> >>    reset ops to recover.
> >>
> >> According to the design of this API, error handling is assigned to the
> >> application, and the driver is only responsible for reporting events. This
> >> simplifies the driver design (for example, the driver does not need to
> >> maintain mutex locks).
> >>
> >> As we know, many modern NICs come with firmware, have PCIE interfaces,
> >> support SR-IOV, the hardware errors can have: firmware reboot/PF reset/
> >> VF reset/FLR, but these errors(particularly firmware/PF) are not addressed
> >> in most drivers.
> >>
> >> Question 1: what do we think of these errors(particularly firmware/PF)? Do
> >> we think that the probability is very low and that there is no need to
> >> deal with them?
> >
> > Even rare errors must be managed.
> >
>
> +1
>
> >> Question 2: I prefer to put error handling in the application layer, because
> >> doing it in the driver can make the driver complex, but there is no app to
> >> register the INTR_RESET event handler. I think we can build a standard
> >> handler in testpmd, What do you think?
> >
> > Absolutely. As any ethdev API, it must be tested with testpmd.
> >
>
> Testpmd registers for RESET event, but when event received all it does is print
> a log, so there is not logic behind it.
>
> If the intention is to add a error handling logic into testpmd, my concern is it
> being too complex or too device specific.

It shows a problem in the API.
We don't have a clear generic recovery process.

> And if there is something to cleanup, or recover etc in application level, it
> makes sense application to receive the event and act on it. But if the
> reset/recover can be handled in the PMD, if possible transparently, I think that
> is better choice.
>
> Another thing is I am not sure if what the applications should do on the reset
> event clear or same for all PMDs, which is not good.

Indeed we should improve this area, and implement such logic in testpmd.
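
To make the discussion concrete, here is a rough sketch of what a generic, device-agnostic handler could look like on the application side. It is only a model, not testpmd code: the event callback records that a reset was requested, and the control loop later runs stop / reset / reconfigure / start. The names port_ops, on_reset_event, reset_is_pending, handle_pending_reset and MAX_PORTS are hypothetical, and the sketch assumes the reset is best deferred out of the event callback. In a real application the hooks would wrap rte_eth_dev_stop(), rte_eth_dev_reset(), the application's own configure and queue setup, and rte_eth_dev_start().

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PORTS 32 /* hypothetical limit for this sketch */

/* Hypothetical per-port hooks; a NULL hook is skipped. Each hook is
 * assumed to return 0 on success and a negative value on error. */
struct port_ops {
    int (*stop)(uint16_t port_id);
    int (*reset)(uint16_t port_id);
    int (*reconfigure)(uint16_t port_id);
    int (*start)(uint16_t port_id);
};

/* One pending flag per port; static storage starts out as false. */
static atomic_bool reset_pending[MAX_PORTS];

/* Event side: meant to be called from the RTE_ETH_EVENT_INTR_RESET
 * callback. It only records the request and never touches the device. */
int on_reset_event(uint16_t port_id)
{
    if (port_id >= MAX_PORTS)
        return -1;
    atomic_store(&reset_pending[port_id], true);
    return 0;
}

/* Reports whether a reset request is waiting for this port. */
bool reset_is_pending(uint16_t port_id)
{
    return port_id < MAX_PORTS && atomic_load(&reset_pending[port_id]);
}

/* Runs one hook, treating a missing hook as success. */
static int run_step(int (*step)(uint16_t), uint16_t port_id)
{
    return step != NULL ? step(port_id) : 0;
}

/* Control-loop side: returns 0 if nothing was pending, 1 after a
 * completed recovery, or the negative code of the first failing step. */
int handle_pending_reset(uint16_t port_id, const struct port_ops *ops)
{
    int ret;

    if (port_id >= MAX_PORTS || ops == NULL)
        return -1;
    /* Claim the request atomically so it is handled only once. */
    if (!atomic_exchange(&reset_pending[port_id], false))
        return 0;
    if ((ret = run_step(ops->stop, port_id)) < 0 ||
        (ret = run_step(ops->reset, port_id)) < 0 ||
        (ret = run_step(ops->reconfigure, port_id)) < 0 ||
        (ret = run_step(ops->start, port_id)) < 0)
        return ret;
    return 1;
}
```

The point of the split is that the sequence in handle_pending_reset() stays the same for every PMD, while anything device specific is confined to the hooks. Whether the reconfigure step can really be made generic is exactly the open question.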