On 7/23/21 3:33 PM, Ferruh Yigit wrote:
On 7/22/2021 4:46 PM, Thomas Monjalon wrote:
22/07/2021 15:50, fengchengwen:
Hi, all
I notice that ethdev supports the dev_reset op, which can be used to recover
from errors, but only 13+ drivers support it.
'rte_eth_dev_reset()' can also be used to reset the device config to defaults;
it does not have to be for error recovery.
There is also an event for reset, RTE_ETH_EVENT_INTR_RESET, and only 6
drivers support it (most of them are VF).
This provides users with two ways to handle hardware errors:
a. the driver reports RTE_ETH_EVENT_INTR_RESET, and the application does the
reset.
b. the application detects errors (the detection method is unclear) and calls
the reset op to recover.
According to the design of this API, error handling is assigned to the
application, and the driver is only responsible for reporting events. This
simplifies the driver design (for example, the driver does not need to maintain
mutex locks).
As we know, many modern NICs come with firmware, have PCIe interfaces, and
support SR-IOV. The hardware errors can include firmware reboot, PF reset,
VF reset, and FLR, but these errors (particularly firmware/PF) are not
addressed in most drivers.
Question 1: what do we think of these errors (particularly firmware/PF)? Do
we think that the probability is very low and that there is no need to deal with
them?
Even rare errors must be managed.
+1
+1
Question 2: I prefer to put error handling in the application layer, because
doing it in the driver can make the driver complex, but there is no app that
registers an INTR_RESET event handler. I think we can build a standard handler
in testpmd. What do you think?
Absolutely. As any ethdev API, it must be tested with testpmd.
Testpmd registers for the RESET event, but when the event is received all it
does is print a log, so there is no logic behind it.
If the intention is to add error handling logic to testpmd, my concern is that
it becomes too complex or too device specific.
Error recovery must not be device specific. Otherwise, every application
needs device-specific handling and will be hard to port across different
drivers.
And if there is something to clean up, recover, etc. at the application level,
it makes sense for the application to receive the event and act on it. But if
the reset/recovery can be handled in the PMD, if possible transparently, I
think that is the better choice.
The application should at least be notified to stop the datapath. IMHO it
would be better if the application controls the recovery using an easy and
well-defined algorithm:
- register to be notified about necessity to do the recovery
- receive event
- stop datapath
- do reset
- restore configuration, start
- start datapath
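
The steps above could be sketched roughly as follows. This is only a
self-contained model, not driver or testpmd code: struct port_ctx, the stub_*
helpers, on_reset_event() and recover_port() are all hypothetical names, and
the stubs stand in for the ethdev calls named in their comments. A real
application would register its handler with rte_eth_dev_callback_register()
for RTE_ETH_EVENT_INTR_RESET, and the real callback signature carries more
arguments than the simplified one here. The sketch also assumes the reset is
run from the application's control thread rather than from the callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-port application state; none of these names are DPDK API. */
struct port_ctx {
    volatile bool reset_pending; /* set by the event callback */
    bool datapath_running;       /* checked by the forwarding loop */
    bool dev_started;            /* mirrors the device start/stop state */
    bool dev_configured;         /* assumed lost when the device is reset */
};

/* Stand-in for rte_eth_dev_stop(). */
static int stub_dev_stop(struct port_ctx *p)
{
    p->dev_started = false;
    return 0;
}

/* Stand-in for rte_eth_dev_reset(). */
static int stub_dev_reset(struct port_ctx *p)
{
    p->dev_configured = false;
    return 0;
}

/* Stand-in for rte_eth_dev_configure() plus Rx/Tx queue setup. */
static int stub_dev_configure(struct port_ctx *p)
{
    p->dev_configured = true;
    return 0;
}

/* Stand-in for rte_eth_dev_start(); fails on an unconfigured device. */
static int stub_dev_start(struct port_ctx *p)
{
    if (!p->dev_configured)
        return -1;
    p->dev_started = true;
    return 0;
}

/* "receive event": the callback only records that a reset is needed. */
static void on_reset_event(struct port_ctx *p)
{
    assert(p != NULL);
    p->reset_pending = true;
}

/*
 * Remaining steps, run from the control thread.
 * Returns 1 if a recovery was done, 0 if none was pending, -1 on failure.
 */
static int recover_port(struct port_ctx *p)
{
    assert(p != NULL);
    if (!p->reset_pending)
        return 0;

    p->datapath_running = false;        /* stop datapath */
    if (stub_dev_stop(p) != 0 ||
        stub_dev_reset(p) != 0 ||       /* do reset */
        stub_dev_configure(p) != 0 ||   /* restore configuration */
        stub_dev_start(p) != 0)         /* start */
        return -1;

    p->reset_pending = false;
    p->datapath_running = true;         /* start datapath */
    return 1;
}
```

Nothing in this sequence depends on the device, which is the property being
asked for: the application only needs to know how to redo its own
configuration.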
Another thing is that I am not sure whether what applications should do on the
reset event is clear, or the same for all PMDs, which is not good.