On 3/8/2021 7:59 PM, Jakub Kicinski wrote:
On Mon, 8 Mar 2021 09:16:00 -0800 Jakub Kicinski wrote:
+ DLH_REMEDY_BAD_PART,
BAD_PART probably indicates that the reporter (or any command line
execution) cannot recover from the issue.
As the suggested remedy is static per reporter's recover method, it
doesn't make sense for one to set a recover method that by design cannot
recover successfully.
Maybe we should extend devlink_health_reporter_state with POWER_CYCLE,
REIMAGE and BAD_PART? To indicate to the user that a successful
recovery requires running a non-devlink-health operation?
Hm, export and extend devlink_health_reporter_state? I like that idea.
Trying to type it up it looks less pretty than expected.
Let's look at some examples.
A queue reporter, say "rx", resets the queue dropping all outstanding
buffers. As previously mentioned, when the normal remediation fails, the
user is expected to power cycle the machine or maybe swap the card. The
device itself does not have a crystal ball.
Not sure; reopening the queue or reinitializing the driver might also
help, for example when the issue is in the SW/HW queue context. But I
agree that the RX reporter can't tell from its perspective what further
escalation is needed if its locally defined operations fail.
A management FW reporter, "fw", has an auto recovery of FW reset
(REMEDY_RESET). On failure -> power cycle.
An "io" reporter (PCI link had to be trained down) can only return
a hardware failure (we should probably have a HW failure other than
BAD_PART for this).
Flash reporters - the device will know if the flash had a bad block
or the entire part is bad, so probably can have 2 reporters for this.
Most of the reporters would only report one "action" that can be
performed to fix them. The cartesian product of ->recovery types vs
manual recovery does not seem necessary. And drivers would get bloated
with additional boilerplate of returning ERROR_NEED_POWER_CYCLE for
_all_ cases with ->recovery. Because what else would the fix be if
software-initiated reset didn't work?
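
To make the "one action per reporter" point concrete, here is a minimal
userspace sketch, not kernel code. Every demo_* name below is
hypothetical and only models the idea of a remedy that is fixed when the
reporter's ops are defined, next to its recover callback.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical remedy kinds: only actions a reporter can perform itself. */
enum demo_remedy {
	DEMO_REMEDY_NONE,        /* reporter has no recover callback */
	DEMO_REMEDY_QUEUE_RESET, /* e.g. an "rx" queue reporter */
	DEMO_REMEDY_FW_RESET,    /* e.g. an "fw" reporter */
};

/* Hypothetical ops table: the remedy is static, set once per reporter. */
struct demo_reporter_ops {
	const char *name;
	enum demo_remedy remedy;
	int (*recover)(void *priv); /* 0 on success, negative on failure */
};

/* Stub callbacks; a real driver would reset the queue or the FW here. */
static int demo_rx_recover(void *priv) { (void)priv; return 0; }
static int demo_fw_recover(void *priv) { (void)priv; return 0; }

static const struct demo_reporter_ops demo_rx_ops = {
	.name    = "rx",
	.remedy  = DEMO_REMEDY_QUEUE_RESET,
	.recover = demo_rx_recover,
};

static const struct demo_reporter_ops demo_fw_ops = {
	.name    = "fw",
	.remedy  = DEMO_REMEDY_FW_RESET,
	.recover = demo_fw_recover,
};

/* A remedy is declared if and only if a recover callback exists. */
static int demo_ops_valid(const struct demo_reporter_ops *ops)
{
	assert(ops != NULL);
	return (ops->recover != NULL) == (ops->remedy != DEMO_REMEDY_NONE);
}
```

With this shape there is no per-error return code such as
ERROR_NEED_POWER_CYCLE to plumb through every ->recovery path; the
remedy is a property of the reporter, not of each failure.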
OK, I see your point.
If I got you right, these are the conclusions so far:
1. Each reporter with a recover callback will have to supply a remedy
definition.
2. We shouldn't have POWER_CYCLE, REIMAGE and BAD_PART as a remedy,
because these are not valid reporter recover flows in any case.
3. If a reporter fails to recover, its status shall remain as error,
and it is out of the reporter's scope to advise the administrator on
further actions.
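
As a rough illustration of conclusion 3, a userspace sketch under
assumed names (the sketch_* types and functions are hypothetical, not
the devlink API): on a failed recover the reporter simply stays in the
error state and returns the error, with no escalation hint attached.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical two-state model of a health reporter. */
enum sketch_state { SKETCH_STATE_HEALTHY, SKETCH_STATE_ERROR };

struct sketch_reporter {
	const char *name;
	enum sketch_state state;
	int (*recover)(struct sketch_reporter *r); /* NULL if no recovery */
};

/* Stub callbacks standing in for a driver's real recovery flow. */
static int sketch_recover_ok(struct sketch_reporter *r) { (void)r; return 0; }
static int sketch_recover_fail(struct sketch_reporter *r) { (void)r; return -1; }

/*
 * Record an error and try the reporter's own remedy once.
 * Returns 0 if the reporter is healthy again, non-zero otherwise.
 */
static int sketch_report_error(struct sketch_reporter *r)
{
	int err;

	assert(r != NULL);
	r->state = SKETCH_STATE_ERROR;

	if (!r->recover)
		return -1; /* nothing to try; state stays ERROR */

	err = r->recover(r);
	if (err)
		return err; /* state stays ERROR; no advice on next steps */

	r->state = SKETCH_STATE_HEALTHY;
	return 0;
}
```

What to do about a reporter that stays in the error state (power cycle,
reimage, swap the part) is left to the administrator.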