On Thu, 13 Sep 2018 11:18:15 +0300, Eran Ben Elisha wrote:
> The health spec is targeted for Real Time Alerting, in order to know when
> something bad had happened to a PCI device

By spec you mean some standards body spec you implement or this proposal
is a spec?

> - Provide alert debug information
> - Self healing
> - If problem needs vendor support, provide a way to gather all needed
>   debugging information.
>
> The health contains sensors which sense for malfunction. Once sensor
> triggered, actions such as logs and correction can be taken.
> Sensors are sensing the health state and can trigger correction action.
>
> The sensors are divided into the following groups
> - Hardware sensor - a sensor which is triggered by the device due to
>   malfunction.
> - Software sensor - a sensor which is triggered by the software due to
>   malfunction.
> Both group of sensors can be triggered due to error event or due to a
> periodic check.
>
> Actions are the way to handle sensor events. Action can be in one of the
> following groups:
> - Dump - SW trace, SW dump, HW trace, HW dump
> - Reset - Surgical correction (e.g. modify Q, flush Q, reset of device, etc)
> Actions can be performed by SW or HW.
>
> User is allowed to enable or disable sensors and sensor2action mapping.
>
> This RFC man page patch describes the suggested API of devlink-health in order
> to control sensors and actions.

I like the idea of configuring response to events like this, although
I'm not sure the name sensor is appropriate here - perhaps exception or
error would be better?  Are there going to be values reported?  I'm not
so sure about HW sensors in relation to existing HWMON infrastructure...
I assume you're targeting things like say some HW engine/block reporting
it encountered an error?

Sounds good, too.  Are the actions all envisioned to be performed by the
driver?  Firmware?  Hardware?  I guess that distinction can be added
later.  For FW/HW actions we would go back to the problem of persistence
of the setting since it was only implemented for params :S

Is the dump option going to tie back into region snapshots?
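
For reference while reading the quoted description, here is a rough sketch of the sensor/action model as I understand it from the text above: sensors are either HW or SW, can be enabled or disabled, and each one maps to a set of actions (dump, reset). Every name below is hypothetical and made up for illustration; none of it is taken from the RFC patch or from existing devlink code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the proposed devlink API. */

enum health_sensor_kind {
	HEALTH_SENSOR_HW,	/* triggered by the device */
	HEALTH_SENSOR_SW,	/* triggered by software */
};

enum health_action {
	HEALTH_ACTION_NONE  = 0,
	HEALTH_ACTION_DUMP  = 1u << 0,	/* SW/HW trace or dump */
	HEALTH_ACTION_RESET = 1u << 1,	/* surgical correction up to device reset */
};

struct health_sensor {
	const char *name;
	enum health_sensor_kind kind;
	bool enabled;		/* user can enable or disable the sensor */
	unsigned int actions;	/* bitmask of enum health_action */
};

/* Add or remove one action in the sensor2action mapping. */
static void health_sensor_map(struct health_sensor *s,
			      unsigned int action, bool on)
{
	if (on)
		s->actions |= action;
	else
		s->actions &= ~action;
}

/*
 * Return the set of actions to run when the sensor fires.
 * A disabled sensor yields no actions regardless of its mapping.
 */
static unsigned int health_sensor_trigger(const struct health_sensor *s)
{
	if (s == NULL || !s->enabled)
		return HEALTH_ACTION_NONE;
	return s->actions;
}
```

With this shape, a SW sensor mapped to both dump and reset would return both bits from health_sensor_trigger() while enabled, and nothing once disabled. Whether the mapping should live in the driver, FW or HW is exactly the open question above; the sketch takes no position on it.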