On Sat, May 09, 2020 at 11:35:46AM -0700, Jakub Kicinski wrote:
> On Sat, 9 May 2020 04:35:37 +0000 Luis Chamberlain wrote:
> > Device driver firmware can crash, and sometimes, this can leave your
> > system in a state which makes the device or subsystem completely
> > useless. Detecting this by inspecting /proc/sys/kernel/tainted instead
> > of scraping some magical words from the kernel log, which is driver
> > specific, is much easier. So instead this series provides a helper which
> > lets drivers annotate this and shows how to use this on networking
> > drivers.
> >
> > My methodology for finding when firmware crashes is to git grep for
> > "crash" and then doing some study of the code to see if this is indeed
> > a place where the firmware crashes. In some places this is quite
> > obvious.
> >
> > I'm starting off with networking first, if this gets merged later on I
> > can focus on the other drivers, but I already have some work done on
> > other subsystems.
> >
> > Review, flames, etc are greatly appreciated.
>
> Tainting itself may be useful, but that's just the first step. I'd much
> rather see folks start using the devlink health infrastructure. Devlink
> is netlink based, but it's _not_ networking specific (many of its
> optional features obviously are, but don't let that mislead you).
>
> With devlink health we get (a) a standard notification on the failure;
> (b) information/state dump in a (somewhat) structured form, which can be
> collected & shared with vendors; (c) automatic remediation (usually
> device reset of some scope).

It indeed sounds very useful!

> Now regarding the tainting - as I said it may be useful, but don't we
> have to define what constitutes a "firmware crash"?

Yes indeed, I missed clarifying this in the documentation. I'll do so in
my next respin.

> There are many
> failure modes, some perfectly recoverable (e.g. processing queue hang),
> some mere bugs (e.g. device fails to initialize some functions). All of
> them may impact the functioning of the system. How do we choose those
> that taint?

It's up to the maintainers of the device driver. What I was aiming for
were those firmware crashes which indeed *can* have an impact on user
experience, and can *even* potentially require a driver removal /
addition to get things back in order again.

  Luis