From: Felix Manlunas <felix.manlu...@cavium.com>
Date: Tue, 4 Apr 2017 19:26:57 -0700

> Detection of watchdog timeout of Octeon cores is flawed and susceptible to
> false alarms. Refactor by removing the detection code, and in its place,
> leverage existing code that monitors for an indication from the NIC
> firmware that an Octeon core crashed; expand the meaning of the indication
> to "an Octeon core crashed or its watchdog timer expired". Detection of
> watchdog timeout is now delegated to an exception handler in the NIC
> firmware; this is free of false alarms.
>
> Also if there's an Octeon core crash or watchdog timeout:
> (1) Disable VF Ethernet links.
> (2) Decrement the module refcount by an amount equal to the number of
>     active VFs of the NIC whose Octeon core crashed or had a watchdog
>     timeout. The refcount will continue to reflect the active VFs of
>     other liquidio NIC(s) (if present) whose Octeon cores are faultless.
>
> Item (2) is needed to avoid the case of not being able to unload the driver
> because the module refcount is stuck at some non-zero number. There is
> code that, in normal cases, decrements the refcount upon receiving a
> message from the firmware that a VF driver was unloaded. But in
> exceptional cases like an Octeon core crash or watchdog timeout, arrival of
> that particular message from the firmware might be unreliable. That normal
> case code is changed to not touch the refcount in the exceptional case to
> avoid contention (over the refcount) with the liquidio_watchdog kernel
> thread who will carry out item (2).
>
> Signed-off-by: Felix Manlunas <felix.manlu...@cavium.com>
> Signed-off-by: Derek Chickles <derek.chick...@cavium.com>

Applied, thanks.
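
For readers following the refcount accounting in item (2), here is a rough
userspace model of it. This is only a sketch, not the driver's code: struct
nic, module_refcount, vf_loaded(), vf_unloaded_msg() and core_fault_detected()
are all hypothetical names invented for illustration, and a plain int stands in
for the real module refcount.

```c
#include <stdbool.h>

/* Hypothetical stand-in for per-NIC state; not the liquidio driver's type. */
struct nic {
    int  active_vfs;   /* VFs of this NIC currently holding a reference */
    bool core_faulted; /* set once a crash or watchdog timeout is reported */
};

/* Plain int modelling the module refcount (assumption for illustration). */
static int module_refcount;

/* A VF driver loaded on this NIC: take one reference. */
static void vf_loaded(struct nic *n)
{
    n->active_vfs++;
    module_refcount++;
}

/* Normal case: the firmware reports that a VF driver was unloaded. */
static void vf_unloaded_msg(struct nic *n)
{
    /* Exceptional case: leave the refcount to the watchdog path. */
    if (n->core_faulted)
        return;

    if (n->active_vfs > 0) {
        n->active_vfs--;
        module_refcount--;
    }
}

/* Exceptional case, item (2): drop every reference held for this NIC's VFs. */
static void core_fault_detected(struct nic *n)
{
    if (n->core_faulted)
        return; /* already handled; never subtract twice */

    n->core_faulted = true;
    module_refcount -= n->active_vfs;
    n->active_vfs = 0;
}
```

In this model, with two VFs active on a faulting NIC and one on a healthy NIC,
the count drops from 3 to 1 on the fault, a late "VF unloaded" message for the
faulted NIC changes nothing, and the healthy NIC's VF can still bring it to 0.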