From: Felix Manlunas <felix.manlu...@cavium.com>
Date: Tue, 4 Apr 2017 19:26:57 -0700

> Detection of watchdog timeout of Octeon cores is flawed and susceptible to
> false alarms.  Refactor by removing the detection code, and in its place,
> leverage existing code that monitors for an indication from the NIC
> firmware that an Octeon core crashed; expand the meaning of the indication
> to "an Octeon core crashed or its watchdog timer expired".  Detection of
> watchdog timeout is now delegated to an exception handler in the NIC
> firmware; this is free of false alarms.
> 
> Also if there's an Octeon core crash or watchdog timeout:
> (1) Disable VF Ethernet links.
> (2) Decrement the module refcount by an amount equal to the number of
>     active VFs of the NIC whose Octeon core crashed or had a watchdog
>     timeout.  The refcount will continue to reflect the active VFs of
>     other liquidio NIC(s) (if present) whose Octeon cores are faultless.
> 
> Item (2) is needed to avoid the case of not being able to unload the driver
> because the module refcount is stuck at some non-zero number.  There is
> code that, in normal cases, decrements the refcount upon receiving a
> message from the firmware that a VF driver was unloaded.  But in
> exceptional cases like an Octeon core crash or watchdog timeout, arrival of
> that particular message from the firmware might be unreliable.  That
> normal-case code is changed to not touch the refcount in the exceptional
> case, to avoid contention (over the refcount) with the liquidio_watchdog
> kernel thread, which will carry out item (2).
> 
> Signed-off-by: Felix Manlunas <felix.manlu...@cavium.com>
> Signed-off-by: Derek Chickles <derek.chick...@cavium.com>

Applied, thanks.
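
For readers following the refcount reasoning in the quoted commit message, here is a minimal single-threaded sketch of items (1) and (2) and of the changed normal-case path. It is not the driver's code: `struct nic_model`, `module_refcount`, `vf_driver_loaded`, `vf_driver_unloaded`, and `handle_core_fault` are hypothetical names invented for illustration, and a plain integer stands in for the real module refcount.

```c
#include <stdbool.h>

/* Hypothetical stand-in for one liquidio NIC; not the driver's real struct. */
struct nic_model {
    int  active_vfs;    /* VFs of this NIC currently counted in the refcount */
    bool core_faulted;  /* set once a core crash or watchdog timeout is seen */
    bool vf_links_up;   /* state of the VF Ethernet links */
};

/* Hypothetical stand-in for the module refcount shared by all NICs. */
static int module_refcount;

/* A VF driver attached to this NIC: count it once. */
static void vf_driver_loaded(struct nic_model *nic)
{
    nic->active_vfs++;
    module_refcount++;
}

/* Normal case: the firmware reports that a VF driver was unloaded. */
static void vf_driver_unloaded(struct nic_model *nic)
{
    /* In the exceptional case the fault path owns the refcount,
     * so this path leaves it alone. */
    if (nic->core_faulted)
        return;

    if (nic->active_vfs > 0) {
        nic->active_vfs--;
        module_refcount--;
    }
}

/* Exceptional case: what the watchdog thread would do for a faulted NIC. */
static void handle_core_fault(struct nic_model *nic)
{
    if (nic->core_faulted)
        return;                          /* already handled once */

    nic->core_faulted = true;
    nic->vf_links_up = false;            /* item (1): disable VF links */
    module_refcount -= nic->active_vfs;  /* item (2): drop this NIC's VFs */
    nic->active_vfs = 0;
}
```

With two modeled NICs holding two and one VFs, calling `handle_core_fault` on the first leaves `module_refcount` at 1, so the healthy NIC's VF is still accounted for, and a late `vf_driver_unloaded` on the faulted NIC does not decrement a second time. Real kernel code would also have to deal with concurrency between the two paths, which this sketch ignores.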
