My preference is (b), even though I think stopping the NodeStore
service should be sufficient (it may not currently be sufficient, I
don't know).

In particular, I believe that "trying harder" is detrimental to the
overall stability of a cluster/topology. We are dealing with a
possibly faulty instance, so who gets to decide that it is OK again
after trying harder? The faulty instance itself?

"Read-only" doesn't sound too useful either, because that may fool
clients into thinking they are dealing with a "healthy" instance for
longer than necessary and thus can lead to bigger issues downstream.

I believe that "fail early and fail often" is the path to a stable cluster.
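To make the "fail early" idea concrete, here is a minimal, hypothetical
sketch (the LeaseGuard class and its method names are illustrative, not
Oak's actual API) of an instance rejecting all writes once its lease is
lost, instead of trying to recover:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: once the lease is lost, flip a flag and reject
// every subsequent write instead of "trying harder". All names here are
// illustrative and not part of Oak's real API.
class LeaseGuard {
    private final AtomicBoolean leaseValid = new AtomicBoolean(true);

    // called by the lease-update thread when renewal fails or times out
    void leaseFailed() {
        leaseValid.set(false);
    }

    // called at the start of every write operation
    void checkWriteAllowed() {
        if (!leaseValid.get()) {
            throw new IllegalStateException(
                    "lease lost - this instance must not write anymore");
        }
    }
}
```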

Regards
Julian

On Thu, Sep 10, 2015 at 6:43 PM, Stefan Egli <[email protected]> wrote:
> On 09/09/15 18:11, "Stefan Egli" <[email protected]> wrote:
>
>>On 09/09/15 18:01, "Stefan Egli" <[email protected]> wrote:
>>
>>>I think that if the observers were all 'OSGi-ified' then this could be
>>>achieved. But currently e.g. the BackgroundObserver is just a POJO and not
>>>an OSGi component (thus doesn't support any activate/deactivate method
>>>hooks).
>>
>>.. I take that back - going via OsgiWhiteboard should work as desired - so
>>perhaps implementing deactivate/activate methods in the
>>(Background)Observer(s) would do the trick .. I'll give it a try ..
>
> Out of the box this won't work, as the BackgroundObserver, as one example,
> is not an OSGi component, so it won't get any deactivate/activate calls at
> the moment. To achieve this it would have to be properly OSGi-ified -
> something which sounds like a bigger task and is not limited to this one
> class - which means making DocumentNodeStore 'restart capable' sounds like
> a bigger task too, and the question is indeed whether it is worthwhile
> ('will it work?') or whether there are alternatives..
>
> which brings me back to the original question as to what should be done in
> case of a lease failure - to recap the options left (if System.exit is not
> one of them) are:
>
> a) 'go read-only': prevent writes by throwing exceptions from this moment
> until eternity
>
> b) 'stop oak': stop the oak-core bundle (prevent writes by throwing
> exceptions for those still reaching out for the nodeStore)
>
> c) 'try harder': try to reacquire the lease - continue allowing writes -
> and make sure the next backgroundWrite has correctly updated the
> 'unsavedLastRevisions' (since others could have done a recovery of this
> node, unsavedLastRevisions may contain superfluous entries that must no
> longer be written). this would open the door to edge cases ('a longer time
> window with multiple leaders') but perhaps it is not entirely impossible...
>
> additionally/independently:
>
> * in all cases the discovery-lite descriptor should expose this lease
> failure/partitioning situation - so that anyone who would like to react
> can do so, and especially so that nobody keeps assuming that the local
> instance is the leader or part of the cluster - and to support that
> optional Sling Health Check which still does a System.exit :)
>
> * also, we should probably increase the lease thread's priority to reduce
> the likelihood of the lease timing out (same would be true for
> discovery.impl's heartbeat thread)
>
>
> * plus increasing the default lease time from 1min to perhaps 5min would
> also dramatically reduce the number of cases that run into problems
>
> wdyt?
>
> Cheers,
> Stefan
>
>
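The quoted suggestion to raise the lease thread's priority could look
roughly like the following sketch. This is hedged: the thread name and
the renewal body are placeholders I made up, not the actual Oak code.

```java
// Sketch: run lease renewal on a dedicated thread with elevated priority,
// so heavy background load is less likely to delay renewal past the timeout.
// The name and the loop body are placeholders, not Oak's real implementation.
Thread leaseThread = new Thread(() -> {
    // periodically renew the lease here (body omitted in this sketch)
});
leaseThread.setName("oak-lease-update");           // placeholder name
leaseThread.setDaemon(true);                       // don't block JVM shutdown
leaseThread.setPriority(Thread.NORM_PRIORITY + 2); // above default (5), below MAX (10)
leaseThread.start();
```

Note that priority is only a hint to the scheduler; it reduces but does not
eliminate the chance of the lease update being starved.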
