The lack of suggestions and thoughts encouraged me to create a ticket: https://issues.apache.org/jira/browse/IGNITE-6980
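For anyone triaging the ticket: several of the operations from the list in my original mail below already expose timeout knobs, while atomic updates do not. A rough sketch of the existing knobs, assuming the Ignite 2.x public API (withTimeout, the txStart overload with an explicit timeout, SqlFieldsQuery.setTimeout); illustrative only, not a recommendation:

    import java.util.concurrent.TimeUnit;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.query.SqlFieldsQuery;
    import org.apache.ignite.transactions.Transaction;
    import org.apache.ignite.transactions.TransactionConcurrency;
    import org.apache.ignite.transactions.TransactionIsolation;

    public class TimeoutKnobs {
        public static void main(String[] args) {
            try (Ignite ignite = Ignition.start()) {
                // Compute: the whole task fails with a timeout after 5 s.
                ignite.compute().withTimeout(5_000).run(() -> { /* work */ });

                // Transactional cache updates: timeout passed to txStart().
                try (Transaction tx = ignite.transactions().txStart(
                        TransactionConcurrency.PESSIMISTIC,
                        TransactionIsolation.REPEATABLE_READ,
                        5_000,   // timeout, ms
                        0)) {    // txSize hint (0 = unknown)
                    // ... cache updates here ...
                    tx.commit();
                }

                // SQL: per-query timeout, checked while the query runs.
                SqlFieldsQuery qry = new SqlFieldsQuery("select 1")
                    .setTimeout(5, TimeUnit.SECONDS); // pass to cache.query(qry)

                // Atomic cache updates: no equivalent knob today. That gap
                // is exactly what IGNITE-6980 is about.
            }
        }
    }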
— Denis

> On Nov 20, 2017, at 2:53 PM, Denis Magda <dma...@apache.org> wrote:
>
> If an Ignite operation hangs for some reason, due to an internal problem or
> buggy application code, it needs to eventually *time out*.
>
> Take the atomic operations case that Val brought to our attention recently:
> http://apache-ignite-developers.2346864.n4.nabble.com/Timeouts-in-atomic-cache-td19839.html
>
> An application must not freeze waiting for human intervention if an
> atomic update fails internally.
>
> Even more, I would let all possible operations time out:
> - Ignite compute computations.
> - Ignite services calls.
> - Atomic/transactional cache updates.
> - SQL queries.
>
> I’m not sure this is covered by any of the tickets from IEP-7. Any
> thoughts/suggestions before the ticket is created?
>
> —
> Denis
>
>> On Nov 20, 2017, at 8:56 AM, Anton Vinogradov <avinogra...@gridgain.com> wrote:
>>
>> Dmitry,
>>
>> There are two cases:
>>
>> 1) The STW duration is long -> notify monitoring via a JMX metric.
>>
>> 2) The STW duration exceeds N seconds -> there is no need to wait any longer.
>> We already know that the node will be segmented, or that a pause longer than N
>> seconds will affect cluster performance.
>> It is better to kill the node ASAP to protect the cluster. Some customers
>> have huge timeouts, and such a node can kill the whole cluster if it is not
>> killed by a watchdog.
>>
>> On Mon, Nov 20, 2017 at 7:23 PM, Dmitry Pavlov <dpavlov....@gmail.com> wrote:
>>
>>> Hi Anton,
>>>
>>>> - GC STW duration exceeds the maximum possible length (the node should be
>>>> stopped before the STW finishes)
>>>
>>> Are you sure we should kill the node in case of a long STW? Could we produce
>>> warnings in logs and monitoring tools and wait a little longer for the node
>>> to become alive when we detect an STW? In that case we could notify the
>>> coordinator or another node that 'the current node is in STW, please wait
>>> longer than 3 heartbeat timeouts'.
>>>
>>> Isn't it likely that such pauses will occur only rarely?
>>>
>>> Sincerely,
>>> Dmitriy Pavlov
>>>
>>> On Mon, Nov 20, 2017 at 18:53, Anton Vinogradov <avinogra...@gridgain.com> wrote:
>>>
>>>> Igniters,
>>>>
>>>> Internal problems may, and unfortunately do, cause unexpected cluster
>>>> behavior.
>>>> We should define the behavior in case any internal problem happens.
>>>>
>>>> Well-known internal problems can be split into:
>>>>
>>>> 1) OOM or any other reason causing a node crash.
>>>>
>>>> 2) Situations requiring a graceful node shutdown with a custom notification:
>>>> - IgniteOutOfMemoryException
>>>> - Persistence errors
>>>> - ExchangeWorker exits with an error
>>>>
>>>> 3) Performance issues that should be covered by metrics:
>>>> - GC STW duration
>>>> - Timed-out tasks and jobs
>>>> - TX deadlocks
>>>> - Hanging TXs (waiting for some service)
>>>> - Java deadlocks
>>>>
>>>> I created a special issue [1] to make sure all these metrics will be
>>>> presented in WebConsole or VisorConsole (which is preferred?).
>>>>
>>>> 4) Situations requiring an external monitoring implementation:
>>>> - GC STW duration exceeds the maximum possible length (the node should be
>>>> stopped before the STW finishes)
>>>>
>>>> All these problems were reported by different people at different times,
>>>> so we should reanalyze each of them and possibly find better ways to
>>>> solve them than those described in the issues.
>>>>
>>>> P.S. IEP-7 [2] already contains 9 issues, feel free to mention something
>>>> else :)
>>>>
>>>> [1] https://issues.apache.org/jira/browse/IGNITE-6961
>>>> [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection
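A closing note on case 4 from the list above: an in-process watchdog only notices a pause after the JVM resumes, which is exactly why stopping a node *before* the STW finishes requires an external monitor. Cases 1 and 2, however, fit in a few lines of plain JDK code. A minimal sketch; the class name and thresholds are hypothetical, not Ignite code:

    import java.util.concurrent.TimeUnit;

    // A daemon thread sleeps for a short interval; any extra elapsed time
    // is attributed to a JVM-wide pause (GC STW or similar).
    public final class StwWatchdog extends Thread {
        private static final long INTERVAL_MS = 10;     // probe period
        private static final long WARN_MS     = 500;    // report to monitoring
        private static final long KILL_MS     = 30_000; // node is considered doomed

        /** Last observed long pause; a JMX MBean would read this for case 1. */
        public static volatile long lastPauseMs;

        public StwWatchdog() {
            setDaemon(true);
            setName("stw-watchdog");
        }

        @Override public void run() {
            long prev = System.nanoTime();

            while (true) {
                try {
                    Thread.sleep(INTERVAL_MS);
                }
                catch (InterruptedException ignored) {
                    return;
                }

                long now = System.nanoTime();
                long pauseMs = TimeUnit.NANOSECONDS.toMillis(now - prev) - INTERVAL_MS;
                prev = now;

                if (pauseMs >= WARN_MS)
                    lastPauseMs = pauseMs;          // case 1: surface to monitoring

                if (pauseMs >= KILL_MS)
                    Runtime.getRuntime().halt(130); // case 2: kill ASAP to protect the cluster
            }
        }
    }

Started with new StwWatchdog().start(), this covers the JMX metric and the hard kill; the "stop before the STW finishes" requirement still needs a separate process (or Dmitry's "wait longer" handshake), because halt() cannot run while the JVM itself is paused.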