The lack of suggestions and thoughts encouraged me to create a ticket: https://issues.apache.org/jira/browse/IGNITE-6980
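For anyone triaging the ticket: several of the operations from the list in my original mail below already expose timeout knobs, while atomic updates do not. A rough sketch of the existing knobs, assuming the Ignite 2.x public API (withTimeout, the txStart overload with an explicit timeout, SqlFieldsQuery.setTimeout); illustrative only, not a recommendation:

    import java.util.concurrent.TimeUnit;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.query.SqlFieldsQuery;
    import org.apache.ignite.transactions.Transaction;
    import org.apache.ignite.transactions.TransactionConcurrency;
    import org.apache.ignite.transactions.TransactionIsolation;

    public class TimeoutKnobs {
        public static void main(String[] args) {
            try (Ignite ignite = Ignition.start()) {
                // Compute: the whole task fails with a timeout after 5 s.
                ignite.compute().withTimeout(5_000).run(() -> { /* work */ });

                // Transactional cache updates: timeout passed to txStart().
                try (Transaction tx = ignite.transactions().txStart(
                        TransactionConcurrency.PESSIMISTIC,
                        TransactionIsolation.REPEATABLE_READ,
                        5_000,   // timeout, ms
                        0)) {    // txSize hint (0 = unknown)
                    // ... cache updates here ...
                    tx.commit();
                }

                // SQL: per-query timeout, checked while the query runs.
                SqlFieldsQuery qry = new SqlFieldsQuery("select 1")
                    .setTimeout(5, TimeUnit.SECONDS); // pass to cache.query(qry)

                // Atomic cache updates: no equivalent knob today. That gap
                // is exactly what IGNITE-6980 is about.
            }
        }
    }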
— Denis

> On Nov 20, 2017, at 2:53 PM, Denis Magda <dma...@apache.org> wrote:
>
> If an Ignite operation hangs for some reason, due to an internal problem or
> buggy application code, it needs to eventually *time out*.
>
> Take the atomic operations case that Val brought to our attention recently:
> http://apache-ignite-developers.2346864.n4.nabble.com/Timeouts-in-atomic-cache-td19839.html
>
> An application must not freeze waiting for human intervention if an
> atomic update fails internally.
>
> Even more, I would let all possible operations time out:
> - Ignite compute computations.
> - Ignite services calls.
> - Atomic/transactional cache updates.
> - SQL queries.
>
> I’m not sure this is covered by any of the tickets from IEP-7. Any
> thoughts/suggestions before the ticket is created?
>
> —
> Denis
>
>> On Nov 20, 2017, at 8:56 AM, Anton Vinogradov <avinogra...@gridgain.com> wrote:
>>
>> Dmitry,
>>
>> There are two cases:
>>
>> 1) The STW duration is long -> notify monitoring via a JMX metric.
>>
>> 2) The STW duration exceeds N seconds -> there is no need to wait any longer.
>> We already know that the node will be segmented, or that a pause longer than N
>> seconds will affect cluster performance.
>> It is better to kill the node ASAP to protect the cluster. Some customers
>> have huge timeouts, and such a node can kill the whole cluster if it is not
>> killed by a watchdog.
>>
>> On Mon, Nov 20, 2017 at 7:23 PM, Dmitry Pavlov <dpavlov....@gmail.com> wrote:
>>
>>> Hi Anton,
>>>
>>>> - GC STW duration exceeds the maximum possible length (the node should be
>>>> stopped before the STW finishes)
>>>
>>> Are you sure we should kill the node in case of a long STW? Could we produce
>>> warnings in logs and monitoring tools and wait a little longer for the node
>>> to become alive when we detect an STW? In that case we could notify the
>>> coordinator or another node that 'the current node is in STW, please wait
>>> longer than 3 heartbeat timeouts'.
>>>
>>> Isn't it likely that such pauses will occur only rarely?
>>>
>>> Sincerely,
>>> Dmitriy Pavlov
>>>
>>> On Mon, Nov 20, 2017 at 18:53, Anton Vinogradov <avinogra...@gridgain.com> wrote:
>>>
>>>> Igniters,
>>>>
>>>> Internal problems may, and unfortunately do, cause unexpected cluster
>>>> behavior.
>>>> We should define the behavior in case any internal problem happens.
>>>>
>>>> Well-known internal problems can be split into:
>>>>
>>>> 1) OOM or any other reason causing a node crash.
>>>>
>>>> 2) Situations requiring a graceful node shutdown with a custom notification:
>>>> - IgniteOutOfMemoryException
>>>> - Persistence errors
>>>> - ExchangeWorker exits with an error
>>>>
>>>> 3) Performance issues that should be covered by metrics:
>>>> - GC STW duration
>>>> - Timed-out tasks and jobs
>>>> - TX deadlocks
>>>> - Hanging TXs (waiting for some service)
>>>> - Java deadlocks
>>>>
>>>> I created a special issue [1] to make sure all these metrics will be
>>>> presented in WebConsole or VisorConsole (which is preferred?).
>>>>
>>>> 4) Situations requiring an external monitoring implementation:
>>>> - GC STW duration exceeds the maximum possible length (the node should be
>>>> stopped before the STW finishes)
>>>>
>>>> All these problems were reported by different people at different times,
>>>> so we should reanalyze each of them and possibly find better ways to
>>>> solve them than those described in the issues.
>>>>
>>>> P.S. IEP-7 [2] already contains 9 issues, feel free to mention something
>>>> else :)
>>>>
>>>> [1] https://issues.apache.org/jira/browse/IGNITE-6961
>>>> [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection
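A closing note on case 4 from the list above: an in-process watchdog only notices a pause after the JVM resumes, which is exactly why stopping a node *before* the STW finishes requires an external monitor. Cases 1 and 2, however, fit in a few lines of plain JDK code. A minimal sketch; the class name and thresholds are hypothetical, not Ignite code:

    import java.util.concurrent.TimeUnit;

    // A daemon thread sleeps for a short interval; any extra elapsed time
    // is attributed to a JVM-wide pause (GC STW or similar).
    public final class StwWatchdog extends Thread {
        private static final long INTERVAL_MS = 10;     // probe period
        private static final long WARN_MS     = 500;    // report to monitoring
        private static final long KILL_MS     = 30_000; // node is considered doomed

        /** Last observed long pause; a JMX MBean would read this for case 1. */
        public static volatile long lastPauseMs;

        public StwWatchdog() {
            setDaemon(true);
            setName("stw-watchdog");
        }

        @Override public void run() {
            long prev = System.nanoTime();

            while (true) {
                try {
                    Thread.sleep(INTERVAL_MS);
                }
                catch (InterruptedException ignored) {
                    return;
                }

                long now = System.nanoTime();
                long pauseMs = TimeUnit.NANOSECONDS.toMillis(now - prev) - INTERVAL_MS;
                prev = now;

                if (pauseMs >= WARN_MS)
                    lastPauseMs = pauseMs;          // case 1: surface to monitoring

                if (pauseMs >= KILL_MS)
                    Runtime.getRuntime().halt(130); // case 2: kill ASAP to protect the cluster
            }
        }
    }

Started with new StwWatchdog().start(), this covers the JMX metric and the hard kill; the "stop before the STW finishes" requirement still needs a separate process (or Dmitry's "wait longer" handshake), because halt() cannot run while the JVM itself is paused.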