[ 
https://issues.apache.org/jira/browse/IGNITE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Kuznetsov updated IGNITE-6587:
-------------------------------------
    Description: 
As described in [1], each Ignite node has a number of system-critical threads. 
We should implement a periodic check that calls the failure handler when one of 
the following conditions is detected:
# A critical thread is no longer alive.
# A critical thread remains in the BLOCKED state for a long time.

The current list of system-critical threads can be found at [1].

[1] 
https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling
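
The two conditions above could be checked roughly as follows. This is only a 
minimal sketch, not the actual Ignite implementation: the class name 
CriticalThreadWatchdog, the register/checkOnce methods, the maxBlockedMillis 
threshold, and the BiConsumer callback standing in for the failure handler are 
all assumptions made for illustration. The state of each thread is sampled, so 
a thread that unblocks and blocks again between two checks would still be 
treated as continuously blocked.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;

/** Hypothetical watchdog sketch; names and structure are assumptions. */
class CriticalThreadWatchdog {
    /** Critical thread -> time it was first sampled as BLOCKED, or -1 if not blocked. */
    private final Map<Thread, Long> blockedSince = new ConcurrentHashMap<>();

    /** How long a thread may stay BLOCKED before it is reported (assumed setting). */
    private final long maxBlockedMillis;

    /** Stand-in for the failure handler: receives the thread and a reason. */
    private final BiConsumer<Thread, String> failureHandler;

    CriticalThreadWatchdog(long maxBlockedMillis, BiConsumer<Thread, String> failureHandler) {
        this.maxBlockedMillis = maxBlockedMillis;
        this.failureHandler = failureHandler;
    }

    /** Adds a thread to the set of monitored critical threads. */
    void register(Thread thread) {
        blockedSince.put(thread, -1L);
    }

    /**
     * One pass of the periodic check. The current time is passed in so that
     * the logic can be exercised without sleeping.
     */
    void checkOnce(long nowMillis) {
        for (Map.Entry<Thread, Long> entry : blockedSince.entrySet()) {
            Thread thread = entry.getKey();
            Thread.State state = thread.getState();

            if (state == Thread.State.TERMINATED) {
                // Condition 1: the thread has finished. TERMINATED is used
                // instead of !isAlive() so that a thread that has not been
                // started yet is not reported.
                failureHandler.accept(thread, "TERMINATED");
            } else if (state == Thread.State.BLOCKED) {
                long since = entry.getValue();
                if (since < 0) {
                    // First time the thread is seen BLOCKED: remember when.
                    entry.setValue(nowMillis);
                } else if (nowMillis - since >= maxBlockedMillis) {
                    // Condition 2: BLOCKED on every sample for too long.
                    failureHandler.accept(thread, "BLOCKED");
                }
            } else {
                // The thread is making progress or waiting normally.
                entry.setValue(-1L);
            }
        }
    }
}
```

A periodic run could be arranged by calling checkOnce(System.currentTimeMillis()) 
from a ScheduledExecutorService.scheduleAtFixedRate task. Note that the sketch 
reports a failed thread on every pass; a real check would probably report it 
once.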

  was:
We need to come up with a 'watchdog service' to monitor the local health of an 
Ignite node and kill the process under certain critical conditions.
For example, if one of the mission-critical Ignite threads dies, the Ignite node 
must be stopped.
At first glance, the list of critical threads is:
disco-event-worker
tcp-disco-sock-reader
tcp-disco-srvr
tcp-disco-msg-worker
tcp-comm-worker
grid-nio-worker-tcp-comm
exchange-worker
sys-stripe
grid-timeout-worker
db-checkpoint-thread
wal-file-archiver
ttl-cleanup-worker
nio-acceptor

The mechanism should support pluggable components so that self-check can be 
extended via plugins.


> Ignite watchdog service
> -----------------------
>
>                 Key: IGNITE-6587
>                 URL: https://issues.apache.org/jira/browse/IGNITE-6587
>             Project: Ignite
>          Issue Type: Improvement
>          Components: general
>    Affects Versions: 2.2
>            Reporter: Alexey Goncharuk
>            Assignee: Andrey Gura
>            Priority: Major
>              Labels: IEP-5
>             Fix For: 2.6
>
>         Attachments: watchdog.sh
>
>
> As described in [1], each Ignite node has a number of system-critical 
> threads. We should implement a periodic check that calls the failure handler 
> when one of the following conditions is detected:
> # A critical thread is no longer alive.
> # A critical thread remains in the BLOCKED state for a long time.
> The current list of system-critical threads can be found at [1].
> [1] 
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
