I'm not disagreeing with you, Dmitriy.

What I'm trying to say is that if we assume that a serious enough bug or some
environmental issue prevents an Ignite node from functioning correctly, then it's
only logical to assume that the Ignite process is completely hosed (for example,
due to a very long stop-the-world (STW) pause) and can't make any progress at all.
In a situation like this the application can't reason about the process state, and
the process itself may not even be able to kill itself. The only reliable way
to handle cases like that is to have an external observer (a health monitoring
tool) that is not itself affected by the bug or the environmental issue and can
either make a decision by itself or send a notification to the SRE team.
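
To make the external-observer idea concrete, here is a minimal sketch in Java. It assumes a health probe that can be expressed as a Callable (for example a JMX read or an HTTP health check), and every name in it (ExternalWatchdog, Verdict, check) is hypothetical and not part of Ignite. The point it illustrates is that the observer runs outside the monitored process and treats a probe that does not answer in time the same way as a probe that fails.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical external watchdog; none of these names come from Ignite. */
final class ExternalWatchdog {
    enum Verdict { HEALTHY, SUSPECT, DEAD }

    private final Callable<Boolean> probe; // assumed: a JMX read, HTTP check, etc.
    private final long timeoutMillis;      // how long a single probe may take
    private final int maxMisses;           // consecutive misses before giving up
    private int misses;

    ExternalWatchdog(Callable<Boolean> probe, long timeoutMillis, int maxMisses) {
        this.probe = probe;
        this.timeoutMillis = timeoutMillis;
        this.maxMisses = maxMisses;
    }

    /** Runs one probe; a timeout, an exception, or a false result counts as a miss. */
    Verdict check() {
        // Daemon thread, so a probe that hangs forever cannot keep the watchdog alive.
        ExecutorService exec = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "watchdog-probe");
            t.setDaemon(true);
            return t;
        });
        Future<Boolean> result = exec.submit(probe);
        boolean ok;
        try {
            ok = Boolean.TRUE.equals(result.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            ok = false;
        } catch (ExecutionException | TimeoutException e) {
            ok = false; // a frozen node typically shows up here as a timeout
        } finally {
            exec.shutdownNow(); // interrupt the probe if it is still running
        }
        misses = ok ? 0 : misses + 1;
        if (ok) {
            return Verdict.HEALTHY;
        }
        return misses >= maxMisses ? Verdict.DEAD : Verdict.SUSPECT;
    }
}
```

A monitoring job could call check() on a schedule, notify the SRE team on SUSPECT, and leave the decision to kill the process on DEAD to whatever policy the operators prefer.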

In my previous post I only suggested going easy on the "cleverness" of the
self-monitoring implementation, as IMHO it won't be used much in production
environments. I think Ignite as it is already provides sufficient means of
monitoring its health (they may or may not be robust enough, which is a
different issue).

Regards
Andrey

________________________________
From: Dmitriy Setrakyan <dsetrak...@apache.org>
Sent: Wednesday, March 14, 2018 6:22 PM
To: dev@ignite.apache.org
Subject: Re: IEP-14: Ignite failures handling (Discussion)

On Wed, Mar 14, 2018 at 3:36 PM, Andrey Kornev <andrewkor...@hotmail.com>
wrote:

> If I were the one responsible for running Ignite-based applications (be it
> embedded or standalone Ignite) in my company's datacenter, I'd prefer the
> application nodes simply make their current state readily available to
> external tools (via JMX, health checks, etc.) and leave the decision of
> when to die and when to continue to run up to me. The last thing I need in
> production is an overly clever application that decides to kill itself based
> on its local (perhaps confused) state.
>
> Usually SRE teams build all sorts of technology-specific tools to monitor
> the health of their applications, and they like to be as much in control as
> possible when it comes to killing processes.
>
> I guess what I'm saying is this: keep things simple. Do not over-engineer.
> In real production environments, companies will most likely have this
> feature disabled (I know I would) and instead rely on their own tooling for
> handling failures.
>
>
Andrey, our priority should be to keep the cluster operational. If a frozen
Ignite node is kept around, the whole cluster becomes inoperable. I bet
this is not what you would prefer in production either. However, if we kill
the process, then the cluster should continue to operate.

We are talking about a distributed system in which a failure of one node
should not matter. If we want to keep this promise to the users, then we
must kill the process if an Ignite node freezes.

Also, keep in mind that we are talking about the "default" behavior. If you
are not happy with the "default" mode, then you will be able to configure
other behaviors, like keeping the frozen Ignite node around, if you like.
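
As a rough illustration of "default plus override", the sketch below uses Java with purely hypothetical names (FreezeSettings, FreezePolicy); it is not Ignite's configuration API. It only shows the shape of the idea: killing the process is the default, and an operator can explicitly opt into keeping the frozen node around.

```java
/** Hypothetical names for illustration; this is not Ignite's configuration API. */
final class FreezeSettings {
    enum FreezePolicy { KILL_PROCESS, KEEP_NODE }

    /** Default discussed in this thread: kill the frozen node's process. */
    static final FreezePolicy DEFAULT = FreezePolicy.KILL_PROCESS;

    private final FreezePolicy override; // null means "use the default"

    FreezeSettings(FreezePolicy override) {
        this.override = override;
    }

    /** Returns the operator's choice if one was made, otherwise the default. */
    FreezePolicy effectivePolicy() {
        return override != null ? override : DEFAULT;
    }
}
```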

D.
