I'm not disagreeing with you, Dmitriy. What I'm trying to say is that if we assume that a serious enough bug or some environmental issue prevents an Ignite node from functioning correctly, then it's only logical to assume that the Ignite process is completely hosed (for example, due to a very long stop-the-world (STW) pause) and can't make any progress at all. In a situation like this the application can't reason about the process state, and the process may not even be able to kill itself. The only reliable way to handle cases like that is to have an external observer (a health monitoring tool) that is not itself affected by the bug or the environmental issue and can either make a decision by itself or send a notification to the SRE team.
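
To make the external-observer idea concrete, here is a minimal sketch in Python. It assumes the observer runs as a separate process and calls `heartbeat()` each time a health probe of the node (JMX, an HTTP health check, etc.) succeeds. The `Watchdog` and `Action` names and the two thresholds are hypothetical and are not part of Ignite or of the IEP-14 proposal.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Action(Enum):
    NONE = "none"      # node looks healthy, do nothing
    NOTIFY = "notify"  # alert the SRE team and let them decide
    KILL = "kill"      # the observer itself terminates the node process


@dataclass
class Watchdog:
    """Hypothetical external observer; lives outside the monitored process."""
    notify_after: float  # seconds of silence before alerting (assumed value)
    kill_after: float    # seconds of silence before killing (assumed value)
    clock: Callable[[], float] = time.monotonic
    last_seen: Optional[float] = None

    def __post_init__(self) -> None:
        # Treat construction time as the first successful probe.
        if self.last_seen is None:
            self.last_seen = self.clock()

    def heartbeat(self) -> None:
        """Record that a health probe of the node just succeeded."""
        self.last_seen = self.clock()

    def decide(self) -> Action:
        """Choose an action based on how long the node has been silent."""
        silence = self.clock() - self.last_seen
        if silence >= self.kill_after:
            return Action.KILL
        if silence >= self.notify_after:
            return Action.NOTIFY
        return Action.NONE
```

Because the decision depends only on the observer's own clock and the time of the last successful probe, a node stuck in a long pause cannot prevent the observer from reacting. Whether the reaction is a notification or a kill stays under the operator's control through the two thresholds.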

In my previous post I only suggested going easy on the "cleverness" of the self-monitoring implementation, as IMHO it won't be used much in production environments. I think Ignite as it is already provides sufficient means of monitoring its health (they may or may not be robust enough, which is a different issue).

Regards
Andrey

________________________________
From: Dmitriy Setrakyan <dsetrak...@apache.org>
Sent: Wednesday, March 14, 2018 6:22 PM
To: dev@ignite.apache.org
Subject: Re: IEP-14: Ignite failures handling (Discussion)

On Wed, Mar 14, 2018 at 3:36 PM, Andrey Kornev <andrewkor...@hotmail.com> wrote:

> If I were the one responsible for running Ignite-based applications (be it
> embedded or standalone Ignite) in my company's datacenter, I'd prefer the
> application nodes simply make their current state readily available to
> external tools (via JMX, health checks, etc.) and leave the decision of
> when to die and when to continue to run up to me. The last thing I need in
> production is a too clever an application that decides to kill itself based
> on its local (perhaps confused) state.
>
> Usually SRE teams build all sorts of technology-specific tools to monitor
> health of the applications and they like to be as much in control as
> possible when it comes to killing processes.
>
> I guess what I'm saying is this: keep things simple. Do not over engineer.
> In real production environments the companies will most likely have this
> feature disabled (I know I would) and instead rely on their own tooling for
> handling failures.
>

Andrey, our priority should be to keep the cluster operational. If a frozen Ignite node is kept around, the whole cluster becomes inoperable. I bet this is not what you would prefer in production either. However, if we kill the process, then the cluster should continue to operate. We are talking about a distributed system in which a failure of one node should not matter.
If we want to keep this promise to the users, then we must kill the process if an Ignite node freezes. Also, keep in mind that we are talking about the "default" behavior. If you are not happy with the "default" mode, then you will be able to configure other behaviors, such as keeping the frozen Ignite node around, if you like.

D.