Re: IEP-14: Ignite failures handling (Discussion)

Dmitriy Setrakyan Wed, 14 Mar 2018 19:49:03 -0700

On Wed, Mar 14, 2018 at 7:12 PM, Andrey Kornev <andrewkor...@hotmail.com>
wrote:


> I'm not disagreeing with you, Dmitriy.
>
> What I'm trying to say is that if we assume that a serious enough bug or
> some environmental issue prevents Ignite node from functioning correctly,
> then it's only logical to assume that Ignite process is completely hosed
> (for example, due to a very very long STW pause) and can't make any
> progress at all. In a situation like this the application can't reason
> about the process state, and the process itself may not be able to even
> kill itself. The only reliable way to handle cases like that is to have an
> external observer (a health monitoring tool) that is not itself affected by
> the bug or the env issue and can either make a decision by itself or send a
> notification to the SRE team.
>

Agree about the external observers, but that is something a user should do
outside of Ignite.


> In my previous post I only suggest to go easy on the "cleverness" of the
> self-monitoring implementation as IMHO it won't be used much in production
> environment. I think Ignite as it is already provides sufficient means
> of monitoring its health (they may or may not be robust enough, which is a
> different issue).
>

The approach I am suggesting is pretty simple - "kill" the process in case
of a critical error. The only intelligence I would like to add is to
attempt shutting down the Ignite node gracefully before the "kill" is
executed. If a node is shutdown gracefully, then the restart procedure will
be faster, so it is worthwhile to try.

Some of the critical errors include running out of disk, memory, loosing
Ignite system threads, etc... These errors are truly unrecoverable from the
application stand point and should mostly be handled with a process restart
anyway.

D.

Re: IEP-14: Ignite failures handling (Discussion)

Reply via email to