Re: Automatic Handling of Long Stop-the-World Pauses

Pavel Kovalenko Mon, 02 Jul 2018 12:28:56 -0700

Denis,

I think a JVM can't easily help itself while it is in a stop-the-world (SW)
pause. Most solutions I have seen for handling such situations either check
heartbeats on other nodes or run a parallel supervisor process that can
detect that the JVM running Ignite is in an SW pause.
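
To illustrate the heartbeat idea, here is a minimal sketch in Java. It is
not Ignite code: the class HeartbeatMonitor, its methods, and the node ids
are hypothetical names made up for this example, and timestamps are passed
in explicitly to keep the logic easy to follow. It assumes some external
process (another node or a supervisor) receives the heartbeats, since the
paused JVM cannot run this check itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper for illustration only; not part of the Ignite API.
class HeartbeatMonitor {
    // Maximum allowed silence before a node is treated as suspect.
    private final long timeoutMillis;

    // Timestamp (in milliseconds) of the last heartbeat seen per node id.
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called by the observer whenever a heartbeat from a node arrives.
    void recordHeartbeat(String nodeId, long nowMillis) {
        lastSeen.put(nodeId, nowMillis);
    }

    // Returns the nodes whose last heartbeat is older than the timeout.
    List<String> suspectedNodes(long nowMillis) {
        List<String> suspects = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                suspects.add(e.getKey());
            }
        }
        return suspects;
    }
}
```

Note that a missed heartbeat alone does not prove a GC pause: a crashed
node or a network problem looks the same from the outside, so the observer
would need extra information (GC logs, for example) to tell them apart.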


2018-07-02 20:54 GMT+03:00 Denis Magda <dma...@apache.org>:

> Igniters,
>
> Pulling this discussion up. Any thoughts?
>
> --
> Denis
>
> On Thu, Jun 21, 2018 at 3:52 PM Denis Magda <dma...@apache.org> wrote:
>
> > Igniters,
> >
> > It's a pleasure to see how our project is evolving in the direction of
> > being a self-healing solution:
> >
> >    - Ignite can already handle critical failures such as OOM, File I/O
> >    issues, etc. [1]
> >    - There is an endeavor to fix cluster lock-ins due to partition map
> >    exchange issues. [2]
> >
> > There is one more notorious problem that might affect Ignite
> > deployments: long stop-the-world GC pauses.
> >
> > I know we made some progress in this direction [3] by providing
> > metrics that help to monitor the pauses. Why don't we keep up the pace
> > and teach Ignite to help itself when it sees a node that brings down
> > overall cluster performance due to an STP?
> >
> > I would create policies similar to the critical failures policies [4] or
> > just add a long STP to the list of critical failures and reuse existing
> > functionality.
> >
> > Thoughts? Anyone who'd like to implement the feature?
> >
> > [1] https://apacheignite.readme.io/docs/critical-failures-handling
> > [2] http://apache-ignite-developers.2346864.n4.nabble.com/IEP-25-Partition-Map-Exchange-hangs-resolving-td31819.html
> > [3] https://issues.apache.org/jira/browse/IGNITE-6171
> > [4] https://apacheignite.readme.io/docs/critical-failures-handling#section-failure-handling
> >
>
