I'm in favor of the proposal to add new JMX metrics and enhance the existing tooling. But I would encourage us to integrate this into the new metrics framework Nikolay has been working on. Otherwise, we will be deprecating these JMX metrics in a short time frame in favor of the new monitoring APIs.
- Denis

On Fri, Oct 4, 2019 at 9:33 AM Alexey Goncharuk <alexey.goncha...@gmail.com> wrote:

> I agree that we should have the ability to read any metric using simple Ignite tooling. I am not sure if visor.sh is a good fit - if I remember correctly, it will start a daemon node which will bump the topology version with all related consequences. I believe in the long term it will be beneficial to migrate all visor.sh functionality to a more lightweight protocol, such as the one used in control.sh.
>
> As for the metrics, the metric suggested by Ivan totally makes sense to me - it is a simple and, actually, quite critical metric. Selecting the minimum of some metric across all cache groups manually would be completely impractical. A monitoring system, on the other hand, might not be available when the metric is needed, or may not support aggregation.
>
> --AG
>
> On Fri, Oct 4, 2019 at 18:58, Ivan Rakov <ivan.glu...@gmail.com> wrote:
>
> > Nikolay,
> >
> > Many users start to use Ignite with a small project without production-level monitoring. When the proof of concept appears to be viable, they tend to expand their Ignite usage by growing the cluster and adding the needed environment (including monitoring systems). The inability to find out such a basic thing as whether the cluster will survive the next node crash may affect the overall product impression. We all want Ignite to be successful and widespread.
> >
> > > Can you clarify, what do you mean, exactly?
> >
> > Right now a user can access the metric mentioned by Alex and take the minimum over all cache groups. I want to highlight that not every user understands Ignite and its internals well enough to figure out that exactly this sequence of actions will bring them to the desired answer.
> >
> > > Can you clarify, what do you mean, exactly?
> > > We have a ticket[1] to support metrics output via visor.sh.
> > >
> > > My understanding: we should have an easy way to output metric values for each node in the cluster.
> > >
> > > [1] https://issues.apache.org/jira/browse/IGNITE-12191
> >
> > I propose to add a metric method for the aggregated "getMinimumNumberOfPartitionCopies" and expose it to control.sh. My understanding: its result is critical enough to be accessible via a short path. I've started this topic due to a request from the user list, and I've heard many similar complaints before.
> >
> > Best Regards,
> > Ivan Rakov
> >
> > On 04.10.2019 17:18, Nikolay Izhikov wrote:
> > > Ivan.
> > >
> > > > We shouldn't force users to configure external tools and write extra code for basic things.
> > >
> > > Actually, I don't agree with you. Having an external monitoring system for any production cluster is a *basic* thing.
> > >
> > > Can you, please, define "basic things"?
> > >
> > > > single method for the whole cluster
> > >
> > > Can you clarify, what do you mean, exactly? We have a ticket[1] to support metrics output via visor.sh.
> > >
> > > My understanding: we should have an easy way to output metric values for each node in the cluster.
> > >
> > > [1] https://issues.apache.org/jira/browse/IGNITE-12191
> > >
> > > On Fri, 04/10/2019 at 17:09 +0300, Ivan Rakov wrote:
> > > > Max,
> > > >
> > > > What if the user simply doesn't have a configured monitoring system? Knowing whether the cluster will survive a node shutdown is critical for any administrator who performs manipulations with the cluster topology. Essential information should be easily accessible. We shouldn't force users to configure external tools and write extra code for basic things.
> > > >
> > > > Alex,
> > > >
> > > > Thanks, that's exactly the metric we need. My point is that we should make it more accessible: via a control.sh command and a single method for the whole cluster.
> > > >
> > > > Best Regards,
> > > > Ivan Rakov
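The "single method for the whole cluster" Ivan is asking for can already be approximated by hand over JMX. A minimal sketch in Java, assuming a node-local platform MBean server; the object-name pattern and the attribute name below are assumptions and may differ between Ignite versions and instance names:

    import java.lang.management.ManagementFactory;

    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    public class MinPartitionCopies {
        public static void main(String[] args) throws Exception {
            MBeanServer srv = ManagementFactory.getPlatformMBeanServer();

            // Assumption: cache group MBeans are registered under a "Cache groups"
            // group in an "org.apache..." domain; both depend on the Ignite
            // version and the instance name.
            ObjectName pattern = new ObjectName("org.apache*:group=\"Cache groups\",*");

            int min = Integer.MAX_VALUE;

            for (ObjectName name : srv.queryNames(pattern, null)) {
                Number copies = (Number)srv.getAttribute(name, "MinimumNumberOfPartitionCopies");

                min = Math.min(min, copies.intValue());
            }

            if (min == Integer.MAX_VALUE)
                System.out.println("No cache group MBeans matched - check the object-name pattern.");
            else
                // Per Alex's observation below, up to (min - 1) nodes can leave safely.
                System.out.println("Cluster-wide minimum number of partition copies: " + min);
        }
    }

This is exactly the manual aggregation the thread argues a user should not have to write themselves.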
> > > > On 04.10.2019 16:34, Alex Plehanov wrote:
> > > > > Ivan, there already exists a metric, CacheGroupMetricsMXBean#getMinimumNumberOfPartitionCopies, which shows the current redundancy level for the cache group. We can lose up to (getMinimumNumberOfPartitionCopies - 1) nodes without data loss in this cache group.
> > > > >
> > > > > On Fri, Oct 4, 2019 at 16:17, Ivan Rakov <ivan.glu...@gmail.com> wrote:
> > > > >
> > > > > > Igniters,
> > > > > >
> > > > > > I've seen numerous requests for an easy way to check whether it is safe to turn off a cluster node. As we know, in Ignite protection from a sudden node shutdown is implemented by keeping several backup copies of each partition. However, this guarantee can be weakened for a while if the cluster has recently experienced a node restart and the rebalancing process is still in progress.
> > > > > >
> > > > > > An example scenario is restarting nodes one by one in order to update a local configuration parameter. The user restarts one node and rebalancing starts: once it completes, it is safe to proceed (backup count = 1). However, there's no transparent way to determine whether rebalancing is over.
> > > > > >
> > > > > > From my perspective, it would be very helpful to:
> > > > > >
> > > > > > 1) Add information about rebalancing and the number of free-to-go nodes to the ./control.sh --state command. Examples of output:
> > > > > >
> > > > > > > Cluster ID: 125a6dce-74b1-4ee7-a453-c58f23f1f8fc
> > > > > > > Cluster tag: new_tag
> > > > > > > --------------------------------------------------------------------------------
> > > > > > > Cluster is active
> > > > > > > All partitions are up-to-date.
> > > > > > > 3 node(s) can safely leave the cluster without partition loss.
> > > > > >
> > > > > > > Cluster ID: 125a6dce-74b1-4ee7-a453-c58f23f1f8fc
> > > > > > > Cluster tag: new_tag
> > > > > > > --------------------------------------------------------------------------------
> > > > > > > Cluster is active
> > > > > > > Rebalancing is in progress.
> > > > > > > 1 node(s) can safely leave the cluster without partition loss.
> > > > > >
> > > > > > 2) Provide the same information via ClusterMetrics. For example:
> > > > > > ClusterMetrics#isRebalanceInProgress // boolean
> > > > > > ClusterMetrics#getSafeToLeaveNodesCount // int
> > > > > >
> > > > > > Here I need to mention that this information can be calculated from the existing rebalance metrics (see CacheMetrics#*rebalance*). However, I still think that we need a simpler, more understandable flag for whether the cluster is in danger of data loss. Another point is that the current metrics are bound to a specific cache, which makes this information even harder to analyze.
> > > > > >
> > > > > > Thoughts?
> > > > > >
> > > > > > --
> > > > > > Best Regards,
> > > > > > Ivan Rakov
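To make Ivan's last point concrete, here is roughly what "calculated from existing rebalance metrics" means in practice. A minimal sketch over the existing CacheMetrics API; the proposed ClusterMetrics#isRebalanceInProgress and ClusterMetrics#getSafeToLeaveNodesCount do not exist yet and would replace exactly this kind of per-cache walk:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.cache.CacheMetrics;

    public class RebalanceCheck {
        /** True if any cache in the cluster still has partitions pending rebalance. */
        public static boolean rebalanceInProgress(Ignite ignite) {
            for (String cacheName : ignite.cacheNames()) {
                // Cluster-wide metrics for this one cache; the caller has to
                // repeat this for every cache instead of asking the cluster once.
                CacheMetrics m = ignite.cache(cacheName).metrics();

                if (m.getRebalancingPartitionsCount() > 0)
                    return true;
            }

            return false;
        }
    }

The "safe to leave" count would then follow from Alex's formula: the minimum of getMinimumNumberOfPartitionCopies over all cache groups, minus one.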