I agree that we should have the ability to read any metric using simple Ignite tooling. I am not sure that visor.sh is a good fit - if I remember correctly, it will start a daemon node, which will bump the topology version with all the related consequences. I believe that in the long term it will be beneficial to migrate all visor.sh functionality to a more lightweight protocol, such as the one used in control.sh.

As for the metrics, the metric suggested by Ivan totally makes sense to me - it is a simple and, actually, quite critical metric. Manually selecting the minimum of some metric across all cache groups is completely impractical. A monitoring system, on the other hand, might not be available when the metric is needed, or may not support aggregation. --AG
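
For reference, this is roughly what that manual aggregation looks like today over JMX. It is only a sketch: the ObjectName pattern and the "Cache groups" bean group name are assumptions that may differ between Ignite versions, and the class name is illustrative.

import java.lang.management.ManagementFactory;

import javax.management.MBeanServer;
import javax.management.ObjectName;

/**
 * Minimal sketch of the manual aggregation discussed above: take the minimum of the
 * per-cache-group MinimumNumberOfPartitionCopies values exposed over JMX on one node.
 * The cluster should then tolerate losing up to (min - 1) nodes without data loss.
 */
public class MinPartitionCopiesSketch {
    public static void main(String[] args) throws Exception {
        // Runs inside the node's JVM; for a remote node, connect via JMXConnectorFactory instead.
        MBeanServer srv = ManagementFactory.getPlatformMBeanServer();

        // Assumed naming: cache group MXBeans registered under the "Cache groups" group.
        // Adjust the pattern to the MBean layout of your Ignite version.
        ObjectName pattern = new ObjectName("org.apache*:group=\"Cache groups\",*");

        int min = Integer.MAX_VALUE;

        for (ObjectName name : srv.queryNames(pattern, null)) {
            // JMX attribute backing CacheGroupMetricsMXBean#getMinimumNumberOfPartitionCopies.
            int copies = (Integer)srv.getAttribute(name, "MinimumNumberOfPartitionCopies");

            min = Math.min(min, copies);
        }

        if (min == Integer.MAX_VALUE)
            System.out.println("No cache group beans found - check the ObjectName pattern.");
        else
            System.out.println("Minimum number of partition copies across cache groups: " + min);
    }
}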

Fri, Oct 4, 2019 at 18:58, Ivan Rakov <ivan.glu...@gmail.com>:

> Nikolay,
>
> Many users start to use Ignite with a small project without production-level monitoring. When the proof of concept appears to be viable, they tend to expand Ignite usage by growing the cluster and adding the needed environment (including monitoring systems).
> Inability to find such a basic thing as survival of the next node crash may affect the overall product impression. We all want Ignite to be successful and widespread.
>
> > Can you clarify, what do you mean, exactly?
>
> Right now a user can access the metric mentioned by Alex and choose the minimum over all cache groups. I want to highlight that not every user understands Ignite and its internals well enough to find out that exactly this sequence of actions will bring them to the desired answer.
>
> > Can you clarify, what do you mean, exactly?
> > We have a ticket[1] to support metrics output via visor.sh.
> >
> > My understanding: we should have an easy way to output metric values for each node in the cluster.
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-12191
>
> I propose to add a metric method for the aggregated "getMinimumNumberOfPartitionCopies" and expose it to control.sh.
> My understanding: its result is critical enough to be accessible via a short path. I've started this topic due to a request from the user list, and I've heard many similar complaints before.
>
> Best Regards,
> Ivan Rakov
>
> On 04.10.2019 17:18, Nikolay Izhikov wrote:
> > Ivan.
> >
> >> We shouldn't force users to configure external tools and write extra code for basic things.
> > Actually, I don't agree with you.
> > Having an external monitoring system for any production cluster is a *basic* thing.
> >
> > Can you, please, define "basic things"?
> >
> >> single method for the whole cluster
> > Can you clarify, what do you mean, exactly?
> > We have a ticket[1] to support metrics output via visor.sh.
> >
> > My understanding: we should have an easy way to output metric values for each node in the cluster.
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-12191
> >
> >
> > On Fri, 04/10/2019 at 17:09 +0300, Ivan Rakov wrote:
> >> Max,
> >>
> >> What if a user simply doesn't have a configured monitoring system?
> >> Knowing whether the cluster will survive a node shutdown is critical for any administrator that performs any manipulations with the cluster topology.
> >> Essential information should be easily accessible. We shouldn't force users to configure external tools and write extra code for basic things.
> >>
> >> Alex,
> >>
> >> Thanks, that's the exact metric we need.
> >> My point is that we should make it more accessible: via a control.sh command and a single method for the whole cluster.
> >>
> >> Best Regards,
> >> Ivan Rakov
> >>
> >> On 04.10.2019 16:34, Alex Plehanov wrote:
> >>> Ivan, there already exists a metric, CacheGroupMetricsMXBean#getMinimumNumberOfPartitionCopies, which shows the current redundancy level for the cache group.
> >>> We can lose up to (getMinimumNumberOfPartitionCopies - 1) nodes without data loss in this cache group.
> >>>
> >>> Fri, Oct 4, 2019 at 16:17, Ivan Rakov <ivan.glu...@gmail.com>:
> >>>
> >>>> Igniters,
> >>>>
> >>>> I've seen numerous requests to find an easy way to check whether it is safe to turn off a cluster node. As we know, in Ignite protection from sudden node shutdown is implemented by keeping several backup copies of each partition. However, this guarantee can be weakened for a while in case the cluster has recently experienced a node restart and the rebalancing process is still in progress.
> >>>> An example scenario is restarting nodes one by one in order to update a local configuration parameter. The user restarts one node and rebalancing starts: when it completes, it will be safe to proceed (backup count=1). However, there's no transparent way to determine whether rebalancing is over.
> >>>> From my perspective, it would be very helpful to:
> >>>> 1) Add information about rebalancing and the number of free-to-go nodes to the ./control.sh --state command.
> >>>> Examples of output:
> >>>>
> >>>>> Cluster ID: 125a6dce-74b1-4ee7-a453-c58f23f1f8fc
> >>>>> Cluster tag: new_tag
> >>>>> --------------------------------------------------------------------------------
> >>>>> Cluster is active
> >>>>> All partitions are up-to-date.
> >>>>> 3 node(s) can safely leave the cluster without partition loss.
> >>>>
> >>>>> Cluster ID: 125a6dce-74b1-4ee7-a453-c58f23f1f8fc
> >>>>> Cluster tag: new_tag
> >>>>> --------------------------------------------------------------------------------
> >>>>> Cluster is active
> >>>>> Rebalancing is in progress.
> >>>>> 1 node(s) can safely leave the cluster without partition loss.
> >>>>
> >>>> 2) Provide the same information via ClusterMetrics. For example:
> >>>> ClusterMetrics#isRebalanceInProgress // boolean
> >>>> ClusterMetrics#getSafeToLeaveNodesCount // int
> >>>>
> >>>> Here I need to mention that this information can be calculated from the existing rebalance metrics (see CacheMetrics#*rebalance*). However, I still think that we need a simpler and more understandable flag showing whether the cluster is in danger of data loss. Another point is that the current metrics are bound to a specific cache, which makes this information even harder to analyze.
> >>>>
> >>>> Thoughts?
> >>>>
> >>>> --
> >>>> Best Regards,
> >>>> Ivan Rakov
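
The proposed ClusterMetrics#isRebalanceInProgress and ClusterMetrics#getSafeToLeaveNodesCount do not exist yet. Below is only a rough sketch of the kind of manual check the proposal would replace, approximating the "rebalancing is in progress" flag from the existing per-cache rebalance metrics (CacheMetrics#*rebalance*, as Ivan mentions); it assumes cache statistics are enabled and uses a hypothetical client configuration path.

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMetrics;

/**
 * Sketch only: approximate "is rebalancing still running" from the existing per-cache
 * rebalance metrics, which is the manual work the proposed ClusterMetrics flag would remove.
 */
public class RebalanceCheckSketch {
    public static void main(String[] args) {
        // Hypothetical client configuration path; statistics must be enabled on the caches.
        try (Ignite ignite = Ignition.start("config/ignite-client.xml")) {
            boolean rebalancing = false;

            for (String cacheName : ignite.cacheNames()) {
                IgniteCache<?, ?> cache = ignite.cache(cacheName);

                // Cluster-wide view of the cache metrics.
                CacheMetrics metrics = cache.metrics();

                if (metrics.getRebalancingPartitionsCount() > 0) {
                    rebalancing = true;

                    break;
                }
            }

            System.out.println("Rebalancing in progress: " + rebalancing);
        }
    }
}

A real getSafeToLeaveNodesCount would additionally need the cluster-wide minimum of MinimumNumberOfPartitionCopies across cache groups, as in the JMX sketch earlier in this message.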