Ivan.

> We shouldn't force users to configure external tools and write extra
> code for basic things.
Actually, I don't agree with you. Having an external monitoring system for
any production cluster is a *basic* thing. Can you please define "basic
things"?

> single method for the whole cluster

Can you clarify what exactly you mean? We have a ticket [1] to support
metrics output via visor.sh.

My understanding: we should have an easy way to output metric values for
each node in the cluster; see the sketch below.

[1] https://issues.apache.org/jira/browse/IGNITE-12191
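For illustration, something like this is already possible with the public
Java API; a rough sketch only, where the config path and the particular
getters are placeholders, not a proposal:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cluster.ClusterMetrics;
import org.apache.ignite.cluster.ClusterNode;

public class PrintNodeMetrics {
    public static void main(String[] args) {
        // Join the running cluster as a client node (config path is an example).
        Ignition.setClientMode(true);

        try (Ignite ignite = Ignition.start("client-config.xml")) {
            // Iterate over server nodes and print a few per-node metric values;
            // any ClusterMetrics getter would work here.
            for (ClusterNode node : ignite.cluster().forServers().nodes()) {
                ClusterMetrics m = node.metrics();

                System.out.printf("node=%s cpus=%d heapUsed=%d%n",
                    node.id(), m.getTotalCpus(), m.getHeapMemoryUsed());
            }
        }
    }
}

The point of the ticket is to get the same per-node output from visor.sh
without writing such code.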
On Fri, 04/10/2019 at 17:09 +0300, Ivan Rakov wrote:
> Max,
>
> What if a user simply doesn't have a monitoring system configured?
> Knowing whether the cluster will survive a node shutdown is critical
> for any administrator who performs manipulations with the cluster
> topology. Essential information should be easily accessible. We
> shouldn't force users to configure external tools and write extra code
> for basic things.
>
> Alex,
>
> Thanks, that's exactly the metric we need.
> My point is that we should make it more accessible: via a control.sh
> command and a single method for the whole cluster.
>
> Best Regards,
> Ivan Rakov
>
> On 04.10.2019 16:34, Alex Plehanov wrote:
> > Ivan, there already exists a metric,
> > CacheGroupMetricsMXBean#getMinimumNumberOfPartitionCopies, which
> > shows the current redundancy level for the cache group. We can lose
> > up to (getMinimumNumberOfPartitionCopies - 1) nodes without data
> > loss in this cache group.
> >
> > On Fri, 4 Oct 2019 at 16:17, Ivan Rakov <ivan.glu...@gmail.com> wrote:
> >
> > > Igniters,
> > >
> > > I've seen numerous requests for an easy way to check whether it is
> > > safe to turn off a cluster node. As we know, in Ignite protection
> > > from a sudden node shutdown is implemented by keeping several
> > > backup copies of each partition. However, this guarantee can be
> > > weakened for a while if the cluster has recently experienced a
> > > node restart and the rebalancing process is still in progress.
> > > An example scenario is restarting nodes one by one in order to
> > > update a local configuration parameter. The user restarts one node
> > > and rebalancing starts: once it is completed, it is safe to
> > > proceed (backup count = 1). However, there's no transparent way to
> > > determine whether rebalancing is over.
> > > From my perspective, it would be very helpful to:
> > > 1) Add information about rebalancing and the number of free-to-go
> > > nodes to the ./control.sh --state command.
> > > Examples of output:
> > >
> > > > Cluster ID: 125a6dce-74b1-4ee7-a453-c58f23f1f8fc
> > > > Cluster tag: new_tag
> > > > ----------------------------------------------------------------
> > > > Cluster is active
> > > > All partitions are up-to-date.
> > > > 3 node(s) can safely leave the cluster without partition loss.
> > >
> > > > Cluster ID: 125a6dce-74b1-4ee7-a453-c58f23f1f8fc
> > > > Cluster tag: new_tag
> > > > ----------------------------------------------------------------
> > > > Cluster is active
> > > > Rebalancing is in progress.
> > > > 1 node(s) can safely leave the cluster without partition loss.
> > >
> > > 2) Provide the same information via ClusterMetrics. For example:
> > > ClusterMetrics#isRebalanceInProgress // boolean
> > > ClusterMetrics#getSafeToLeaveNodesCount // int
> > >
> > > Here I need to mention that this information can be calculated
> > > from existing rebalance metrics (see CacheMetrics#*rebalance*).
> > > However, I still think that we need a simpler and more
> > > understandable flag showing whether the cluster is in danger of
> > > data loss.
> > > Another point is that the current metrics are bound to a specific
> > > cache, which makes this information even harder to analyze.
> > >
> > > Thoughts?
> > >
> > > --
> > > Best Regards,
> > > Ivan Rakov
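P.S. As Ivan notes above, a go/no-go check can already be scripted from
the existing CacheMetrics#*rebalance* counters. A minimal sketch of that
workaround (the client config path is an assumption, and this is not an
official tool):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMetrics;

public class RebalanceCheck {
    public static void main(String[] args) {
        Ignition.setClientMode(true);

        try (Ignite ignite = Ignition.start("client-config.xml")) {
            boolean rebalancing = false;

            // A cache with partitions still waiting to be rebalanced means
            // it is not yet safe to stop another node. Assumes cache
            // statistics are enabled, otherwise metrics stay at zero.
            for (String cacheName : ignite.cacheNames()) {
                CacheMetrics m = ignite.cache(cacheName).metrics();

                if (m.getRebalancingPartitionsCount() > 0) {
                    System.out.println("Rebalance in progress: " + cacheName);
                    rebalancing = true;
                }
            }

            System.out.println(rebalancing
                ? "Rebalancing is in progress."
                : "All partitions are up-to-date.");
        }
    }
}

It also illustrates Ivan's point: the per-cache loop is exactly the kind
of extra code a single cluster-wide flag would remove.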