Ivan.

> We shouldn't force users to configure external tools and write extra
> code for basic things.
Actually, I don't agree with you. Having an external monitoring system for
any production cluster is a *basic* thing. Can you please define "basic
things"?

> single method for the whole cluster

Can you clarify what exactly you mean? We have a ticket [1] to support
metrics output via visor.sh.

My understanding: we should have an easy way to output metric values for
each node in the cluster; see the sketch below.

[1] https://issues.apache.org/jira/browse/IGNITE-12191
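For illustration, something like this is already possible with the public
Java API; a rough sketch only, where the config path and the particular
getters are placeholders, not a proposal:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cluster.ClusterMetrics;
import org.apache.ignite.cluster.ClusterNode;

public class PrintNodeMetrics {
    public static void main(String[] args) {
        // Join the running cluster as a client node (config path is an example).
        Ignition.setClientMode(true);

        try (Ignite ignite = Ignition.start("client-config.xml")) {
            // Iterate over server nodes and print a few per-node metric values;
            // any ClusterMetrics getter would work here.
            for (ClusterNode node : ignite.cluster().forServers().nodes()) {
                ClusterMetrics m = node.metrics();

                System.out.printf("node=%s cpus=%d heapUsed=%d%n",
                    node.id(), m.getTotalCpus(), m.getHeapMemoryUsed());
            }
        }
    }
}

The point of the ticket is to get the same per-node output from visor.sh
without writing such code.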
On Fri, 04/10/2019 at 17:09 +0300, Ivan Rakov wrote:
> Max,
>
> What if a user simply doesn't have a monitoring system configured?
> Knowing whether the cluster will survive a node shutdown is critical
> for any administrator who performs manipulations with the cluster
> topology. Essential information should be easily accessible. We
> shouldn't force users to configure external tools and write extra code
> for basic things.
>
> Alex,
>
> Thanks, that's exactly the metric we need.
> My point is that we should make it more accessible: via a control.sh
> command and a single method for the whole cluster.
>
> Best Regards,
> Ivan Rakov
>
> On 04.10.2019 16:34, Alex Plehanov wrote:
> > Ivan, there already exists a metric,
> > CacheGroupMetricsMXBean#getMinimumNumberOfPartitionCopies, which
> > shows the current redundancy level for the cache group. We can lose
> > up to (getMinimumNumberOfPartitionCopies - 1) nodes without data
> > loss in this cache group.
> >
> > On Fri, 4 Oct 2019 at 16:17, Ivan Rakov <ivan.glu...@gmail.com> wrote:
> >
> > > Igniters,
> > >
> > > I've seen numerous requests for an easy way to check whether it is
> > > safe to turn off a cluster node. As we know, in Ignite protection
> > > from a sudden node shutdown is implemented by keeping several
> > > backup copies of each partition. However, this guarantee can be
> > > weakened for a while if the cluster has recently experienced a
> > > node restart and the rebalancing process is still in progress.
> > > An example scenario is restarting nodes one by one in order to
> > > update a local configuration parameter. The user restarts one node
> > > and rebalancing starts: once it is completed, it is safe to
> > > proceed (backup count = 1). However, there's no transparent way to
> > > determine whether rebalancing is over.
> > > From my perspective, it would be very helpful to:
> > > 1) Add information about rebalancing and the number of free-to-go
> > > nodes to the ./control.sh --state command.
> > > Examples of output:
> > >
> > > > Cluster ID: 125a6dce-74b1-4ee7-a453-c58f23f1f8fc
> > > > Cluster tag: new_tag
> > > > ----------------------------------------------------------------
> > > > Cluster is active
> > > > All partitions are up-to-date.
> > > > 3 node(s) can safely leave the cluster without partition loss.
> > >
> > > > Cluster ID: 125a6dce-74b1-4ee7-a453-c58f23f1f8fc
> > > > Cluster tag: new_tag
> > > > ----------------------------------------------------------------
> > > > Cluster is active
> > > > Rebalancing is in progress.
> > > > 1 node(s) can safely leave the cluster without partition loss.
> > >
> > > 2) Provide the same information via ClusterMetrics. For example:
> > > ClusterMetrics#isRebalanceInProgress // boolean
> > > ClusterMetrics#getSafeToLeaveNodesCount // int
> > >
> > > Here I need to mention that this information can be calculated
> > > from existing rebalance metrics (see CacheMetrics#*rebalance*).
> > > However, I still think that we need a simpler and more
> > > understandable flag showing whether the cluster is in danger of
> > > data loss.
> > > Another point is that the current metrics are bound to a specific
> > > cache, which makes this information even harder to analyze.
> > >
> > > Thoughts?
> > >
> > > --
> > > Best Regards,
> > > Ivan Rakov
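P.S. As Ivan notes above, a go/no-go check can already be scripted from
the existing CacheMetrics#*rebalance* counters. A minimal sketch of that
workaround (the client config path is an assumption, and this is not an
official tool):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMetrics;

public class RebalanceCheck {
    public static void main(String[] args) {
        Ignition.setClientMode(true);

        try (Ignite ignite = Ignition.start("client-config.xml")) {
            boolean rebalancing = false;

            // A cache with partitions still waiting to be rebalanced means
            // it is not yet safe to stop another node. Assumes cache
            // statistics are enabled, otherwise metrics stay at zero.
            for (String cacheName : ignite.cacheNames()) {
                CacheMetrics m = ignite.cache(cacheName).metrics();

                if (m.getRebalancingPartitionsCount() > 0) {
                    System.out.println("Rebalance in progress: " + cacheName);
                    rebalancing = true;
                }
            }

            System.out.println(rebalancing
                ? "Rebalancing is in progress."
                : "All partitions are up-to-date.");
        }
    }
}

It also illustrates Ivan's point: the per-cache loop is exactly the kind
of extra code a single cluster-wide flag would remove.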