Ivan,

1. I think the rebalance cache metrics should be deprecated and removed
(someday). Here is the issue [1] for doing that.
2. I think #isRebalanceInProgress can and should be calculated by an
external monitoring system from values gathered from each online node:
#localMovingPartitionsCount > 0 (or the more precise
rebalancingPartitionsLeft value from issue [1]). Also, we should provide
ready-made templates for the common monitoring systems (Zabbix,
Prometheus, etc.). A minimal sketch of the per-node check is at the
bottom of this message.

[1] https://issues.apache.org/jira/browse/IGNITE-12183

On Fri, 4 Oct 2019 at 16:17, Ivan Rakov <ivan.glu...@gmail.com> wrote:

> Igniters,
>
> I've seen numerous requests for an easy way to check whether it is
> safe to turn off a cluster node. As we know, in Ignite protection from
> sudden node shutdown is implemented by keeping several backup
> copies of each partition. However, this guarantee can be weakened for a
> while if the cluster has recently experienced a node restart and the
> rebalancing process is still in progress.
> An example scenario is restarting nodes one by one in order to update a
> local configuration parameter. The user restarts one node and rebalancing
> starts: once it completes, it will be safe to proceed (backup
> count=1). However, there's no transparent way to determine whether
> rebalancing is over.
> From my perspective, it would be very helpful to:
> 1) Add information about rebalancing and the number of free-to-go nodes
> to the ./control.sh --state command.
> Examples of output:
>
> > Cluster ID: 125a6dce-74b1-4ee7-a453-c58f23f1f8fc
> > Cluster tag: new_tag
> > --------------------------------------------------------------------------------
> > Cluster is active
> > All partitions are up-to-date.
> > 3 node(s) can safely leave the cluster without partition loss.
>
> > Cluster ID: 125a6dce-74b1-4ee7-a453-c58f23f1f8fc
> > Cluster tag: new_tag
> > --------------------------------------------------------------------------------
> > Cluster is active
> > Rebalancing is in progress.
> > 1 node(s) can safely leave the cluster without partition loss.
>
> 2) Provide the same information via ClusterMetrics. For example:
> ClusterMetrics#isRebalanceInProgress // boolean
> ClusterMetrics#getSafeToLeaveNodesCount // int
>
> Here I need to mention that this information can be calculated from
> existing rebalance metrics (see CacheMetrics#*rebalance*). However, I
> still think that we need a simpler, more understandable flag showing
> whether the cluster is in danger of data loss. Another point is that the
> current metrics are bound to a specific cache, which makes this
> information even harder to analyze.
>
> Thoughts?
>
> --
> Best Regards,
> Ivan Rakov
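
For reference, here is a minimal sketch of the per-node check from point 2,
built on the existing per-cache rebalance metric
CacheMetrics#getRebalancingPartitionsCount rather than the proposed
#localMovingPartitionsCount / rebalancingPartitionsLeft values. The class
name, configuration path and aggregation logic are illustrative only, and
cache statistics must be enabled for the metrics to be populated:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.CacheMetrics;

    /** Illustrative helper: reports whether the local node still has partitions to rebalance. */
    public class LocalRebalanceCheck {
        /** Returns true if any cache on the local node reports partitions pending rebalance. */
        public static boolean isLocalRebalanceInProgress(Ignite ignite) {
            for (String cacheName : ignite.cacheNames()) {
                // Local (per-node) metrics; requires statistics to be enabled on the cache.
                CacheMetrics metrics = ignite.cache(cacheName).localMetrics();

                // Partitions that are still being rebalanced to this node.
                if (metrics.getRebalancingPartitionsCount() > 0)
                    return true;
            }

            return false;
        }

        public static void main(String[] args) {
            // "config/ignite-config.xml" is a placeholder path.
            try (Ignite ignite = Ignition.start("config/ignite-config.xml")) {
                System.out.println("Rebalance in progress on local node: "
                    + isLocalRebalanceInProgress(ignite));
            }
        }
    }

A monitoring template would then expose this boolean (or the raw partition
counter) as a per-node gauge and raise the cluster-wide
#isRebalanceInProgress flag whenever any online node reports a non-zero
value.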