Ivan,

1. I think the rebalance cache metrics should be deprecated and
eventually removed. Issue [1] covers this work.

2. I think #isRebalanceInProgress can and should be calculated by an
external monitoring system from per-node values: the cluster is
rebalancing if #localMovingPartitionsCount > 0 (or, more precisely,
rebalancingPartitionsLeft > 0 from issue [1]) on any online node.
We should also provide ready-made templates for this check for each
monitoring system (Zabbix, Prometheus, etc.).
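
To illustrate item 2, here is a minimal sketch of the aggregation, assuming the monitoring system has already collected the latest localMovingPartitionsCount value from every online node. The class name, method name, and node IDs are hypothetical and are not part of the Ignite API; how the values are actually scraped (JMX, an exporter, etc.) is left out.

```java
import java.util.Map;

// Hypothetical helper, not an Ignite class: derives a cluster-wide
// "rebalance in progress" flag from per-node metric samples.
final class RebalanceCheck {
    private RebalanceCheck() {}

    /**
     * @param movingPartitionsByNode node ID -> last sampled value of
     *        localMovingPartitionsCount on that node (assumed to be
     *        collected by the external monitoring system).
     * @return true if at least one node still reports moving partitions.
     */
    static boolean isRebalanceInProgress(Map<String, Long> movingPartitionsByNode) {
        for (long count : movingPartitionsByNode.values()) {
            if (count > 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up sample values for three online nodes.
        Map<String, Long> sample = Map.of("node-1", 0L, "node-2", 12L, "node-3", 0L);
        System.out.println(isRebalanceInProgress(sample));
    }
}
```

A Zabbix or Prometheus template would express the same rule declaratively, e.g. as an alert on the maximum of the per-node values being greater than zero.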

[1] https://issues.apache.org/jira/browse/IGNITE-12183

On Fri, 4 Oct 2019 at 16:17, Ivan Rakov <ivan.glu...@gmail.com> wrote:
>
> Igniters,
>
> I've seen numerous requests for an easy way to check whether it is
> safe to turn off a cluster node. As we know, Ignite protects against
> sudden node shutdown by keeping several backup copies of each
> partition. However, this guarantee can be weakened for a while if the
> cluster has recently experienced a node restart and the rebalancing
> process is still in progress.
> An example scenario is restarting nodes one by one in order to update a
> local configuration parameter. The user restarts one node and
> rebalancing starts: once it completes, it will be safe to proceed
> (backup count=1). However, there's no transparent way to determine
> whether rebalancing is over.
> From my perspective, it would be very helpful to:
> 1) Add information about rebalancing and the number of free-to-go
> nodes to the output of the ./control.sh --state command.
> Examples of output:
>
> > Cluster  ID: 125a6dce-74b1-4ee7-a453-c58f23f1f8fc
> > Cluster tag: new_tag
> > --------------------------------------------------------------------------------
> > Cluster is active
> > All partitions are up-to-date.
> > 3 node(s) can safely leave the cluster without partition loss.
>
> > Cluster  ID: 125a6dce-74b1-4ee7-a453-c58f23f1f8fc
> > Cluster tag: new_tag
> > --------------------------------------------------------------------------------
> > Cluster is active
> > Rebalancing is in progress.
> > 1 node(s) can safely leave the cluster without partition loss.
> 2) Provide the same information via ClusterMetrics. For example:
> ClusterMetrics#isRebalanceInProgress // boolean
> ClusterMetrics#getSafeToLeaveNodesCount // int
>
> Here I need to mention that this information can be calculated from
> existing rebalance metrics (see CacheMetrics#*rebalance*). However, I
> still think that we need a simpler and more understandable flag that
> shows whether the cluster is in danger of data loss. Another point is
> that the current metrics are bound to a specific cache, which makes
> this information even harder to analyze.
>
> Thoughts?
>
> --
> Best Regards,
> Ivan Rakov
>
