I agree that we should have the ability to read any metric using simple Ignite tooling. I am not sure that visor.sh is a good fit - if I remember correctly, it will start a daemon node, which will bump the topology version with all the related consequences. I believe that in the long term it will be beneficial to migrate all visor.sh functionality to a more lightweight protocol, such as the one used in control.sh.

As for the metrics, the metric suggested by Ivan totally makes sense to me - it is a simple and, actually, quite critical metric. Manually selecting the minimum of some metric across all cache groups is completely impractical. A monitoring system, on the other hand, might not be available when the metric is needed, or may not support aggregation. --AG
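
For reference, this is roughly what that manual aggregation looks like today over JMX. It is only a sketch: the ObjectName pattern and the "Cache groups" bean group name are assumptions that may differ between Ignite versions, and the class name is illustrative.

import java.lang.management.ManagementFactory;

import javax.management.MBeanServer;
import javax.management.ObjectName;

/**
 * Minimal sketch of the manual aggregation discussed above: take the minimum of the
 * per-cache-group MinimumNumberOfPartitionCopies values exposed over JMX on one node.
 * The cluster should then tolerate losing up to (min - 1) nodes without data loss.
 */
public class MinPartitionCopiesSketch {
    public static void main(String[] args) throws Exception {
        // Runs inside the node's JVM; for a remote node, connect via JMXConnectorFactory instead.
        MBeanServer srv = ManagementFactory.getPlatformMBeanServer();

        // Assumed naming: cache group MXBeans registered under the "Cache groups" group.
        // Adjust the pattern to the MBean layout of your Ignite version.
        ObjectName pattern = new ObjectName("org.apache*:group=\"Cache groups\",*");

        int min = Integer.MAX_VALUE;

        for (ObjectName name : srv.queryNames(pattern, null)) {
            // JMX attribute backing CacheGroupMetricsMXBean#getMinimumNumberOfPartitionCopies.
            int copies = (Integer)srv.getAttribute(name, "MinimumNumberOfPartitionCopies");

            min = Math.min(min, copies);
        }

        if (min == Integer.MAX_VALUE)
            System.out.println("No cache group beans found - check the ObjectName pattern.");
        else
            System.out.println("Minimum number of partition copies across cache groups: " + min);
    }
}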

Fri, Oct 4, 2019 at 18:58, Ivan Rakov <ivan.glu...@gmail.com>:

> Nikolay,
>
> Many users start to use Ignite with a small project without production-level monitoring. When the proof of concept appears to be viable, they tend to expand Ignite usage by growing the cluster and adding the needed environment (including monitoring systems).
> Inability to find such a basic thing as survival of the next node crash may affect the overall product impression. We all want Ignite to be successful and widespread.
>
> > Can you clarify, what do you mean, exactly?
>
> Right now a user can access the metric mentioned by Alex and choose the minimum over all cache groups. I want to highlight that not every user understands Ignite and its internals well enough to find out that exactly this sequence of actions will bring them to the desired answer.
>
> > Can you clarify, what do you mean, exactly?
> > We have a ticket[1] to support metrics output via visor.sh.
> >
> > My understanding: we should have an easy way to output metric values for each node in the cluster.
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-12191
>
> I propose to add a metric method for the aggregated "getMinimumNumberOfPartitionCopies" and expose it to control.sh.
> My understanding: its result is critical enough to be accessible via a short path. I've started this topic due to a request from the user list, and I've heard many similar complaints before.
>
> Best Regards,
> Ivan Rakov
>
> On 04.10.2019 17:18, Nikolay Izhikov wrote:
> > Ivan.
> >
> >> We shouldn't force users to configure external tools and write extra code for basic things.
> > Actually, I don't agree with you.
> > Having an external monitoring system for any production cluster is a *basic* thing.
> >
> > Can you, please, define "basic things"?
> >
> >> single method for the whole cluster
> > Can you clarify, what do you mean, exactly?
> > We have a ticket[1] to support metrics output via visor.sh.
> >
> > My understanding: we should have an easy way to output metric values for each node in the cluster.
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-12191
> >
> >
> > On Fri, 04/10/2019 at 17:09 +0300, Ivan Rakov wrote:
> >> Max,
> >>
> >> What if a user simply doesn't have a configured monitoring system?
> >> Knowing whether the cluster will survive a node shutdown is critical for any administrator that performs any manipulations with the cluster topology.
> >> Essential information should be easily accessible. We shouldn't force users to configure external tools and write extra code for basic things.
> >>
> >> Alex,
> >>
> >> Thanks, that's the exact metric we need.
> >> My point is that we should make it more accessible: via a control.sh command and a single method for the whole cluster.
> >>
> >> Best Regards,
> >> Ivan Rakov
> >>
> >> On 04.10.2019 16:34, Alex Plehanov wrote:
> >>> Ivan, there already exists a metric, CacheGroupMetricsMXBean#getMinimumNumberOfPartitionCopies, which shows the current redundancy level for the cache group.
> >>> We can lose up to (getMinimumNumberOfPartitionCopies - 1) nodes without data loss in this cache group.
> >>>
> >>> Fri, Oct 4, 2019 at 16:17, Ivan Rakov <ivan.glu...@gmail.com>:
> >>>
> >>>> Igniters,
> >>>>
> >>>> I've seen numerous requests to find an easy way to check whether it is safe to turn off a cluster node. As we know, in Ignite protection from sudden node shutdown is implemented by keeping several backup copies of each partition. However, this guarantee can be weakened for a while in case the cluster has recently experienced a node restart and the rebalancing process is still in progress.
> >>>> An example scenario is restarting nodes one by one in order to update a local configuration parameter. The user restarts one node and rebalancing starts: when it completes, it will be safe to proceed (backup count=1). However, there's no transparent way to determine whether rebalancing is over.
> >>>> From my perspective, it would be very helpful to:
> >>>> 1) Add information about rebalancing and the number of free-to-go nodes to the ./control.sh --state command.
> >>>> Examples of output:
> >>>>
> >>>>> Cluster ID: 125a6dce-74b1-4ee7-a453-c58f23f1f8fc
> >>>>> Cluster tag: new_tag
> >>>>> --------------------------------------------------------------------------------
> >>>>> Cluster is active
> >>>>> All partitions are up-to-date.
> >>>>> 3 node(s) can safely leave the cluster without partition loss.
> >>>>
> >>>>> Cluster ID: 125a6dce-74b1-4ee7-a453-c58f23f1f8fc
> >>>>> Cluster tag: new_tag
> >>>>> --------------------------------------------------------------------------------
> >>>>> Cluster is active
> >>>>> Rebalancing is in progress.
> >>>>> 1 node(s) can safely leave the cluster without partition loss.
> >>>>
> >>>> 2) Provide the same information via ClusterMetrics. For example:
> >>>> ClusterMetrics#isRebalanceInProgress // boolean
> >>>> ClusterMetrics#getSafeToLeaveNodesCount // int
> >>>>
> >>>> Here I need to mention that this information can be calculated from the existing rebalance metrics (see CacheMetrics#*rebalance*). However, I still think that we need a simpler and more understandable flag showing whether the cluster is in danger of data loss. Another point is that the current metrics are bound to a specific cache, which makes this information even harder to analyze.
> >>>>
> >>>> Thoughts?
> >>>>
> >>>> --
> >>>> Best Regards,
> >>>> Ivan Rakov
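
The proposed ClusterMetrics#isRebalanceInProgress and ClusterMetrics#getSafeToLeaveNodesCount do not exist yet. Below is only a rough sketch of the kind of manual check the proposal would replace, approximating the "rebalancing is in progress" flag from the existing per-cache rebalance metrics (CacheMetrics#*rebalance*, as Ivan mentions); it assumes cache statistics are enabled and uses a hypothetical client configuration path.

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMetrics;

/**
 * Sketch only: approximate "is rebalancing still running" from the existing per-cache
 * rebalance metrics, which is the manual work the proposed ClusterMetrics flag would remove.
 */
public class RebalanceCheckSketch {
    public static void main(String[] args) {
        // Hypothetical client configuration path; statistics must be enabled on the caches.
        try (Ignite ignite = Ignition.start("config/ignite-client.xml")) {
            boolean rebalancing = false;

            for (String cacheName : ignite.cacheNames()) {
                IgniteCache<?, ?> cache = ignite.cache(cacheName);

                // Cluster-wide view of the cache metrics.
                CacheMetrics metrics = cache.metrics();

                if (metrics.getRebalancingPartitionsCount() > 0) {
                    rebalancing = true;

                    break;
                }
            }

            System.out.println("Rebalancing in progress: " + rebalancing);
        }
    }
}

A real getSafeToLeaveNodesCount would additionally need the cluster-wide minimum of MinimumNumberOfPartitionCopies across cache groups, as in the JMX sketch earlier in this message.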