I'm in favor of the proposal to add new JMX metrics and enhance the existing tooling. But I would encourage us to integrate this into the new metrics framework Nikolay has been working on. Otherwise, we will be deprecating these JMX metrics in a short time frame in favor of the new monitoring APIs.
- Denis

On Fri, Oct 4, 2019 at 9:33 AM Alexey Goncharuk <alexey.goncha...@gmail.com> wrote:

> I agree that we should have the ability to read any metric using simple Ignite tooling. I am not sure if visor.sh is a good fit - if I remember correctly, it will start a daemon node which will bump the topology version with all related consequences. I believe in the long term it will be beneficial to migrate all visor.sh functionality to a more lightweight protocol, such as the one used in control.sh.
>
> As for the metrics, the metric suggested by Ivan totally makes sense to me - it is a simple and, actually, quite critical metric. Selecting the minimum of some metric across all cache groups manually would be completely impractical. A monitoring system, on the other hand, might not be available when the metric is needed, or may not support aggregation.
>
> --AG
>
> On Fri, Oct 4, 2019 at 18:58, Ivan Rakov <ivan.glu...@gmail.com> wrote:
>
> > Nikolay,
> >
> > Many users start to use Ignite with a small project without production-level monitoring. When the proof of concept appears to be viable, they tend to expand their Ignite usage by growing the cluster and adding the needed environment (including monitoring systems). The inability to find out such a basic thing as whether the cluster will survive the next node crash may affect the overall product impression. We all want Ignite to be successful and widespread.
> >
> > > Can you clarify, what do you mean, exactly?
> >
> > Right now a user can access the metric mentioned by Alex and take the minimum over all cache groups. I want to highlight that not every user understands Ignite and its internals well enough to figure out that exactly this sequence of actions will bring them to the desired answer.
> >
> > > Can you clarify, what do you mean, exactly?
> > > We have a ticket[1] to support metrics output via visor.sh.
> > >
> > > My understanding: we should have an easy way to output metric values for each node in the cluster.
> > >
> > > [1] https://issues.apache.org/jira/browse/IGNITE-12191
> >
> > I propose to add a metric method for the aggregated "getMinimumNumberOfPartitionCopies" and expose it to control.sh. My understanding: its result is critical enough to be accessible via a short path. I've started this topic due to a request from the user list, and I've heard many similar complaints before.
> >
> > Best Regards,
> > Ivan Rakov
> >
> > On 04.10.2019 17:18, Nikolay Izhikov wrote:
> > > Ivan.
> > >
> > > > We shouldn't force users to configure external tools and write extra code for basic things.
> > >
> > > Actually, I don't agree with you. Having an external monitoring system for any production cluster is a *basic* thing.
> > >
> > > Can you, please, define "basic things"?
> > >
> > > > single method for the whole cluster
> > >
> > > Can you clarify, what do you mean, exactly? We have a ticket[1] to support metrics output via visor.sh.
> > >
> > > My understanding: we should have an easy way to output metric values for each node in the cluster.
> > >
> > > [1] https://issues.apache.org/jira/browse/IGNITE-12191
> > >
> > > On Fri, 04/10/2019 at 17:09 +0300, Ivan Rakov wrote:
> > > > Max,
> > > >
> > > > What if the user simply doesn't have a configured monitoring system? Knowing whether the cluster will survive a node shutdown is critical for any administrator who performs manipulations with the cluster topology. Essential information should be easily accessible. We shouldn't force users to configure external tools and write extra code for basic things.
> > > >
> > > > Alex,
> > > >
> > > > Thanks, that's exactly the metric we need. My point is that we should make it more accessible: via a control.sh command and a single method for the whole cluster.
> > > >
> > > > Best Regards,
> > > > Ivan Rakov
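The "single method for the whole cluster" Ivan is asking for can already be approximated by hand over JMX. A minimal sketch in Java, assuming a node-local platform MBean server; the object-name pattern and the attribute name below are assumptions and may differ between Ignite versions and instance names:

    import java.lang.management.ManagementFactory;

    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    public class MinPartitionCopies {
        public static void main(String[] args) throws Exception {
            MBeanServer srv = ManagementFactory.getPlatformMBeanServer();

            // Assumption: cache group MBeans are registered under a "Cache groups"
            // group in an "org.apache..." domain; both depend on the Ignite
            // version and the instance name.
            ObjectName pattern = new ObjectName("org.apache*:group=\"Cache groups\",*");

            int min = Integer.MAX_VALUE;

            for (ObjectName name : srv.queryNames(pattern, null)) {
                Number copies = (Number)srv.getAttribute(name, "MinimumNumberOfPartitionCopies");

                min = Math.min(min, copies.intValue());
            }

            if (min == Integer.MAX_VALUE)
                System.out.println("No cache group MBeans matched - check the object-name pattern.");
            else
                // Per Alex's observation below, up to (min - 1) nodes can leave safely.
                System.out.println("Cluster-wide minimum number of partition copies: " + min);
        }
    }

This is exactly the manual aggregation the thread argues a user should not have to write themselves.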
> > > > On 04.10.2019 16:34, Alex Plehanov wrote:
> > > > > Ivan, there already exists a metric, CacheGroupMetricsMXBean#getMinimumNumberOfPartitionCopies, which shows the current redundancy level for the cache group. We can lose up to (getMinimumNumberOfPartitionCopies - 1) nodes without data loss in this cache group.
> > > > >
> > > > > On Fri, Oct 4, 2019 at 16:17, Ivan Rakov <ivan.glu...@gmail.com> wrote:
> > > > >
> > > > > > Igniters,
> > > > > >
> > > > > > I've seen numerous requests for an easy way to check whether it is safe to turn off a cluster node. As we know, in Ignite protection from a sudden node shutdown is implemented by keeping several backup copies of each partition. However, this guarantee can be weakened for a while if the cluster has recently experienced a node restart and the rebalancing process is still in progress.
> > > > > >
> > > > > > An example scenario is restarting nodes one by one in order to update a local configuration parameter. The user restarts one node and rebalancing starts: once it completes, it is safe to proceed (backup count = 1). However, there's no transparent way to determine whether rebalancing is over.
> > > > > >
> > > > > > From my perspective, it would be very helpful to:
> > > > > >
> > > > > > 1) Add information about rebalancing and the number of free-to-go nodes to the ./control.sh --state command. Examples of output:
> > > > > >
> > > > > > > Cluster ID: 125a6dce-74b1-4ee7-a453-c58f23f1f8fc
> > > > > > > Cluster tag: new_tag
> > > > > > > --------------------------------------------------------------------------------
> > > > > > > Cluster is active
> > > > > > > All partitions are up-to-date.
> > > > > > > 3 node(s) can safely leave the cluster without partition loss.
> > > > > >
> > > > > > > Cluster ID: 125a6dce-74b1-4ee7-a453-c58f23f1f8fc
> > > > > > > Cluster tag: new_tag
> > > > > > > --------------------------------------------------------------------------------
> > > > > > > Cluster is active
> > > > > > > Rebalancing is in progress.
> > > > > > > 1 node(s) can safely leave the cluster without partition loss.
> > > > > >
> > > > > > 2) Provide the same information via ClusterMetrics. For example:
> > > > > > ClusterMetrics#isRebalanceInProgress // boolean
> > > > > > ClusterMetrics#getSafeToLeaveNodesCount // int
> > > > > >
> > > > > > Here I need to mention that this information can be calculated from the existing rebalance metrics (see CacheMetrics#*rebalance*). However, I still think that we need a simpler, more understandable flag for whether the cluster is in danger of data loss. Another point is that the current metrics are bound to a specific cache, which makes this information even harder to analyze.
> > > > > >
> > > > > > Thoughts?
> > > > > >
> > > > > > --
> > > > > > Best Regards,
> > > > > > Ivan Rakov
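To make Ivan's last point concrete, here is roughly what "calculated from existing rebalance metrics" means in practice. A minimal sketch over the existing CacheMetrics API; the proposed ClusterMetrics#isRebalanceInProgress and ClusterMetrics#getSafeToLeaveNodesCount do not exist yet and would replace exactly this kind of per-cache walk:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.cache.CacheMetrics;

    public class RebalanceCheck {
        /** True if any cache in the cluster still has partitions pending rebalance. */
        public static boolean rebalanceInProgress(Ignite ignite) {
            for (String cacheName : ignite.cacheNames()) {
                // Cluster-wide metrics for this one cache; the caller has to
                // repeat this for every cache instead of asking the cluster once.
                CacheMetrics m = ignite.cache(cacheName).metrics();

                if (m.getRebalancingPartitionsCount() > 0)
                    return true;
            }

            return false;
        }
    }

The "safe to leave" count would then follow from Alex's formula: the minimum of getMinimumNumberOfPartitionCopies over all cache groups, minus one.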