Re: [DISCUSSION] Add index rebuild time metrics

Ivan Rakov Wed, 12 Aug 2020 01:32:43 -0700

I seem to be in the minority here :)
Fine, let's make it as clear as possible which metric method
(localCacheSize) should be called in order to retrieve a 100% progress
milestone.
I've left comments in the PR.
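
For illustration only, here is a minimal sketch of the calculation a monitoring
consumer would have to do if we expose processedKeys and localCacheSize as two
separate values. The class and method names below are hypothetical and are not
part of the Ignite API; the inputs are assumed to have been read from the
node-local metrics by some external means. The second method applies the
"keys per minute" idea from the thread to get a rough remaining-time estimate.

```java
import java.time.Duration;
import java.util.Optional;

/** Hypothetical consumer-side helper; not part of the Ignite API. */
final class IndexRebuildProgress {
    private IndexRebuildProgress() {
    }

    /** Fraction of node-local keys already processed, clamped to [0, 1]. */
    static double progress(long processedKeys, long localCacheSize) {
        if (localCacheSize <= 0)
            return 1.0; // Assumption: an empty local cache counts as fully rebuilt.

        double ratio = (double) processedKeys / localCacheSize;

        return Math.max(0.0, Math.min(1.0, ratio));
    }

    /**
     * Rough remaining time derived from two samples of the processed-keys counter.
     * Returns empty if no progress was observed between the samples.
     */
    static Optional<Duration> estimateRemaining(long prevProcessed, long curProcessed,
                                                Duration sampleInterval, long localCacheSize) {
        long delta = curProcessed - prevProcessed;
        long intervalMs = sampleInterval.toMillis();

        if (delta <= 0 || intervalMs <= 0)
            return Optional.empty();

        long remainingKeys = Math.max(0L, localCacheSize - curProcessed);

        // remainingKeys / (delta / intervalMs), rearranged to avoid a tiny divisor.
        long remainingMs = (long) Math.ceil((double) remainingKeys * intervalMs / delta);

        return Optional.of(Duration.ofMillis(remainingMs));
    }

    public static void main(String[] args) {
        // Numbers from the thread: 50_000 keys per minute, 1_000_000_000 keys in total.
        System.out.println(progress(50_000L, 1_000_000_000L));
        System.out.println(estimateRemaining(0L, 50_000L, Duration.ofMinutes(1), 1_000_000_000L));
    }
}
```

With the numbers mentioned earlier in the thread (50_000 keys per minute out of
1_000_000_000), the estimate comes to roughly 20,000 minutes, i.e. about two
weeks, which is why a raw fraction alone changes very slowly in that case.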


On Tue, Aug 11, 2020 at 4:31 PM Nikolay Izhikov <nizhi...@apache.org> wrote:

> > I propose to stick with a cache-group level metric (e.g. getIndexBuildProgress)
>
> +1
>
> > that returns a float from 0 to 1, which is calculated as [processedKeys] / [localCacheSize].
>
> From my point of view, we shouldn’t do calculations on the Ignite side if
> we can avoid it.
> I’d rather provide two separate metrics - processedKeys and localCacheSize.
>
> > On 11 Aug 2020, at 16:26, Ivan Rakov <ivan.glu...@gmail.com> wrote:
> >
> >>
> >> As a compromise, I can add JMX methods (whether an index rebuild is in
> >> progress and the rebuild percentage) for the entire node, but I tried to
> >> find a suitable place and did not find one. Can you tell me where to add it?
> >
> > I have checked existing JMX beans. To be honest, I struggle to find a
> > suitable place as well.
> > We have ClusterMetrics that may represent the state of a local node, but
> > this class is also used for aggregated cluster metrics. I can't propose a
> > reasonable way to merge percentages from different nodes.
> > On the other hand, a total index rebuild for all caches isn't a common
> > scenario. It's performed either after manual index.bin removal or after
> > index creation; both operations are performed at the cache / cache-group
> > level.
> > Also, all other similar metrics are provided at the cache-group level.
> >
> > I propose to stick with a cache-group level metric (e.g.
> > getIndexBuildProgress) that returns a float from 0 to 1, which is
> > calculated as [processedKeys] / [localCacheSize]. Even if a user handles
> > metrics through Zabbix, I anticipate that he'll perform this calculation
> > on his own in order to estimate progress. Let's help him a bit and perform
> > it on the system side.
> > If a per-group percentage metric is present, I think
> > getIndexRebuildKeyProcessed becomes redundant.
> >
> > On Tue, Aug 11, 2020 at 8:20 AM ткаленко кирилл <tkalkir...@yandex.ru>
> > wrote:
> >
> >> Hi, Ivan!
> >>
> >> What precision would be sufficient?
> >>> If the progress is very slow, I don't see issues with tracking it if the
> >>> percentage float has enough precision.
> >>
> >> I think we can add a mention of how to get the cache size.
> >>> 1. Gain an understanding that local cache size
> >>> (CacheMetricsImpl#getCacheSize) should be used as a 100% milestone (it
> >>> is mentioned neither in the javadoc nor in the JMX method description).
> >>
> >> Do you think users collect metrics by hand? I think this is done
> >> by other systems, such as Zabbix.
> >>> 2. Manually calculate the sum of all metrics and divide it by the sum of
> >>> all cache sizes.
> >>
> >> As a compromise, I can add JMX methods (whether an index rebuild is in
> >> progress and the rebuild percentage) for the entire node, but I tried to
> >> find a suitable place and did not find one. Can you tell me where to add it?
> >>> On the other hand, % of index rebuild progress is self-descriptive. I
> >>> don't understand why we tend to make user's life harder.
> >>
> >> 10.08.2020, 21:57, "Ivan Rakov" <ivan.glu...@gmail.com>:
> >>>> This metric can be used only for the local node; to get the size of the cache, use
> >>>> org.apache.ignite.internal.processors.cache.CacheMetricsImpl#getCacheSize.
> >>>
> >>> Got it, agree.
> >>>
> >>>> If there is a lot of data on a node to be rebuilt, the percentage may
> >>>> change very rarely and may not give an estimate of how much time is left.
> >>>> If we see, for example, that 50_000 keys are rebuilt per minute and we
> >>>> have 1_000_000_000 keys, then we can make an approximate estimate. What do
> >>>> you think of that?
> >>>
> >>> If the progress is very slow, I don't see issues with tracking it if the
> >>> percentage float has enough precision.
> >>> Still, usability of the metric concerns me. In order to estimate the
> >>> remaining time of index rebuild, the user should:
> >>> 1. Gain an understanding that local cache size
> >>> (CacheMetricsImpl#getCacheSize) should be used as a 100% milestone (it
> >>> is mentioned neither in the javadoc nor in the JMX method description).
> >>> 2. Manually calculate the sum of all metrics and divide it by the sum of
> >>> all cache sizes.
> >>> On the other hand, % of index rebuild progress is self-descriptive. I
> >>> don't understand why we tend to make user's life harder.
> >>>
> >>> --
> >>> Best regards,
> >>> Ivan
> >>>
> >>> On Mon, Aug 10, 2020 at 8:53 PM ткаленко кирилл <tkalkir...@yandex.ru>
> >>> wrote:
> >>>
> >>>> Hi, Ivan!
> >>>>
> >>>> For this you can use
> >>>> org.apache.ignite.cache.CacheMetrics#IsIndexRebuildInProgress
> >>>>> How can a local number of processed keys help us to understand when
> >>>>> index rebuild will be finished?
> >>>>
> >>>> This metric can be used only for the local node; to get the size of the cache, use
> >>>> org.apache.ignite.internal.processors.cache.CacheMetricsImpl#getCacheSize.
> >>>>> We can't compare the metric value with cache.size(). The first one is
> >>>>> node-local, while cache size covers all partitions in the cluster.
> >>>>
> >>>> If there is a lot of data on a node to be rebuilt, the percentage may
> >>>> change very rarely and may not give an estimate of how much time is left.
> >>>> If we see, for example, that 50_000 keys are rebuilt per minute and we
> >>>> have 1_000_000_000 keys, then we can make an approximate estimate. What do
> >>>> you think of that?
> >>>>> I find one single metric much more usable. It would be perfect if the
> >>>>> metric value is represented as a percentage, e.g. current progress of
> >>>>> local node index rebuild is 60%.
> >>>>
> >>>> 10.08.2020, 19:11, "Ivan Rakov" <ivan.glu...@gmail.com>:
> >>>>> Folks,
> >>>>>
> >>>>> Sorry for coming late to the party. I've taken a look at this issue
> >>>>> during review.
> >>>>>
> >>>>> How can a local number of processed keys help us to understand when
> >>>>> index rebuild will be finished?
> >>>>> We can't compare the metric value with cache.size(). The first one is
> >>>>> node-local, while cache size covers all partitions in the cluster.
> >>>>> Also, I don't understand why we need to keep separate metrics for all
> >>>>> caches. Of course, the metric becomes more fair, but it is obviously
> >>>>> harder to make conclusions on whether "the index rebuild" process is over
> >>>>> (and the cluster is ready to process queries quickly).
> >>>>>
> >>>>> I find one single metric much more usable. It would be perfect if the
> >>>>> metric value is represented as a percentage, e.g. current progress of
> >>>>> local node index rebuild is 60%.
> >>>>>
> >>>>> --
> >>>>> Best regards,
> >>>>> Ivan
> >>>>>
> >>>>> On Fri, Jul 24, 2020 at 1:35 PM Stanislav Lukyanov <stanlukya...@gmail.com> wrote:
> >>>>>
> >>>>>> Got it. I thought that index building and index rebuilding are
> >>>>>> essentially the same, but now I see that they are different: index
> >>>>>> rebuilding cares about all indexes at once while index building cares
> >>>>>> about particular ones.
> >>>>>>
> >>>>>> Kirill's approach sounds good.
> >>>>>>
> >>>>>> Stan
> >>>>>>
> >>>>>>> On 20 Jul 2020, at 14:54, Alexey Goncharuk <alexey.goncha...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Stan,
> >>>>>>>
> >>>>>>> Currently we never build indexes one-by-one - we always use a cache data
> >>>>>>> row visitor which either updates all indexes (see IndexRebuildFullClosure)
> >>>>>>> or updates a set of all indexes that need to catch up (see
> >>>>>>> IndexRebuildPartialClosure). Given that, I do not see any need for
> >>>>>>> per-index rebuild status as this status will be updated for all outdated
> >>>>>>> indexes simultaneously.
> >>>>>>>
> >>>>>>> Kirill's approach for the total number of processed keys per cache seems
> >>>>>>> reasonable to me.
> >>>>>>>
> >>>>>>> --AG
> >>>>>>>
> >>>>>>> On Fri, 3 Jul 2020 at 10:12, ткаленко кирилл <tkalkir...@yandex.ru> wrote:
> >>>>>>>
> >>>>>>>> Hi, Stan!
> >>>>>>>>
> >>>>>>>> Perhaps it is worth clarifying what exactly I wanted to say.
> >>>>>>>> Now we have 2 processes: building and rebuilding indexes.
> >>>>>>>>
> >>>>>>>> At the moment, we have some metrics for rebuilding indexes:
> >>>>>>>> "IsIndexRebuildInProgress", "IndexBuildCountPartitionsLeft".
> >>>>>>>>
> >>>>>>>> I suggest adding another metric "Indexrebuildkeyprocessed", which will
> >>>>>>>> allow you to determine how many records are left to rebuild for the cache.
> >>>>>>>>
> >>>>>>>> I think your comments are more about building an index that may need more
> >>>>>>>> metrics, but I think you should do it in a separate ticket.
> >>>>>>>>
> >>>>>>>> 03.07.2020, 03:09, "Stanislav Lukyanov" <stanlukya...@gmail.com>:
> >>>>>>>>> If multiple indexes are to be built, the "number of indexed keys" metric
> >>>>>>>>> may be misleading.
> >>>>>>>>>
> >>>>>>>>> As a cluster admin, I'd like to know:
> >>>>>>>>> - Are all indexes ready on a node?
> >>>>>>>>> - How many indexes are to be built?
> >>>>>>>>> - How many resources are used by the index building (how many threads
> >>>>>>>>> are used)?
> >>>>>>>>> - Which index(es?) is being built right now?
> >>>>>>>>> - How much time until the current (single) index building finishes? Here
> >>>>>>>>> "time" can be a lot of things: partitions, entries, percent of the cache,
> >>>>>>>>> minutes and hours
> >>>>>>>>> - How much time until all indexes are built?
> >>>>>>>>> - How long does it take to build each of my indexes / a single index of
> >>>>>>>>> my cache on average?
> >>>>>>>>>
> >>>>>>>>> I think we need a set of metrics and/or log messages to answer all of
> >>>>>>>>> these questions.
> >>>>>>>>> I imagine something like:
> >>>>>>>>> - numberOfIndexesToBuild
> >>>>>>>>> - a standard set of metrics on the index building thread pool (do we
> >>>>>>>>> already have it?)
> >>>>>>>>> - currentlyBuiltIndexName (assuming we only build one at a time, which is
> >>>>>>>>> probably not true)
> >>>>>>>>> - for the "time" metrics I think percentage might be the best as it's
> >>>>>>>>> the easiest to understand; we may add multiple metrics though.
> >>>>>>>>> - For "time per each index" I'd add detailed log messages stating how
> >>>>>>>>> long it took to build a particular index
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Stan
> >>>>>>>>>
> >>>>>>>>>> On 26 Jun 2020, at 12:49, ткаленко кирилл <tkalkir...@yandex.ru> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi, Igniters.
> >>>>>>>>>>
> >>>>>>>>>> I would like to know if it is possible to estimate how long the index
> >>>>>>>>>> rebuild will take?
> >>>>>>>>>>
> >>>>>>>>>> At the moment, I have found the following metrics [1] and [2], and
> >>>>>>>>>> since the rebuild is based on caches, I think it would be useful to know
> >>>>>>>>>> how many records have been processed by indexing. This way we can estimate
> >>>>>>>>>> how long we have to wait for the index to be rebuilt by subtracting the
> >>>>>>>>>> number of indexed records from [3].
> >>>>>>>>>>
> >>>>>>>>>> I think we should add this metric [4].
> >>>>>>>>>>
> >>>>>>>>>> Comments, suggestions?
> >>>>>>>>>>
> >>>>>>>>>> [1] - https://issues.apache.org/jira/browse/IGNITE-12184
> >>>>>>>>>> [2] - org.apache.ignite.internal.processors.cache.CacheGroupMetricsImpl#idxBuildCntPartitionsLeft
> >>>>>>>>>> [3] - org.apache.ignite.cache.CacheMetrics#getCacheSize
> >>>>>>>>>> [4] - org.apache.ignite.cache.CacheMetrics#getNumberIndexedKeys
> >>>>>>>>
> >>
>
>
