Folks,

All of the previous suggestions share a drawback: several exchanges can happen between two metric updates, and a fast exchange can overwrite the value left by a previous long exchange.
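To make the problem concrete, here is a minimal sketch in plain Java (not Ignite code; the class and method names are hypothetical). Two exchanges finish between two metric reads. A metric that stores only the last blocking duration reports the fast exchange, while an accumulated total keeps both.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch only: both classes are hypothetical and not part of the Ignite API. */
final class BlockingMetricSketch {
    /** Last-value metric: every finished exchange overwrites the previous value. */
    static final class LastDurationGauge {
        private final AtomicLong lastMs = new AtomicLong();

        void onExchangeFinished(long blockedMs) { lastMs.set(blockedMs); }

        long value() { return lastMs.get(); }
    }

    /** Accumulating metric: every finished exchange adds to the total. */
    static final class TotalDurationCounter {
        private final AtomicLong totalMs = new AtomicLong();

        void onExchangeFinished(long blockedMs) { totalMs.addAndGet(blockedMs); }

        long value() { return totalMs.get(); }
    }

    public static void main(String[] args) {
        LastDurationGauge gauge = new LastDurationGauge();
        TotalDurationCounter total = new TotalDurationCounter();

        // Two exchanges finish between two consecutive metric reads:
        // a long one (30 s of blocking) followed by a fast one (50 ms).
        long[] blockedMs = {30_000L, 50L};
        for (long ms : blockedMs) {
            gauge.onExchangeFinished(ms);
            total.onExchangeFinished(ms);
        }

        // The reader samples only now, so the gauge has lost the long exchange.
        System.out.println("gauge=" + gauge.value() + " total=" + total.value());
    }
}
```

Running it prints `gauge=50 total=30050`: the 30-second block is invisible in the last-value metric.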
We can introduce a total blocking duration metric that is accumulated at the end of each exchange, so users get accurate information about how long operations were blocked. The cluster metric will be the maximum of the local node metrics. We also need a boolean metric that indicates the real-time status, because the duration metric is updated only at the end of the exchange.

So I propose to change the current metric, which is not released yet, to totalCacheOperationsBlockingDuration and to add the isCacheOperationsBlocked metric.

WDYT?

Mon, Jul 22, 2019 at 09:27, Anton Vinogradov <a...@apache.org>:
>
> Nikolay,
>
> Still see no reason to replace boolean with long.
>
> On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov <nizhi...@apache.org> wrote:
>
> > Anton.
> >
> > 1. Value exported based on SPI settings, not in the moment it changed.
> >
> > 2. Clock synchronisation - if we export start time, we should also export
> > node local timestamp.
> >
> > Mon, Jul 22, 2019, 8:33 Anton Vinogradov <a...@apache.org>:
> >
> > > Folks,
> > >
> > > What's the reason for duration counting?
> > > AFAIU, it's a monitoring system feature to count the durations.
> > > Since monitoring system checks metrics periodically it will know the
> > > duration by its own log.
> > >
> > > On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko <jokse...@gmail.com>
> > > wrote:
> > >
> > > > Nikita,
> > > >
> > > > Yes, I mean duration not timestamp. For the metric name, I suggest
> > > > "cacheOperationsBlockingDuration", I think it cleaner represents what is
> > > > blocked during PME.
> > > > We can also combine both timestamp "cacheOperationsBlockingStartTs" and
> > > > duration to have better correlation when cache operations were blocked and
> > > > how much time it's taken.
> > > > For instant view (like in JMX bean) a calculated value as you mentioned
> > > > can be used.
> > > > For metrics exported to some backend (IEP-35) a counter can be used.
> > > > The counter is incremented by blocking time after blocking has ended.
> > > >
> > > > Fri, Jul 19, 2019 at 19:10, Nikita Amelchev <nsamelc...@gmail.com>:
> > > >
> > > >> Pavel,
> > > >>
> > > >> The main purpose of this metric is
> > > >> >> how much time we wait for resuming cache operations
> > > >>
> > > >> Seems I misunderstood you. Do you mean timestamp or duration here?
> > > >> >> What do you think if we change the boolean value of metric to a long
> > > >> >> value that represents time in milliseconds when operations were blocked?
> > > >>
> > > >> This time can be calculated as (currentTime -
> > > >> timeSinceOperationsBlocked) in case of timestamp.
> > > >>
> > > >> Duration will be more understandable. It'll be something like
> > > >> getCurrentBlockingPmeDuration. But I haven't come up with a better
> > > >> name yet.
> > > >>
> > > >> Fri, Jul 19, 2019 at 18:30, Pavel Kovalenko <jokse...@gmail.com>:
> > > >> >
> > > >> > Nikita,
> > > >> >
> > > >> > I think getCurrentPmeDuration doesn't show useful information. The main
> > > >> > PME side effect for end-users is blocking cache operations. Not all PME
> > > >> > time blocks it.
> > > >> > What information gives to an end-user timestamp of
> > > >> > "timeSinceOperationsBlocked"? For what analysis it can be used and how?
> > > >> >
> > > >> > Fri, Jul 19, 2019 at 17:48, Nikita Amelchev <nsamelc...@gmail.com>:
> > > >> >>
> > > >> >> Hi Pavel,
> > > >> >>
> > > >> >> This time already can be obtained from the getCurrentPmeDuration and
> > > >> >> new isOperationsBlockedByPme metrics.
> > > >> >>
> > > >> >> As an alternative solution, I can rework recently added
> > > >> >> getCurrentPmeDuration metric (not released yet). Seems for users it
> > > >> >> useless in case of non-blocking PME.
> > > >> >> Let's name it timeSinceOperationsBlocked. It'll be timestamp when
> > > >> >> blocking started (minimal value of cluster nodes) and 0 if blocking
> > > >> >> ends (there is no running PME).
> > > >> >>
> > > >> >> WDYT?
> > > >> >>
> > > >> >> Fri, Jul 19, 2019 at 15:56, Pavel Kovalenko <jokse...@gmail.com>:
> > > >> >> >
> > > >> >> > Hi Nikita,
> > > >> >> >
> > > >> >> > Thank you for working on this. What do you think if we change the boolean
> > > >> >> > value of metric to a long value that represents time in milliseconds when
> > > >> >> > operations were blocked?
> > > >> >> > Since we have not only JMX and now metrics are periodically exported to
> > > >> >> > some backend it can give a more clear picture of how much time we wait for
> > > >> >> > resuming cache operations instead of instant boolean indicator.
> > > >> >> >
> > > >> >> > Fri, Jul 19, 2019 at 14:41, Nikita Amelchev <nsamelc...@gmail.com>:
> > > >> >> >
> > > >> >> > > Anton, Nikolay,
> > > >> >> > >
> > > >> >> > > Thanks for the support.
> > > >> >> > >
> > > >> >> > > For now, we have the getCurrentPmeDuration() metric that does not show
> > > >> >> > > influence on the cluster correctly. PME can be without blocking
> > > >> >> > > operations. For example, client node join/leave events.
> > > >> >> > >
> > > >> >> > > I suggest add new metric - isOperationsBlockedByPme(). Together, these
> > > >> >> > > metrics will show influence of the PME on cluster and user operations.
> > > >> >> > >
> > > >> >> > > I have prepared PR for this (Bot visa is green). [1] Can anyone take a
> > > >> >> > > look?
> > > >> >> > >
> > > >> >> > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
> > > >> >> > >
> > > >> >> > > Tue, Jul 16, 2019 at 14:58, Nikolay Izhikov <nizhi...@apache.org>:
> > > >> >> > >
> > > >> >> > > >
> > > >> >> > > > I think administrator of Ignite cluster should be able to monitor all
> > > >> >> > > > Ignite process, including non blocking PME.
> > > >> >> > > >
> > > >> >> > > > On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
> > > >> >> > > > > BTW,
> > > >> >> > > > > Found PME metric - getCurrentPmeDuration().
> > > >> >> > > > > Seems, it shows exactly PME time and not so useful because of this.
> > > >> >> > > > > The goal is to show exactly blocking period.
> > > >> >> > > > > When PME cause no blocking, it's a good PME and I see no reason to have
> > > >> >> > > > > monitoring related to it :)
> > > >> >> > > > >
> > > >> >> > > > > On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov <nizhi...@apache.org> wrote:
> > > >> >> > > > >
> > > >> >> > > > > > Anton.
> > > >> >> > > > > >
> > > >> >> > > > > > Why do we need to postpone implementation of this metrics?
> > > >> >> > > > > > For now, implementation of new metric is very simple.
> > > >> >> > > > > >
> > > >> >> > > > > > I think we can implement this metrics as a single contribution.
> > > >> >> > > > > >
> > > >> >> > > > > > On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
> > > >> >> > > > > > > Nikita,
> > > >> >> > > > > > >
> > > >> >> > > > > > > Looks like all we need now is a 1 simple metric: are operations blocked?
> > > >> >> > > > > > > Just a true or false.
> > > >> >> > > > > > > Let's start from this.
> > > >> >> > > > > > > All other metrics can be extracted from logs now and can be implemented
> > > >> >> > > > > > > later.
> > > >> >> > > > > > >
> > > >> >> > > > > > > On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov <nizhi...@apache.org>
> > > >> >> > > > > > > wrote:
> > > >> >> > > > > > >
> > > >> >> > > > > > > > +1.
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > Nikita, please, go ahead.
> > > >> >> > > > > > > >
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > Tue, Jul 16, 2019, 11:45 Nikita Amelchev <nsamelc...@gmail.com>:
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > > Hello, Igniters.
> > > >> >> > > > > > > > >
> > > >> >> > > > > > > > > I suggest to add some useful metrics about the partition map exchange
> > > >> >> > > > > > > > > (PME). For now, the duration of PME stages is available only in log files
> > > >> >> > > > > > > > > and cannot be obtained using JMX or other external tools. [1]
> > > >> >> > > > > > > > >
> > > >> >> > > > > > > > > I made the list of local node metrics that help to understand the
> > > >> >> > > > > > > > > actual status of current PME:
> > > >> >> > > > > > > > >
> > > >> >> > > > > > > > > 1. initialVersion. Topology version that initiates the exchange.
> > > >> >> > > > > > > > > 2. initTime. Time PME was started.
> > > >> >> > > > > > > > > 3. initEvent. Event that triggered PME.
> > > >> >> > > > > > > > > 4. partitionReleaseTime. Time when a node has finished waiting for all updates and transactions on a previous topology.
> > > >> >> > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a single message.
> > > >> >> > > > > > > > > 6. recieveFullMessageTime. Time when a node received a full message.
> > > >> >> > > > > > > > > 7. finishTime. Time PME was ended.
> > > >> >> > > > > > > > >
> > > >> >> > > > > > > > > When a new PME starts, all these metrics are reset.
> > > >> >> > > > > > > > >
> > > >> >> > > > > > > > > These metrics help to understand:
> > > >> >> > > > > > > > > - how long PME was (current or previous).
> > > >> >> > > > > > > > > - how long the wait for all updates to complete was.
> > > >> >> > > > > > > > > - what node blocks PME (didn't send a single message)
> > > >> >> > > > > > > > > - what triggered PME.
> > > >> >> > > > > > > > >
> > > >> >> > > > > > > > > Thoughts?
> > > >> >> > > > > > > > >
> > > >> >> > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
> > > >> >> > > > > > > > >
> > > >> >> > > > > > > > > --
> > > >> >> > > > > > > > > Best wishes,
> > > >> >> > > > > > > > > Amelchev Nikita
> > > >> >> > > > > > > > >
> > > >> >> > >
> > > >> >> > > --
> > > >> >> > > Best wishes,
> > > >> >> > > Amelchev Nikita
> > > >> >> > >
> > > >> >>
> > > >> >> --
> > > >> >> Best wishes,
> > > >> >> Amelchev Nikita
> > > >> >
> > > >>
> > > >> --
> > > >> Best wishes,
> > > >> Amelchev Nikita
> > > >>

--
Best wishes,
Amelchev Nikita
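P.S. A rough sketch of how the two metrics proposed at the top of this message could behave, to make the semantics concrete. This is plain Java, not Ignite code: only the names totalCacheOperationsBlockingDuration and isCacheOperationsBlocked come from the proposal, while the classes, the callbacks and the millisecond unit are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Sketch only: hypothetical classes, not the Ignite metrics API. */
final class CacheBlockingMetricsSketch {
    /** Local (per-node) view of the two proposed metrics. */
    static final class NodeMetrics {
        private final AtomicLong totalBlockedMs = new AtomicLong();

        /** Start of the current blocking period, or -1 when operations are not blocked. */
        private volatile long blockStartMs = -1L;

        /** Assumed to be called by a single exchange thread when blocking starts. */
        void onBlockingStarted(long nowMs) {
            blockStartMs = nowMs;
        }

        /** The total is updated only here, at the end of the blocking period. */
        void onBlockingFinished(long nowMs) {
            long start = blockStartMs;
            if (start < 0)
                return;
            totalBlockedMs.addAndGet(nowMs - start);
            blockStartMs = -1L;
        }

        /** Real-time status, available while the total is not yet updated. */
        boolean isCacheOperationsBlocked() { return blockStartMs >= 0; }

        long totalCacheOperationsBlockingDuration() { return totalBlockedMs.get(); }
    }

    /** Cluster-wide value: the maximum of the local node values. */
    static long clusterTotalBlockingDuration(List<NodeMetrics> nodes) {
        return nodes.stream()
            .mapToLong(NodeMetrics::totalCacheOperationsBlockingDuration)
            .max()
            .orElse(0L);
    }

    public static void main(String[] args) {
        NodeMetrics a = new NodeMetrics();
        NodeMetrics b = new NodeMetrics();

        a.onBlockingStarted(0L);
        a.onBlockingFinished(1_200L);
        b.onBlockingStarted(0L);
        b.onBlockingFinished(800L);
        b.onBlockingStarted(2_000L); // a second exchange is still blocking node b

        System.out.println(clusterTotalBlockingDuration(Arrays.asList(a, b))); // 1200
        System.out.println(b.isCacheOperationsBlocked());                      // true
        System.out.println(b.totalCacheOperationsBlockingDuration());          // 800
    }
}
```

Node b shows why the boolean is needed: its total still reads 800 while the second exchange is blocking operations, and only isCacheOperationsBlocked reveals that.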