Folks,

What's the reason for counting the duration? AFAIU, counting durations is a feature of the monitoring system. Since the monitoring system checks metrics periodically, it will know the duration from its own log.
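
To make the point concrete, here is a minimal sketch of how a monitoring backend could derive the blocked time from periodic samples of a boolean metric such as isOperationsBlockedByPme. The class and method names (BlockedTimeEstimator, Sample, estimateBlockedMs) are hypothetical and not part of Ignite; the sketch assumes each sampled state holds until the next sample, so the result is only as precise as the polling interval.

```java
import java.util.List;

/** Hypothetical monitoring-side helper; an illustration only, not Ignite code. */
final class BlockedTimeEstimator {
    /** One poll result: when the sample was taken and whether operations were blocked. */
    static final class Sample {
        final long timestampMs;
        final boolean blocked;

        Sample(long timestampMs, boolean blocked) {
            this.timestampMs = timestampMs;
            this.blocked = blocked;
        }
    }

    /**
     * Estimates the total blocked time by treating each sample's state as valid
     * until the next sample. Samples must be ordered by timestamp.
     */
    static long estimateBlockedMs(List<Sample> samples) {
        long total = 0;

        for (int i = 0; i + 1 < samples.size(); i++) {
            Sample cur = samples.get(i);

            if (cur.blocked)
                total += samples.get(i + 1).timestampMs - cur.timestampMs;
        }

        return total;
    }
}
```

For samples taken at 0 ms (not blocked), 1000 ms (blocked), 2000 ms (blocked) and 3000 ms (not blocked), the estimate is 2000 ms. A blocking period that starts and ends between two polls is not seen at all by this approach.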

On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko <jokse...@gmail.com> wrote:

> Nikita,
>
> Yes, I mean duration, not timestamp. For the metric name, I suggest "cacheOperationsBlockingDuration"; I think it more clearly represents what is blocked during PME.
> We can also combine both the timestamp "cacheOperationsBlockingStartTs" and the duration to better correlate when cache operations were blocked and how much time it took.
> For an instant view (like in a JMX bean), a calculated value, as you mentioned, can be used.
> For metrics that are exported to some backend (IEP-35), a counter can be used. The counter is incremented by the blocking time after blocking has ended.
>
> On Fri, Jul 19, 2019 at 19:10, Nikita Amelchev <nsamelc...@gmail.com> wrote:
>
>> Pavel,
>>
>> The main purpose of this metric is
>>
>> how much time we wait for resuming cache operations
>>
>> Seems I misunderstood you. Do you mean timestamp or duration here?
>>
>> What do you think if we change the boolean value of metric to a long value that represents time in milliseconds when operations were blocked?
>>
>> This time can be calculated as (currentTime - timeSinceOperationsBlocked) in the case of a timestamp.
>>
>> Duration will be more understandable. It'll be something like getCurrentBlockingPmeDuration. But I haven't come up with a better name yet.
>>
>> On Fri, Jul 19, 2019 at 18:30, Pavel Kovalenko <jokse...@gmail.com> wrote:
>> >
>> > Nikita,
>> >
>> > I think getCurrentPmeDuration doesn't show useful information. The main PME side effect for end users is blocking cache operations. Not all of the PME time blocks them.
>> > What information does the "timeSinceOperationsBlocked" timestamp give to an end user? For what analysis can it be used, and how?
>> >
>> > On Fri, Jul 19, 2019 at 17:48, Nikita Amelchev <nsamelc...@gmail.com> wrote:
>> >>
>> >> Hi Pavel,
>> >>
>> >> This time can already be obtained from the getCurrentPmeDuration and the new isOperationsBlockedByPme metrics.
>> >>
>> >> As an alternative solution, I can rework the recently added getCurrentPmeDuration metric (not released yet). It seems useless for users in the case of non-blocking PME.
>> >> Let's name it timeSinceOperationsBlocked. It'll be the timestamp when blocking started (the minimal value across cluster nodes) and 0 if blocking has ended (there is no running PME).
>> >>
>> >> WDYT?
>> >>
>> >> On Fri, Jul 19, 2019 at 15:56, Pavel Kovalenko <jokse...@gmail.com> wrote:
>> >> >
>> >> > Hi Nikita,
>> >> >
>> >> > Thank you for working on this. What do you think if we change the boolean value of metric to a long value that represents time in milliseconds when operations were blocked?
>> >> > Since we now have not only JMX, and metrics are periodically exported to some backend, it can give a clearer picture of how much time we wait for resuming cache operations than an instant boolean indicator.
>> >> >
>> >> > On Fri, Jul 19, 2019 at 14:41, Nikita Amelchev <nsamelc...@gmail.com> wrote:
>> >> >
>> >> > > Anton, Nikolay,
>> >> > >
>> >> > > Thanks for the support.
>> >> > >
>> >> > > For now, we have the getCurrentPmeDuration() metric, which does not correctly show the influence on the cluster. PME can happen without blocking operations, for example, on client node join/leave events.
>> >> > >
>> >> > > I suggest adding a new metric: isOperationsBlockedByPme(). Together, these metrics will show the influence of the PME on the cluster and user operations.
>> >> > >
>> >> > > I have prepared a PR for this (Bot visa is green). [1] Can anyone take a look?
>> >> > >
>> >> > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
>> >> > >
>> >> > > On Tue, Jul 16, 2019 at 14:58, Nikolay Izhikov <nizhi...@apache.org> wrote:
>> >> > > >
>> >> > > > I think the administrator of an Ignite cluster should be able to monitor all Ignite processes, including non-blocking PME.
>> >> > > >
>> >> > > > On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
>> >> > > > > BTW,
>> >> > > > > Found a PME metric: getCurrentPmeDuration().
>> >> > > > > It seems it shows exactly the PME time and is not so useful because of this.
>> >> > > > > The goal is to show exactly the blocking period.
>> >> > > > > When PME causes no blocking, it's a good PME, and I see no reason to have monitoring related to it :)
>> >> > > > >
>> >> > > > > On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov <nizhi...@apache.org> wrote:
>> >> > > > >
>> >> > > > > > Anton.
>> >> > > > > >
>> >> > > > > > Why do we need to postpone the implementation of these metrics?
>> >> > > > > > For now, implementing a new metric is very simple.
>> >> > > > > >
>> >> > > > > > I think we can implement these metrics as a single contribution.
>> >> > > > > >
>> >> > > > > > On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
>> >> > > > > > > Nikita,
>> >> > > > > > >
>> >> > > > > > > Looks like all we need now is one simple metric: are operations blocked?
>> >> > > > > > > Just true or false.
>> >> > > > > > > Let's start with this.
>> >> > > > > > > All other metrics can be extracted from logs now and can be implemented later.
>> >> > > > > > >
>> >> > > > > > > On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov <nizhi...@apache.org> wrote:
>> >> > > > > > >
>> >> > > > > > > > +1.
>> >> > > > > > > >
>> >> > > > > > > > Nikita, please, go ahead.
>> >> > > > > > > >
>> >> > > > > > > > On Tue, Jul 16, 2019, 11:45 Nikita Amelchev <nsamelc...@gmail.com> wrote:
>> >> > > > > > > >
>> >> > > > > > > > > Hello, Igniters.
>> >> > > > > > > > >
>> >> > > > > > > > > I suggest adding some useful metrics about the partition map exchange (PME). For now, the duration of PME stages is available only in log files and cannot be obtained using JMX or other external tools. [1]
>> >> > > > > > > > >
>> >> > > > > > > > > I made a list of local node metrics that help to understand the actual status of the current PME:
>> >> > > > > > > > >
>> >> > > > > > > > > 1. initialVersion. Topology version that initiates the exchange.
>> >> > > > > > > > > 2. initTime. Time PME was started.
>> >> > > > > > > > > 3. initEvent. Event that triggered PME.
>> >> > > > > > > > > 4. partitionReleaseTime. Time when a node has finished waiting for all updates and transactions on a previous topology.
>> >> > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a single message.
>> >> > > > > > > > > 6. recieveFullMessageTime. Time when a node received a full message.
>> >> > > > > > > > > 7. finishTime. Time PME was ended.
>> >> > > > > > > > >
>> >> > > > > > > > > When a new PME starts, all these metrics are reset.
>> >> > > > > > > > >
>> >> > > > > > > > > These metrics help to understand:
>> >> > > > > > > > > - how long PME was (current or previous).
>> >> > > > > > > > > - how long the wait for all updates to complete was.
>> >> > > > > > > > > - what node blocks PME (didn't send a single message)
>> >> > > > > > > > > - what triggered PME.
>> >> > > > > > > > >
>> >> > > > > > > > > Thoughts?
>> >> > > > > > > > >
>> >> > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
>> >> > > > > > > > >
>> >> > > > > > > > > --
>> >> > > > > > > > > Best wishes,
>> >> > > > > > > > > Amelchev Nikita
>> >> > >
>> >> > > --
>> >> > > Best wishes,
>> >> > > Amelchev Nikita
>> >>
>> >> --
>> >> Best wishes,
>> >> Amelchev Nikita
>>
>> --
>> Best wishes,
>> Amelchev Nikita
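
For reference, the two variants discussed in the quoted thread (an instant duration calculated from the blocking start timestamp, and a cumulative counter incremented after blocking has ended) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: BlockingDurationTracker and all of its method names are hypothetical, timestamps are passed in explicitly to keep the sketch testable, and a start timestamp of 0 is assumed to mean "not blocked".

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical tracker for the two variants from the thread; not Ignite code. */
final class BlockingDurationTracker {
    /** Timestamp when blocking started, or 0 if operations are not blocked. */
    private final AtomicLong blockingStartTs = new AtomicLong();

    /** Total blocked time accumulated over all finished blocking periods. */
    private final AtomicLong totalBlockedMs = new AtomicLong();

    /** Records the start of blocking; ignored if blocking is already in progress. */
    void onBlockingStarted(long nowMs) {
        blockingStartTs.compareAndSet(0, nowMs);
    }

    /** Adds the finished blocking period to the counter and clears the start timestamp. */
    void onBlockingEnded(long nowMs) {
        long start = blockingStartTs.getAndSet(0);

        if (start != 0)
            totalBlockedMs.addAndGet(nowMs - start);
    }

    /** Instant view (e.g. for a JMX bean): currentTime minus the blocking start, or 0. */
    long currentBlockingDuration(long nowMs) {
        long start = blockingStartTs.get();

        return start == 0 ? 0 : nowMs - start;
    }

    /** Counter view (e.g. for an exported metric): grows only after blocking has ended. */
    long totalBlockingDuration() {
        return totalBlockedMs.get();
    }
}
```

With blocking started at 1000 ms and ended at 3000 ms, the instant value read at 1500 ms is 500 ms, and the counter is 0 before the end and 2000 ms after it.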