Nikita,

Yes, I mean duration, not timestamp. For the metric name, I suggest "cacheOperationsBlockingDuration"; I think it more clearly represents what is blocked during PME.
We can also combine the timestamp "cacheOperationsBlockingStartTs" with the duration, to better correlate when cache operations were blocked and how long the blocking took.
For an instant view (like in a JMX bean), a calculated value, as you mentioned, can be used. For metrics that are exported to some backend (IEP-35), a counter can be used. The counter is incremented by the blocking time after blocking has ended.
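To make the proposal concrete, here is a minimal sketch of how the two views could sit side by side. It is only an illustration under assumptions, not Ignite code: the class name, the onBlockingStarted/onBlockingEnded hooks, the injected clock and the "totalCacheOperationsBlockedDuration" counter name are all hypothetical. Only "cacheOperationsBlockingDuration" and "cacheOperationsBlockingStartTs" come from the discussion above.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical holder for the proposed metrics; not part of the Ignite API. */
class CacheOperationsBlockingMetrics {
    /** Time source in milliseconds, e.g. System::currentTimeMillis (injected for testability). */
    private final LongSupplier clock;

    /** Timestamp when blocking started, or 0 if operations are not blocked. */
    private final AtomicLong blockingStartTs = new AtomicLong();

    /** Cumulative blocked time; updated only after a blocking period has ended. */
    private final AtomicLong totalBlockedMs = new AtomicLong();

    CacheOperationsBlockingMetrics(LongSupplier clock) {
        this.clock = clock;
    }

    /** Hypothetical hook: called when PME starts blocking cache operations. */
    void onBlockingStarted() {
        // Keep the earliest start if the hook fires twice for the same period.
        blockingStartTs.compareAndSet(0, clock.getAsLong());
    }

    /** Hypothetical hook: called when cache operations are resumed. */
    void onBlockingEnded() {
        long start = blockingStartTs.getAndSet(0);

        if (start != 0)
            totalBlockedMs.addAndGet(clock.getAsLong() - start);
    }

    /** Instant view: timestamp when blocking started, 0 if not blocked. */
    long cacheOperationsBlockingStartTs() {
        return blockingStartTs.get();
    }

    /** Instant view: currentTime - startTs, or 0 if not blocked. */
    long cacheOperationsBlockingDuration() {
        long start = blockingStartTs.get();

        return start == 0 ? 0 : clock.getAsLong() - start;
    }

    /** Counter view for exporters (name is an assumption). */
    long totalCacheOperationsBlockedDuration() {
        return totalBlockedMs.get();
    }
}
```

With this split, a JMX reader sees how long the current blocking has lasted, while a periodic exporter only reads a monotonically growing counter and can derive the blocked time per interval from the difference between two samples.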
Fri, 19 Jul 2019 at 19:10, Nikita Amelchev <nsamelc...@gmail.com>:
> Pavel,
>
> The main purpose of this metric is
> >> how much time we wait for resuming cache operations
>
> Seems I misunderstood you. Do you mean timestamp or duration here?
> >> What do you think if we change the boolean value of metric to a long
> >> value that represents time in milliseconds when operations were blocked?
>
> This time can be calculated as (currentTime - timeSinceOperationsBlocked)
> in case of timestamp.
>
> Duration will be more understandable. It'll be something like
> getCurrentBlockingPmeDuration. But I haven't come up with a better
> name yet.
>
> Fri, 19 Jul 2019 at 18:30, Pavel Kovalenko <jokse...@gmail.com>:
> >
> > Nikita,
> >
> > I think getCurrentPmeDuration doesn't show useful information. The main
> > PME side effect for end-users is blocking cache operations. Not all PME
> > time blocks it.
> > What information does the "timeSinceOperationsBlocked" timestamp give to
> > an end-user? For what analysis can it be used, and how?
> >
> > Fri, 19 Jul 2019 at 17:48, Nikita Amelchev <nsamelc...@gmail.com>:
> >>
> >> Hi Pavel,
> >>
> >> This time already can be obtained from the getCurrentPmeDuration and
> >> new isOperationsBlockedByPme metrics.
> >>
> >> As an alternative solution, I can rework the recently added
> >> getCurrentPmeDuration metric (not released yet). It seems useless for
> >> users in case of non-blocking PME.
> >> Let's name it timeSinceOperationsBlocked. It'll be the timestamp when
> >> blocking started (minimal value of cluster nodes) and 0 if blocking
> >> ends (there is no running PME).
> >>
> >> WDYT?
> >>
> >> Fri, 19 Jul 2019 at 15:56, Pavel Kovalenko <jokse...@gmail.com>:
> >> >
> >> > Hi Nikita,
> >> >
> >> > Thank you for working on this. What do you think if we change the
> >> > boolean value of metric to a long value that represents time in
> >> > milliseconds when operations were blocked?
> >> > Since we have not only JMX and now metrics are periodically exported
> >> > to some backend, it can give a clearer picture of how much time we
> >> > wait for resuming cache operations instead of an instant boolean
> >> > indicator.
> >> >
> >> > Fri, 19 Jul 2019 at 14:41, Nikita Amelchev <nsamelc...@gmail.com>:
> >> >
> >> > > Anton, Nikolay,
> >> > >
> >> > > Thanks for the support.
> >> > >
> >> > > For now, we have the getCurrentPmeDuration() metric that does not
> >> > > show influence on the cluster correctly. PME can be without blocking
> >> > > operations. For example, client node join/leave events.
> >> > >
> >> > > I suggest adding a new metric - isOperationsBlockedByPme(). Together,
> >> > > these metrics will show influence of the PME on cluster and user
> >> > > operations.
> >> > >
> >> > > I have prepared PR for this (Bot visa is green). [1] Can anyone
> >> > > take a look?
> >> > >
> >> > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
> >> > >
> >> > > Tue, 16 Jul 2019 at 14:58, Nikolay Izhikov <nizhi...@apache.org>:
> >> > > >
> >> > > > I think an administrator of an Ignite cluster should be able to
> >> > > > monitor all Ignite processes, including non-blocking PME.
> >> > > >
> >> > > > On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
> >> > > > > BTW,
> >> > > > > Found PME metric - getCurrentPmeDuration().
> >> > > > > Seems, it shows exactly PME time and not so useful because of
> >> > > > > this.
> >> > > > > The goal is to show exactly the blocking period.
> >> > > > > When PME causes no blocking, it's a good PME and I see no reason
> >> > > > > to have monitoring related to it :)
> >> > > > >
> >> > > > > On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov <nizhi...@apache.org>
> >> > > > > wrote:
> >> > > > >
> >> > > > > > Anton.
> >> > > > > >
> >> > > > > > Why do we need to postpone implementation of this metrics?
> >> > > > > > For now, implementation of new metric is very simple.
> >> > > > > >
> >> > > > > > I think we can implement this metrics as a single
> >> > > > > > contribution.
> >> > > > > >
> >> > > > > > On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
> >> > > > > > > Nikita,
> >> > > > > > >
> >> > > > > > > Looks like all we need now is 1 simple metric: are
> >> > > > > > > operations blocked?
> >> > > > > > > Just a true or false.
> >> > > > > > > Let's start from this.
> >> > > > > > > All other metrics can be extracted from logs now and can be
> >> > > > > > > implemented later.
> >> > > > > > >
> >> > > > > > > On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov
> >> > > > > > > <nizhi...@apache.org> wrote:
> >> > > > > > >
> >> > > > > > > > +1.
> >> > > > > > > >
> >> > > > > > > > Nikita, please, go ahead.
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > Tue, 16 Jul 2019, 11:45 Nikita Amelchev <nsamelc...@gmail.com>:
> >> > > > > > > >
> >> > > > > > > > > Hello, Igniters.
> >> > > > > > > > >
> >> > > > > > > > > I suggest adding some useful metrics about the partition
> >> > > > > > > > > map exchange (PME). For now, the duration of PME stages
> >> > > > > > > > > is available only in log files and cannot be obtained
> >> > > > > > > > > using JMX or other external tools. [1]
> >> > > > > > > > >
> >> > > > > > > > > I made the list of local node metrics that help to
> >> > > > > > > > > understand the actual status of current PME:
> >> > > > > > > > >
> >> > > > > > > > > 1. initialVersion. Topology version that initiates the
> >> > > > > > > > > exchange.
> >> > > > > > > > > 2. initTime. Time PME was started.
> >> > > > > > > > > 3. initEvent. Event that triggered PME.
> >> > > > > > > > > 4. partitionReleaseTime. Time when a node has finished
> >> > > > > > > > > waiting for all updates and translations on a previous
> >> > > > > > > > > topology.
> >> > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a single
> >> > > > > > > > > message.
> >> > > > > > > > > 6. recieveFullMessageTime. Time when a node received a
> >> > > > > > > > > full message.
> >> > > > > > > > > 7. finishTime. Time PME was ended.
> >> > > > > > > > >
> >> > > > > > > > > When a new PME starts, all these metrics are reset.
> >> > > > > > > > >
> >> > > > > > > > > These metrics help to understand:
> >> > > > > > > > > - how long PME was (current or previous).
> >> > > > > > > > > - how long the wait for all updates to complete was.
> >> > > > > > > > > - what node blocks PME (didn't send a single message)
> >> > > > > > > > > - what triggered PME.
> >> > > > > > > > >
> >> > > > > > > > > Thoughts?
> >> > > > > > > > >
> >> > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
> >> > > > > > > > >
> >> > > > > > > > > --
> >> > > > > > > > > Best wishes,
> >> > > > > > > > > Amelchev Nikita
> >> > >
> >> > >
> >> > > --
> >> > > Best wishes,
> >> > > Amelchev Nikita
> >>
> >>
> >> --
> >> Best wishes,
> >> Amelchev Nikita
>
>
> --
> Best wishes,
> Amelchev Nikita