Hi Pavel,

This time can already be obtained from the getCurrentPmeDuration and the new
isOperationsBlockedByPme metrics.
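To illustrate, here is a minimal sketch of how a monitoring client might combine the two values. The class and method names are hypothetical and not part of Ignite; only the two metric names come from this thread. It assumes the duration is reported in milliseconds and that, while operations are blocked, the current PME duration is a good enough approximation of the blocking time.

```java
// Hypothetical helper, not Ignite API: combines the two metric values read
// from a node (for example over JMX) into an approximate blocking time.
final class PmeBlockingSketch {
    /**
     * @param currentPmeDuration value of getCurrentPmeDuration (assumed milliseconds)
     * @param operationsBlocked  value of isOperationsBlockedByPme
     * @return approximate time operations have been blocked, or 0 if not blocked
     */
    static long blockedMillis(long currentPmeDuration, boolean operationsBlocked) {
        // A non-blocking PME (e.g. client node join/leave) contributes nothing.
        return operationsBlocked ? currentPmeDuration : 0L;
    }

    public static void main(String[] args) {
        System.out.println(blockedMillis(1500L, true));
        System.out.println(blockedMillis(1500L, false));
    }
}
```

Both values have to be sampled at about the same moment, otherwise the result can mix two different exchanges.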
As an alternative solution, I can rework the recently added
getCurrentPmeDuration metric (not released yet). It seems useless to users in
the case of a non-blocking PME. Let's name it timeSinceOperationsBlocked. It
will be the timestamp when blocking started (the minimal value across cluster
nodes), and 0 once blocking ends (there is no running PME).

WDYT?

Fri, Jul 19, 2019 at 15:56, Pavel Kovalenko <jokse...@gmail.com>:
>
> Hi Nikita,
>
> Thank you for working on this. What do you think about changing the boolean
> value of the metric to a long value that represents the time in milliseconds
> when operations were blocked?
> Since we have not only JMX, and metrics are now periodically exported to
> some backend, it can give a clearer picture of how long we wait for cache
> operations to resume than an instant boolean indicator.
>
> Fri, Jul 19, 2019 at 14:41, Nikita Amelchev <nsamelc...@gmail.com>:
>
> > Anton, Nikolay,
> >
> > Thanks for the support.
> >
> > For now, we have the getCurrentPmeDuration() metric, which does not show
> > the influence on the cluster correctly. PME can run without blocking
> > operations, for example on client node join/leave events.
> >
> > I suggest adding a new metric, isOperationsBlockedByPme(). Together, these
> > metrics will show the influence of PME on the cluster and user operations.
> >
> > I have prepared a PR for this (Bot visa is green). [1] Can anyone take a
> > look?
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-11961
> >
> > Tue, Jul 16, 2019 at 14:58, Nikolay Izhikov <nizhi...@apache.org>:
> > >
> > > I think the administrator of an Ignite cluster should be able to monitor
> > > all Ignite processes, including non-blocking PME.
> > >
> > > On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
> > > > BTW,
> > > > Found a PME metric - getCurrentPmeDuration().
> > > > It seems to show exactly the PME time and is not so useful because of
> > > > this.
> > > > The goal is to show exactly the blocking period.
> > > > When PME causes no blocking, it's a good PME and I see no reason to have
> > > > monitoring related to it :)
> > > >
> > > > On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov <nizhi...@apache.org>
> > > > wrote:
> > > >
> > > > > Anton.
> > > > >
> > > > > Why do we need to postpone the implementation of these metrics?
> > > > > For now, the implementation of a new metric is very simple.
> > > > >
> > > > > I think we can implement these metrics as a single contribution.
> > > > >
> > > > > On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
> > > > > > Nikita,
> > > > > >
> > > > > > Looks like all we need now is one simple metric: are operations
> > > > > > blocked?
> > > > > > Just a true or false.
> > > > > > Let's start from this.
> > > > > > All other metrics can be extracted from logs now and can be
> > > > > > implemented later.
> > > > > >
> > > > > > On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov <nizhi...@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > > > +1.
> > > > > > >
> > > > > > > Nikita, please, go ahead.
> > > > > > >
> > > > > > > Tue, Jul 16, 2019, 11:45 Nikita Amelchev <nsamelc...@gmail.com>:
> > > > > > >
> > > > > > > > Hello, Igniters.
> > > > > > > >
> > > > > > > > I suggest adding some useful metrics about the partition map
> > > > > > > > exchange (PME). For now, the duration of PME stages is available
> > > > > > > > only in log files and cannot be obtained using JMX or other
> > > > > > > > external tools. [1]
> > > > > > > >
> > > > > > > > I made a list of local node metrics that help to understand the
> > > > > > > > actual status of the current PME:
> > > > > > > >
> > > > > > > > 1. initialVersion. Topology version that initiates the exchange.
> > > > > > > > 2. initTime. Time PME was started.
> > > > > > > > 3. initEvent. Event that triggered PME.
> > > > > > > > 4. partitionReleaseTime.
> > > > > > > > Time when a node has finished waiting for all updates and
> > > > > > > > transactions on a previous topology.
> > > > > > > > 5. sendSingleMessageTime. Time when a node sent a single message.
> > > > > > > > 6. recieveFullMessageTime. Time when a node received a full
> > > > > > > > message.
> > > > > > > > 7. finishTime. Time PME was ended.
> > > > > > > >
> > > > > > > > When a new PME starts, all these metrics are reset.
> > > > > > > >
> > > > > > > > These metrics help to understand:
> > > > > > > > - how long PME took (current or previous).
> > > > > > > > - how long the wait for all updates to complete took.
> > > > > > > > - what node blocks PME (didn't send a single message)
> > > > > > > > - what triggered PME.
> > > > > > > >
> > > > > > > > Thoughts?
> > > > > > > >
> > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best wishes,
> > > > > > > > Amelchev Nikita
> >
> > --
> > Best wishes,
> > Amelchev Nikita

--
Best wishes,
Amelchev Nikita
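
For the timeSinceOperationsBlocked proposal at the top of this message, the cluster-wide value could be aggregated roughly as follows. This is only a sketch under assumptions: the class and method names are hypothetical, each node is assumed to report the timestamp (epoch milliseconds) when blocking started, and 0 means the node is not blocked.

```java
// Hypothetical aggregation, not Ignite API: picks the earliest blocking start
// among the per-node values, treating 0 as "operations are not blocked".
final class TimeSinceOperationsBlockedSketch {
    /**
     * @param nodeValues per-node blocking start timestamps, 0 if not blocked
     * @return the minimal non-zero timestamp, or 0 if no node is blocked
     */
    static long clusterValue(long[] nodeValues) {
        long min = 0L;
        for (long v : nodeValues) {
            // Skip nodes that are not blocked; keep the earliest start otherwise.
            if (v > 0L && (min == 0L || v < min))
                min = v;
        }
        return min;
    }
}
```

Skipping zeros matters here: a plain minimum would return 0 as soon as a single node is not blocked, which would hide blocking on the other nodes.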