Re: Partition map exchange metrics

Nikita Amelchev Fri, 19 Jul 2019 07:48:55 -0700

Hi Pavel,

This time can already be obtained from the getCurrentPmeDuration and
new isOperationsBlockedByPme metrics.


As an alternative solution, I can rework the recently added
getCurrentPmeDuration metric (not released yet). It seems useless for
users in the case of non-blocking PME.
Let's name it timeSinceOperationsBlocked. It will be the timestamp when
blocking started (the minimal value across cluster nodes) and 0 when
blocking has ended (there is no running PME).
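
A minimal sketch of the semantics proposed above, for illustration only.
The class and method names (OperationsBlockedMetric, onBlockingStarted,
onBlockingFinished, clusterValue) are hypothetical and are not part of
the Ignite code base or of the PR:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.LongStream;

/** Hypothetical holder for the proposed metric; not actual Ignite code. */
final class OperationsBlockedMetric {
    /** Timestamp (ms) when blocking started, or 0 if operations are not blocked. */
    private final AtomicLong blockedSince = new AtomicLong();

    /** Records the start of blocking; an earlier start that is still active is kept. */
    void onBlockingStarted(long nowMs) {
        blockedSince.compareAndSet(0L, nowMs);
    }

    /** Resets the metric once operations are unblocked (no running PME). */
    void onBlockingFinished() {
        blockedSince.set(0L);
    }

    /** Local node value of the metric. */
    long timeSinceOperationsBlocked() {
        return blockedSince.get();
    }

    /** Cluster-wide value: the minimal non-zero node value, or 0 if no node is blocked. */
    static long clusterValue(long... nodeValues) {
        return LongStream.of(nodeValues).filter(v -> v > 0L).min().orElse(0L);
    }
}
```

A monitoring backend could then derive the blocking duration as the
current time minus the reported value whenever the value is non-zero.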

WDYT?

Fri, 19 Jul 2019 at 15:56, Pavel Kovalenko <jokse...@gmail.com>:
>
> Hi Nikita,
>
> Thank you for working on this. What do you think about changing the boolean
> value of the metric to a long value that represents the time in milliseconds
> during which operations were blocked?
> Since we have not only JMX, and metrics are now periodically exported to
> some backend, it can give a clearer picture of how much time we wait for
> cache operations to resume than an instant boolean indicator.
>
> Fri, 19 Jul 2019 at 14:41, Nikita Amelchev <nsamelc...@gmail.com>:
>
> > Anton, Nikolay,
> >
> > Thanks for the support.
> >
> > For now, we have the getCurrentPmeDuration() metric, which does not show
> > the influence on the cluster correctly. PME can happen without blocking
> > operations, for example on client node join/leave events.
> >
> > I suggest adding a new metric, isOperationsBlockedByPme(). Together, these
> > metrics will show the influence of PME on the cluster and user operations.
> >
> > I have prepared a PR for this (Bot visa is green). [1] Can anyone take a
> > look?
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-11961
> >
> > Tue, 16 Jul 2019 at 14:58, Nikolay Izhikov <nizhi...@apache.org>:
> >
> > >
> > > I think the administrator of an Ignite cluster should be able to monitor all
> > > Ignite processes, including non-blocking PME.
> > >
> > > On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
> > > > BTW,
> > > > Found a PME metric - getCurrentPmeDuration().
> > > > It seems to show the exact PME time and is not so useful because of this.
> > > > The goal is to show exactly the blocking period.
> > > > When PME causes no blocking, it's a good PME and I see no reason to have
> > > > monitoring related to it :)
> > > >
> > > > On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov <nizhi...@apache.org> wrote:
> > > >
> > > > > Anton.
> > > > >
> > > > > Why do we need to postpone the implementation of these metrics?
> > > > > For now, implementing a new metric is very simple.
> > > > >
> > > > > I think we can implement these metrics as a single contribution.
> > > > >
> > > > > On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
> > > > > > Nikita,
> > > > > >
> > > > > > Looks like all we need now is one simple metric: are operations blocked?
> > > > > > Just true or false.
> > > > > > Let's start with this.
> > > > > > All other metrics can be extracted from logs now and can be implemented
> > > > > > later.
> > > > > >
> > > > > > On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov <nizhi...@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > > > +1.
> > > > > > >
> > > > > > > Nikita, please, go ahead.
> > > > > > >
> > > > > > >
> > > > > > > Tue, 16 Jul 2019, 11:45 Nikita Amelchev <nsamelc...@gmail.com>:
> > > > > > >
> > > > > > > > Hello, Igniters.
> > > > > > > >
> > > > > > > > I suggest adding some useful metrics about the partition map exchange
> > > > > > > > (PME). For now, the duration of PME stages is available only in log files
> > > > > > > > and cannot be obtained using JMX or other external tools. [1]
> > > > > > > >
> > > > > > > > I made a list of local node metrics that help to understand the
> > > > > > > > actual status of the current PME:
> > > > > > > >
> > > > > > > > 1. initialVersion. Topology version that initiates the exchange.
> > > > > > > > 2. initTime. Time when PME started.
> > > > > > > > 3. initEvent. Event that triggered PME.
> > > > > > > > 4. partitionReleaseTime. Time when a node finished waiting for all
> > > > > > > > updates and transactions on the previous topology.
> > > > > > > > 5. sendSingleMessageTime. Time when a node sent a single message.
> > > > > > > > 6. recieveFullMessageTime. Time when a node received a full message.
> > > > > > > > 7. finishTime. Time when PME ended.
> > > > > > > >
> > > > > > > > When a new PME starts, all these metrics are reset.
> > > > > > > >
> > > > > > > > These metrics help to understand:
> > > > > > > > - how long PME took (current or previous).
> > > > > > > > - how long the wait for all updates to complete took.
> > > > > > > > - which node blocks PME (didn't send a single message).
> > > > > > > > - what triggered PME.
> > > > > > > >
> > > > > > > > Thoughts?
> > > > > > > >
> > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best wishes,
> > > > > > > > Amelchev Nikita
> > > > > > > >
> >
> >
> >
> > --
> > Best wishes,
> > Amelchev Nikita
> >



-- 
Best wishes,
Amelchev Nikita
