Folks,

All the previous suggestions share a drawback: several exchanges can
happen between two metric updates, and a fast exchange can overwrite
the value left by a preceding long exchange.

We can introduce a metric of total blocking duration that is
accumulated at the end of each exchange. That way, users will get
accurate information about how long operations were blocked. The
cluster metric will be the maximum of the local node metrics. We also
need a boolean metric that indicates the real-time status, because the
duration metric is updated only at the end of the exchange.

So I propose to change the current metric, which has not been released
yet, to the totalCacheOperationsBlockingDuration metric and to add the
isCacheOperationsBlocked metric.
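
To illustrate the idea, here is a minimal sketch of how the two metrics
could behave on a single node. It is not Ignite code: the class name
BlockingMetricsSketch, the onBlockingStarted/onBlockingFinished hooks and
the clusterValue helper are hypothetical names used only for this example,
and timestamps are passed in explicitly to keep it self-contained.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch (not Ignite API): per-node blocking metrics. */
class BlockingMetricsSketch {
    /** Sentinel meaning "cache operations are not blocked right now". */
    private static final long NOT_BLOCKED = -1L;

    /** Local time (ms) when the current blocking started, or NOT_BLOCKED. */
    private final AtomicLong blockingStartMs = new AtomicLong(NOT_BLOCKED);

    /** Sum of all finished blocking periods, in milliseconds. */
    private final AtomicLong totalBlockingMs = new AtomicLong();

    /** Assumed hook: called when an exchange starts blocking cache operations. */
    void onBlockingStarted(long nowMs) {
        blockingStartMs.compareAndSet(NOT_BLOCKED, nowMs);
    }

    /** Assumed hook: called at the end of the exchange; accumulates the duration. */
    void onBlockingFinished(long nowMs) {
        long start = blockingStartMs.getAndSet(NOT_BLOCKED);

        if (start != NOT_BLOCKED)
            totalBlockingMs.addAndGet(Math.max(0L, nowMs - start));
    }

    /** Real-time status: true while a blocking period is in progress. */
    boolean isCacheOperationsBlocked() {
        return blockingStartMs.get() != NOT_BLOCKED;
    }

    /** Accumulated duration; changes only when a blocking period ends. */
    long totalCacheOperationsBlockingDuration() {
        return totalBlockingMs.get();
    }

    /** Cluster-wide value taken as the maximum of the local node values. */
    static long clusterValue(long... localTotals) {
        long max = 0L;

        for (long v : localTotals)
            max = Math.max(max, v);

        return max;
    }
}
```

Because the duration only grows, a short exchange that follows a long one
adds to the total instead of replacing it, so nothing is lost between two
metric updates.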

WDYT?

Mon, 22 Jul 2019 at 09:27, Anton Vinogradov <a...@apache.org>:
>
> Nikolay,
>
> Still see no reason to replace boolean with long.
>
> On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov <nizhi...@apache.org> wrote:
>
> > Anton.
> >
> > 1. The value is exported according to the SPI settings, not at the moment it changes.
> >
> > 2. Clock synchronisation - if we export start time, we should also export
> > node local timestamp.
> >
> > Mon, 22 Jul 2019, 8:33 Anton Vinogradov <a...@apache.org>:
> >
> > > Folks,
> > >
> > > What's the reason for counting the duration?
> > > AFAIU, counting durations is a feature of the monitoring system.
> > > Since the monitoring system checks metrics periodically, it will know the
> > > duration from its own log.
> > >
> > > On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko <jokse...@gmail.com>
> > > wrote:
> > >
> > > > Nikita,
> > > >
> > > > Yes, I mean duration, not timestamp. For the metric name, I suggest
> > > > "cacheOperationsBlockingDuration"; I think it more clearly represents
> > > > what is blocked during PME.
> > > > We can also combine the timestamp "cacheOperationsBlockingStartTs" with
> > > > the duration to better correlate when cache operations were blocked and
> > > > how much time it took.
> > > > For an instant view (as in a JMX bean), a calculated value, as you
> > > > mentioned, can be used.
> > > > For metrics exported to some backend (IEP-35), a counter can be used.
> > > > The counter is incremented by the blocking time after blocking has ended.
> > > >
> > > > Fri, 19 Jul 2019 at 19:10, Nikita Amelchev <nsamelc...@gmail.com>:
> > > >
> > > >> Pavel,
> > > >>
> > > >> The main purpose of this metric is
> > > >> >> how much time we wait for resuming cache operations
> > > >>
> > > >> Seems I misunderstood you. Do you mean timestamp or duration here?
> > > > >> >> What do you think if we change the boolean value of metric to a
> > > > >> >> long value that represents time in milliseconds when operations
> > > > >> >> were blocked?
> > > >>
> > > >> This time can be calculated as (currentTime -
> > > >> timeSinceOperationsBlocked) in case of timestamp.
> > > >>
> > > >> Duration will be more understandable. It'll be something like
> > > >> getCurrentBlockingPmeDuration. But I haven't come up with a better
> > > >> name yet.
> > > >>
> > > > >> Fri, 19 Jul 2019 at 18:30, Pavel Kovalenko <jokse...@gmail.com>:
> > > >> >
> > > >> > Nikita,
> > > >> >
> > > > >> > I think getCurrentPmeDuration doesn't show useful information. The
> > > > >> > main PME side effect for end users is the blocking of cache
> > > > >> > operations, and not all of the PME time blocks them.
> > > > >> > What information does the "timeSinceOperationsBlocked" timestamp
> > > > >> > give to an end user? What analysis can it be used for, and how?
> > > >> >
> > > > >> > Fri, 19 Jul 2019 at 17:48, Nikita Amelchev <nsamelc...@gmail.com>:
> > > >> >>
> > > >> >> Hi Pavel,
> > > >> >>
> > > > >> >> This time can already be obtained from the getCurrentPmeDuration and
> > > > >> >> the new isOperationsBlockedByPme metrics.
> > > > >> >>
> > > > >> >> As an alternative solution, I can rework the recently added
> > > > >> >> getCurrentPmeDuration metric (not released yet). It seems useless
> > > > >> >> for users in the case of a non-blocking PME.
> > > > >> >> Let's name it timeSinceOperationsBlocked. It'll be the timestamp when
> > > > >> >> blocking started (the minimal value across cluster nodes) and 0 when
> > > > >> >> blocking has ended (there is no running PME).
> > > >> >>
> > > >> >> WDYT?
> > > >> >>
> > > > >> >> Fri, 19 Jul 2019 at 15:56, Pavel Kovalenko <jokse...@gmail.com>:
> > > >> >> >
> > > >> >> > Hi Nikita,
> > > >> >> >
> > > > >> >> > Thank you for working on this. What do you think if we change the
> > > > >> >> > boolean value of metric to a long value that represents time in
> > > > >> >> > milliseconds when operations were blocked?
> > > > >> >> > Since we have not only JMX, and metrics are now periodically exported
> > > > >> >> > to some backend, it can give a clearer picture of how much time we
> > > > >> >> > wait for resuming cache operations than an instant boolean indicator.
> > > >> >> >
> > > > >> >> > Fri, 19 Jul 2019 at 14:41, Nikita Amelchev <nsamelc...@gmail.com>:
> > > >> >> >
> > > >> >> > > Anton, Nikolay,
> > > >> >> > >
> > > >> >> > > Thanks for the support.
> > > >> >> > >
> > > > >> >> > > For now, we have the getCurrentPmeDuration() metric, which does not
> > > > >> >> > > show the influence on the cluster correctly. A PME can run without
> > > > >> >> > > blocking operations, for example on client node join/leave events.
> > > > >> >> > >
> > > > >> >> > > I suggest adding a new metric, isOperationsBlockedByPme(). Together,
> > > > >> >> > > these metrics will show the influence of the PME on the cluster and
> > > > >> >> > > on user operations.
> > > > >> >> > >
> > > > >> >> > > I have prepared a PR for this (Bot visa is green). [1] Can anyone
> > > > >> >> > > take a look?
> > > >> >> > >
> > > >> >> > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
> > > >> >> > >
> > > > >> >> > > Tue, 16 Jul 2019 at 14:58, Nikolay Izhikov <nizhi...@apache.org>:
> > > >> >> > >
> > > >> >> > > >
> > > > >> >> > > > I think the administrator of an Ignite cluster should be able to
> > > > >> >> > > > monitor all Ignite processes, including non-blocking PME.
> > > >> >> > > >
> > > > >> >> > > > On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
> > > >> >> > > > > BTW,
> > > >> >> > > > > Found PME metric - getCurrentPmeDuration().
> > > > >> >> > > > > Seems it shows exactly the PME time and is not so useful because of
> > > > >> >> > > > > this.
> > > > >> >> > > > > The goal is to show exactly the blocking period.
> > > > >> >> > > > > When a PME causes no blocking, it's a good PME, and I see no reason
> > > > >> >> > > > > to have monitoring related to it :)
> > > >> >> > > > >
> > > > >> >> > > > > On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov <nizhi...@apache.org> wrote:
> > > >> >> > > > >
> > > >> >> > > > > > Anton.
> > > >> >> > > > > >
> > > > >> >> > > > > > Why do we need to postpone the implementation of these metrics?
> > > > >> >> > > > > > For now, implementing a new metric is very simple.
> > > > >> >> > > > > >
> > > > >> >> > > > > > I think we can implement these metrics as a single contribution.
> > > >> >> > > > > >
> > > > >> >> > > > > > On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
> > > >> >> > > > > > > Nikita,
> > > >> >> > > > > > >
> > > > >> >> > > > > > > Looks like all we need now is one simple metric: are operations
> > > > >> >> > > > > > > blocked?
> > > > >> >> > > > > > > Just a true or false.
> > > > >> >> > > > > > > Let's start from this.
> > > > >> >> > > > > > > All other metrics can be extracted from logs now and can be
> > > > >> >> > > > > > > implemented later.
> > > >> >> > > > > > >
> > > > >> >> > > > > > > On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov <nizhi...@apache.org> wrote:
> > > >> >> > > > > > >
> > > >> >> > > > > > > > +1.
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > Nikita, please, go ahead.
> > > >> >> > > > > > > >
> > > >> >> > > > > > > >
> > > > >> >> > > > > > > > Tue, 16 Jul 2019, 11:45 Nikita Amelchev <nsamelc...@gmail.com>:
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > > Hello, Igniters.
> > > >> >> > > > > > > > >
> > > > >> >> > > > > > > > > I suggest adding some useful metrics about the partition map
> > > > >> >> > > > > > > > > exchange (PME). For now, the durations of PME stages are available
> > > > >> >> > > > > > > > > only in log files and cannot be obtained using JMX or other
> > > > >> >> > > > > > > > > external tools. [1]
> > > > >> >> > > > > > > > >
> > > > >> >> > > > > > > > > I made a list of local node metrics that help to understand the
> > > > >> >> > > > > > > > > actual status of the current PME:
> > > >> >> > > > > > > > >
> > > > >> >> > > > > > > > > 1. initialVersion. Topology version that initiates the exchange.
> > > >> >> > > > > > > > > 2. initTime. Time PME was started.
> > > >> >> > > > > > > > > 3. initEvent. Event that triggered PME.
> > > > >> >> > > > > > > > > 4. partitionReleaseTime. Time when a node has finished waiting for
> > > > >> >> > > > > > > > > all updates and transactions on a previous topology.
> > > > >> >> > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a single message.
> > > > >> >> > > > > > > > > 6. recieveFullMessageTime. Time when a node received a full message.
> > > >> >> > > > > > > > > 7. finishTime. Time PME was ended.
> > > >> >> > > > > > > > >
> > > > >> >> > > > > > > > > When a new PME starts, all these metrics are reset.
> > > > >> >> > > > > > > > >
> > > > >> >> > > > > > > > > These metrics help to understand:
> > > > >> >> > > > > > > > > - how long the PME was (current or previous).
> > > > >> >> > > > > > > > > - how long the wait for all updates to complete was.
> > > > >> >> > > > > > > > > - what node blocks the PME (didn't send a single message).
> > > > >> >> > > > > > > > > - what triggered the PME.
> > > >> >> > > > > > > > >
> > > >> >> > > > > > > > > Thoughts?
> > > >> >> > > > > > > > >
> > > >> >> > > > > > > > > [1]
> > > https://issues.apache.org/jira/browse/IGNITE-11961
> > > >> >> > > > > > > > >
> > > >> >> > > > > > > > > --
> > > >> >> > > > > > > > > Best wishes,
> > > >> >> > > > > > > > > Amelchev Nikita
> > > >> >> > > > > > > > >
> > > >> >> > >
> > > >> >> > >
> > > >> >> > >
> > > >> >> > > --
> > > >> >> > > Best wishes,
> > > >> >> > > Amelchev Nikita
> > > >> >> > >
> > > >> >>
> > > >> >>
> > > >> >>
> > > >> >> --
> > > >> >> Best wishes,
> > > >> >> Amelchev Nikita
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Best wishes,
> > > >> Amelchev Nikita
> > > >>
> > > >
> > >
> >



-- 
Best wishes,
Amelchev Nikita
