Re: Partition map exchange metrics

Anton Vinogradov Sun, 21 Jul 2019 22:33:46 -0700

Folks,

What's the reason for counting the duration?
AFAIU, counting durations is a monitoring system feature.
Since the monitoring system checks metrics periodically, it will know the
duration from its own log.


On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko <jokse...@gmail.com> wrote:

> Nikita,
>
> Yes, I mean duration, not timestamp. For the metric name, I suggest
> "cacheOperationsBlockingDuration"; I think it more clearly represents what
> is blocked during PME.
> We can also combine both the timestamp "cacheOperationsBlockingStartTs" and
> the duration to better correlate when cache operations were blocked and
> how much time it took.
> For an instant view (like in a JMX bean), a calculated value as you
> mentioned can be used.
> For metrics that are exported to some backend (IEP-35), a counter can be
> used. The counter is incremented by the blocking time after blocking has
> ended.
>
> On Fri, Jul 19, 2019 at 19:10, Nikita Amelchev <nsamelc...@gmail.com> wrote:
>
>> Pavel,
>>
>> The main purpose of this metric is
>> >> how much time we wait for resuming cache operations
>>
>> Seems I misunderstood you. Do you mean timestamp or duration here?
>> >> What do you think if we change the boolean value of metric to a long
>> >> value that represents time in milliseconds when operations were blocked?
>>
>> This time can be calculated as (currentTime -
>> timeSinceOperationsBlocked) in case of timestamp.
>>
>> Duration will be more understandable. It'll be something like
>> getCurrentBlockingPmeDuration. But I haven't come up with a better
>> name yet.
>>
>> On Fri, Jul 19, 2019 at 18:30, Pavel Kovalenko <jokse...@gmail.com> wrote:
>> >
>> > Nikita,
>> >
>> > I think getCurrentPmeDuration doesn't show useful information. The main
>> > PME side effect for end users is blocking cache operations. Not all PME
>> > time blocks them.
>> > What information does the "timeSinceOperationsBlocked" timestamp give to
>> > an end user? For what analysis can it be used, and how?
>> >
>> > On Fri, Jul 19, 2019 at 17:48, Nikita Amelchev <nsamelc...@gmail.com> wrote:
>> >>
>> >> Hi Pavel,
>> >>
>> >> This time can already be obtained from the getCurrentPmeDuration and
>> >> new isOperationsBlockedByPme metrics.
>> >>
>> >> As an alternative solution, I can rework the recently added
>> >> getCurrentPmeDuration metric (not released yet). It seems useless for
>> >> users in the case of non-blocking PME.
>> >> Let's name it timeSinceOperationsBlocked. It'll be the timestamp when
>> >> blocking started (the minimal value across cluster nodes) and 0 if
>> >> blocking has ended (there is no running PME).
>> >>
>> >> WDYT?
>> >>
>> >> On Fri, Jul 19, 2019 at 15:56, Pavel Kovalenko <jokse...@gmail.com> wrote:
>> >> >
>> >> > Hi Nikita,
>> >> >
>> >> > Thank you for working on this. What do you think if we change the
>> >> > boolean value of the metric to a long value that represents the time
>> >> > in milliseconds when operations were blocked?
>> >> > Since we have not only JMX, and metrics are now periodically exported
>> >> > to some backend, it can give a clearer picture of how much time we
>> >> > wait for cache operations to resume than an instant boolean indicator.
>> >> >
>> >> > On Fri, Jul 19, 2019 at 14:41, Nikita Amelchev <nsamelc...@gmail.com> wrote:
>> >> >
>> >> > > Anton, Nikolay,
>> >> > >
>> >> > > Thanks for the support.
>> >> > >
>> >> > > For now, we have the getCurrentPmeDuration() metric, which does not
>> >> > > show the influence on the cluster correctly. PME can run without
>> >> > > blocking operations, for example on client node join/leave events.
>> >> > >
>> >> > > I suggest adding a new metric, isOperationsBlockedByPme(). Together,
>> >> > > these metrics will show the influence of PME on the cluster and on
>> >> > > user operations.
>> >> > >
>> >> > > I have prepared a PR for this (Bot visa is green). [1] Can anyone
>> >> > > take a look?
>> >> > >
>> >> > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
>> >> > >
>> >> > > On Tue, Jul 16, 2019 at 14:58, Nikolay Izhikov <nizhi...@apache.org> wrote:
>> >> > >
>> >> > > >
>> >> > > > I think the administrator of an Ignite cluster should be able to
>> >> > > > monitor all Ignite processes, including non-blocking PME.
>> >> > > >
>> >> > > > On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
>> >> > > > > BTW,
>> >> > > > > Found a PME metric - getCurrentPmeDuration().
>> >> > > > > It seems to show exactly the PME time and is not so useful
>> >> > > > > because of this.
>> >> > > > > The goal is to show exactly the blocking period.
>> >> > > > > When PME causes no blocking, it's a good PME and I see no reason
>> >> > > > > to have monitoring related to it :)
>> >> > > > >
>> >> > > > > On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov
>> >> > > > > <nizhi...@apache.org> wrote:
>> >> > > > >
>> >> > > > > > Anton.
>> >> > > > > >
>> >> > > > > > Why do we need to postpone the implementation of these metrics?
>> >> > > > > > For now, implementing a new metric is very simple.
>> >> > > > > >
>> >> > > > > > I think we can implement these metrics as a single
>> >> > > > > > contribution.
>> >> > > > > >
>> >> > > > > > On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
>> >> > > > > > > Nikita,
>> >> > > > > > >
>> >> > > > > > > Looks like all we need now is one simple metric: are
>> >> > > > > > > operations blocked?
>> >> > > > > > > Just a true or false.
>> >> > > > > > > Let's start from this.
>> >> > > > > > > All other metrics can be extracted from logs now and can be
>> >> > > > > > > implemented later.
>> >> > > > > > >
>> >> > > > > > > On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov
>> >> > > > > > > <nizhi...@apache.org> wrote:
>> >> > > > > > >
>> >> > > > > > > > +1.
>> >> > > > > > > >
>> >> > > > > > > > Nikita, please, go ahead.
>> >> > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > On Tue, Jul 16, 2019 at 11:45, Nikita Amelchev
>> >> > > > > > > > <nsamelc...@gmail.com> wrote:
>> >> > > > > > > >
>> >> > > > > > > > > Hello, Igniters.
>> >> > > > > > > > >
>> >> > > > > > > > > I suggest adding some useful metrics about the partition
>> >> > > > > > > > > map exchange (PME). For now, the duration of PME stages
>> >> > > > > > > > > is available only in log files and cannot be obtained
>> >> > > > > > > > > using JMX or other external tools. [1]
>> >> > > > > > > > >
>> >> > > > > > > > > I made a list of local node metrics that help to
>> >> > > > > > > > > understand the actual status of the current PME:
>> >> > > > > > > > >
>> >> > > > > > > > > 1. initialVersion. Topology version that initiates the
>> >> > > > > > > > > exchange.
>> >> > > > > > > > > 2. initTime. Time PME was started.
>> >> > > > > > > > > 3. initEvent. Event that triggered PME.
>> >> > > > > > > > > 4. partitionReleaseTime. Time when a node has finished
>> >> > > > > > > > > waiting for all updates and transactions on a previous
>> >> > > > > > > > > topology.
>> >> > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a single
>> >> > > > > > > > > message.
>> >> > > > > > > > > 6. recieveFullMessageTime. Time when a node received a
>> >> > > > > > > > > full message.
>> >> > > > > > > > > 7. finishTime. Time PME ended.
>> >> > > > > > > > >
>> >> > > > > > > > > When a new PME starts, all these metrics are reset.
>> >> > > > > > > > >
>> >> > > > > > > > > These metrics help to understand:
>> >> > > > > > > > > - how long PME took (current or previous).
>> >> > > > > > > > > - how long it waited for all updates to complete.
>> >> > > > > > > > > - which node blocks PME (didn't send a single message).
>> >> > > > > > > > > - what triggered PME.
>> >> > > > > > > > >
>> >> > > > > > > > > Thoughts?
>> >> > > > > > > > >
>> >> > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
>> >> > > > > > > > >
>> >> > > > > > > > > --
>> >> > > > > > > > > Best wishes,
>> >> > > > > > > > > Amelchev Nikita
>> >> > > > > > > > >
>> >> > >
>> >> > >
>> >> > >
>> >> > > --
>> >> > > Best wishes,
>> >> > > Amelchev Nikita
>> >> > >
>> >>
>> >>
>> >>
>> >> --
>> >> Best wishes,
>> >> Amelchev Nikita
>>
>>
>>
>> --
>> Best wishes,
>> Amelchev Nikita
>>
>
