Re[2]: Partition map exchange metrics

Zhenya Stanilovsky Tue, 23 Jul 2019 23:11:36 -0700

+1 to Anton's decisions.
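
To make the two metrics discussed below concrete, here is a minimal sketch of how "total blocked time" and "are we blocked right now" (plus the long variant, the duration of the current blocking period) could be tracked. The class name, method names and the injectable clock are all hypothetical, made up for illustration; this is not the Ignite API and not the code from the PR.

```java
import java.util.function.LongSupplier;

/** Hypothetical holder for the blocking metrics discussed in this thread; not Ignite API. */
final class BlockingMetricsSketch {
    /** Injectable clock (assumption), e.g. System::currentTimeMillis. */
    private final LongSupplier clockMillis;

    /** Time accumulated over finished blocking periods. */
    private long finishedBlockedMillis;

    /** Start of the current blocking period, or -1 when operations are not blocked. */
    private long blockStartMillis = -1;

    BlockingMetricsSketch(LongSupplier clockMillis) {
        this.clockMillis = clockMillis;
    }

    /** Called when a PME starts blocking cache operations. */
    synchronized void onBlockingStarted() {
        if (blockStartMillis < 0)
            blockStartMillis = clockMillis.getAsLong();
    }

    /** Called when blocking ends; the finished period is added to the total. */
    synchronized void onBlockingFinished() {
        if (blockStartMillis >= 0) {
            finishedBlockedMillis += clockMillis.getAsLong() - blockStartMillis;
            blockStartMillis = -1;
        }
    }

    /** Boolean variant: are operations blocked right now? */
    synchronized boolean isBlocked() {
        return blockStartMillis >= 0;
    }

    /** Long variant: duration of the current blocking period, 0 when not blocked. */
    synchronized long currentBlockingDurationMillis() {
        return blockStartMillis < 0 ? 0 : clockMillis.getAsLong() - blockStartMillis;
    }

    /** Total blocked time, including the period still in progress. */
    synchronized long totalBlockedMillis() {
        return finishedBlockedMillis + currentBlockingDurationMillis();
    }
}
```

In this sketch the total also counts the period still in progress, which is one possible choice; accumulating only at the end of the exchange, as proposed in the thread, would drop the last term in totalBlockedMillis().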


>Wednesday, 24 July 2019, 8:44 +03:00 from Anton Vinogradov <a...@apache.org>:
>
>Folks,
>
>It looks like we're trying to implement "extended debug" instead of
>"monitoring".
>A real admin should not care which phase of PME is in
>progress and so on.
>The interesting metrics are
>- total blocked time (will be used for real SLA counting)
>- are we blocked right now (shows we have an SLA degradation right now)
>Duration of the current blocking period can be easily presented using any
>modern monitoring tool by regular checks.
>The first true will mean "period start"; precision will be a result of
>the check frequency.
>Anyway, I'm ok with presenting the current metric as a long, where the long is a
>duration. I see no reason for it, but ok :)
>
>All other features you mentioned are useful for improving code or
>deployment and can (should) be taken from logs at the analysis
>phase.
>
>On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov < ivan.glu...@gmail.com > wrote:
>
>> Folks, let me step in.
>>
>> Nikita, thanks for your suggestions!
>>
>> > 1. initialVersion. Topology version that initiates the exchange.
>> > 2. initTime. Time PME was started.
>> > 3. initEvent. Event that triggered PME.
>> > 4. partitionReleaseTime. Time when a node has finished waiting for all
>> > updates and transactions on a previous topology.
>> > 5. sendSingleMessageTime. Time when a node sent a single message.
>> > 6. recieveFullMessageTime. Time when a node received a full message.
>> > 7. finishTime. Time PME was ended.
>> >
>> > When a new PME starts, all these metrics are reset.
>> Every metric from Nikita's list looks useful and simple to implement.
>> I think that it would be better to change format of metrics 4, 5, 6 and
>> 7 a bit: we can keep only difference between time of previous event and
>> time of corresponding event. Such metrics would be easier to perceive:
>> they answer specific questions like "how much time did partition release
>> take?" or "how much time did awaiting of distributed phase end take?".
>> Also, if results of 4, 5, 6, 7 are exported to a monitoring system,
>> graphs will show how the times of different stages change from one PME to another.
>>
>> > When PME causes no blocking, it's a good PME and I see no reason to have
>> > monitoring related to it
>> Agree with Anton here. These metrics should be measured only for true
>> distributed exchange. Saving results for client leave/join PMEs will
>> just complicate monitoring.
>>
>> > I agree with total blocking duration metric but
>> > I still don't understand why instant value indicating that operations are
>> > blocked should be boolean.
>> > Duration time since blocking has started looks more appropriate and
>> useful.
>> > It gives more information while semantic is left the same.
>> Totally agree with Pavel here. Both "accumulated block time" and
>> "current PME block time" metrics are useful. Growth of accumulated
>> metric for specific period of time (should be easy to check via
>> monitoring system graph) will show for how much business operations were
>> blocked in total, and non-zero current metric will show that we are
>> experiencing issues right now. Boolean metric "are we blocked right now"
>> is not needed as it obviously can be inferred from "current PME block
>> time".
>>
>> Best Regards,
>> Ivan Rakov
>>
>> On 23.07.2019 16:02, Pavel Kovalenko wrote:
>> > Nikita,
>> >
>> > I agree with total blocking duration metric but
>> > I still don't understand why instant value indicating that operations are
>> > blocked should be boolean.
>> > Duration time since blocking has started looks more appropriate and
>> useful.
>> > It gives more information while semantic is left the same.
>> >
>> >
>> >
>> > Tue, 23 Jul 2019 at 11:42, Nikita Amelchev < nsamelc...@gmail.com >:
>> >
>> >> Folks,
>> >>
>> >> All previous suggestions have some disadvantages. There can be several
>> >> exchanges between two metric updates, and a fast exchange can overwrite
>> >> a previous long exchange.
>> >>
>> >> We can introduce a metric of total blocking duration that will
>> >> accumulate at the end of the exchange. So, users will get actual
>> >> information about how long operations were blocked. Cluster metric
>> >> will be a maximum of local nodes metrics. And we need a boolean metric
>> >> that will indicate realtime status. It is needed because the duration
>> >> metric is updated only at the end of the exchange.
>> >>
>> >> So I propose to change the current metric that is not released yet to the
>> >> totalCacheOperationsBlockingDuration metric and to add the
>> >> isCacheOperationsBlocked metric.
>> >>
>> >> WDYT?
>> >>
>> >> Mon, 22 Jul 2019 at 09:27, Anton Vinogradov < a...@apache.org >:
>> >>> Nikolay,
>> >>>
>> >>> Still see no reason to replace boolean with long.
>> >>>
>> >>> On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov < nizhi...@apache.org >
>> >> wrote:
>> >>>> Anton.
>> >>>>
>> >>>> 1. Value exported based on SPI settings, not in the moment it changed.
>> >>>>
>> >>>> 2. Clock synchronisation - if we export start time, we should also
>> >> export
>> >>>> node local timestamp.
>> >>>>
>> >>>> Mon, 22 July 2019, 8:33 Anton Vinogradov < a...@apache.org >:
>> >>>>
>> >>>>> Folks,
>> >>>>>
>> >>>>> What's the reason for duration counting?
>> >>>>> AFAIU, it's a monitoring system feature to count the durations.
>> >>>>> Since monitoring system checks metrics periodically it will know the
>> >>>>> duration by its own log.
>> >>>>>
>> >>>>> On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko < jokse...@gmail.com >
>> >>>>> wrote:
>> >>>>>
>> >>>>>> Nikita,
>> >>>>>>
>> >>>>>> Yes, I mean duration not timestamp. For the metric name, I suggest
>> >>>>>> "cacheOperationsBlockingDuration", I think it more clearly represents
>> >> what
>> >>>> is
>> >>>>>> blocked during PME.
>> >>>>>> We can also combine both timestamp
>> >> "cacheOperationsBlockingStartTs" and
>> >>>>>> duration to have better correlation when cache operations were
>> >> blocked
>> >>>>> and
>> >>>>>> how much time it's taken.
>> >>>>>> For instant view (like in JMX bean) a calculated value as you
>> >> mentioned
>> >>>>>> can be used.
>> >>>>>> For metrics that are exported to some backend (IEP-35) a counter can be
>> >>>> used.
>> >>>>>> The counter is incremented by blocking time after blocking has
>> >> ended.
>> >>>>>> Fri, 19 Jul 2019 at 19:10, Nikita Amelchev < nsamelc...@gmail.com
>> >>> :
>> >>>>>>> Pavel,
>> >>>>>>>
>> >>>>>>> The main purpose of this metric is
>> >>>>>>>>> how much time we wait for resuming cache operations
>> >>>>>>> Seems I misunderstood you. Do you mean timestamp or duration here?
>> >>>>>>>>> What do you think if we change the boolean value of metric to a
>> >>>> long
>> >>>>>>> value that represents time in milliseconds when operations were
>> >>>> blocked?
>> >>>>>>> This time can be calculated as (currentTime -
>> >>>>>>> timeSinceOperationsBlocked) in case of timestamp.
>> >>>>>>>
>> >>>>>>> Duration will be more understandable. It'll be something like
>> >>>>>>> getCurrentBlockingPmeDuration. But I haven't come up with a better
>> >>>>>>> name yet.
>> >>>>>>>
>> >>>>>>> Fri, 19 Jul 2019 at 18:30, Pavel Kovalenko < jokse...@gmail.com
>> >>> :
>> >>>>>>>> Nikita,
>> >>>>>>>>
>> >>>>>>>> I think getCurrentPmeDuration doesn't show useful information.
>> >> The
>> >>>>> main
>> >>>>>>> PME side effect for end-users is blocking cache operations. Not
>> >> all
>> >>>> PME
>> >>>>>>> time blocks it.
>> >>>>>>>> What information does the timestamp
>> >>>>>>> "timeSinceOperationsBlocked" give to an end-user? What analysis can it be used for and
>> >>>> how?
>> >>>>>>>> Fri, 19 Jul 2019 at 17:48, Nikita Amelchev <
>> >>  nsamelc...@gmail.com
>> >>>>> :
>> >>>>>>>>> Hi Pavel,
>> >>>>>>>>>
>> >>>>>>>>> This time already can be obtained from the
>> >> getCurrentPmeDuration
>> >>>> and
>> >>>>>>>>> new isOperationsBlockedByPme metrics.
>> >>>>>>>>>
>> >>>>>>>>> As an alternative solution, I can rework recently added
>> >>>>>>>>> getCurrentPmeDuration metric (not released yet). It seems
>> >> useless for users
>> >>>>>>>>> in case of non-blocking PME.
>> >>>>>>>>> Let's name it timeSinceOperationsBlocked. It'll be the timestamp
>> >> when
>> >>>>>>>>> blocking started (minimal value of cluster nodes) and 0 if
>> >> blocking
>> >>>>>>>>> ends (there is no running PME).
>> >>>>>>>>>
>> >>>>>>>>> WDYT?
>> >>>>>>>>>
>> >>>>>>>>> Fri, 19 Jul 2019 at 15:56, Pavel Kovalenko <
>> >>  jokse...@gmail.com >:
>> >>>>>>>>>> Hi Nikita,
>> >>>>>>>>>>
>> >>>>>>>>>> Thank you for working on this. What do you think if we
>> >> change the
>> >>>>>>> boolean
>> >>>>>>>>>> value of metric to a long value that represents time in
>> >>>>> milliseconds
>> >>>>>>> when
>> >>>>>>>>>> operations were blocked?
>> >>>>>>>>>> Since we have not only JMX and now metrics are periodically
>> >>>>> exported
>> >>>>>>> to
>> >>>>>>>>>> some backend it can give a more clear picture of how much
>> >> time we
>> >>>>>>> wait for
>> >>>>>>>>>> resuming cache operations instead of instant boolean
>> >> indicator.
>> >>>>>>>>>> Fri, 19 Jul 2019 at 14:41, Nikita Amelchev <
>> >>>>  nsamelc...@gmail.com
>> >>>>>> :
>> >>>>>>>>>>> Anton, Nikolay,
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thanks for the support.
>> >>>>>>>>>>>
>> >>>>>>>>>>> For now, we have the getCurrentPmeDuration() metric that
>> >> does
>> >>>> not
>> >>>>>>> show
>> >>>>>>>>>>> influence on the cluster correctly. PME can be without
>> >> blocking
>> >>>>>>>>>>> operations. For example, client node join/leave events.
>> >>>>>>>>>>>
>> >>>>>>>>>>> I suggest adding a new metric - isOperationsBlockedByPme().
>> >>>> Together,
>> >>>>>>> these
>> >>>>>>>>>>> metrics will show influence of the PME on cluster and user
>> >>>>>>> operations.
>> >>>>>>>>>>> I have prepared PR for this (Bot visa is green). [1] Can
>> >> anyone
>> >>>>>>> take a
>> >>>>>>>>>>> look?
>> >>>>>>>>>>>
>> >>>>>>>>>>> [1]  https://issues.apache.org/jira/browse/IGNITE-11961
>> >>>>>>>>>>>
>> >>>>>>>>>>> Tue, 16 Jul 2019 at 14:58, Nikolay Izhikov <
>> >>>>>  nizhi...@apache.org
>> >>>>>>>> :
>> >>>>>>>>>>>> I think an administrator of an Ignite cluster should be able to
>> >>>>> monitor
>> >>>>>>> all
>> >>>>>>>>>>> Ignite processes, including non-blocking PME.
>> >>>>>>>>>>>> On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
>> >>>>>>>>>>>>> BTW,
>> >>>>>>>>>>>>> Found PME metric - getCurrentPmeDuration().
>> >>>>>>>>>>>>> Seems, it shows exactly PME time and not so useful
>> >> because
>> >>>> of
>> >>>>>>> this.
>> >>>>>>>>>>>>> The goal is to show exactly the blocking period.
>> >>>>>>>>>>>>> When PME causes no blocking, it's a good PME and I see
>> >> no
>> >>>>>>> reason to have
>> >>>>>>>>>>>>> monitoring related to it :)
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov <
>> >>>>>>>  nizhi...@apache.org >
>> >>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>> Anton.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Why do we need to postpone implementation of these
>> >>>> metrics?
>> >>>>>>>>>>>>>> For now, implementation of new metric is very simple.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> I think we can implement these metrics as a single
>> >>>>>>> contribution.
>> >>>>>>>>>>>>>> On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov
>> >> wrote:
>> >>>>>>>>>>>>>>> Nikita,
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Looks like all we need now is a 1 simple metric:
>> >> are
>> >>>>>>> operations
>> >>>>>>>>>>> blocked?
>> >>>>>>>>>>>>>>> Just a true or false.
>> >>>>>>>>>>>>>>> Let's start from this.
>> >>>>>>>>>>>>>>> All other metrics can be extracted from logs now
>> >> and
>> >>>> can
>> >>>>> be
>> >>>>>>>>>>> implemented
>> >>>>>>>>>>>>>>> later.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov <
>> >>>>>>>>>>>  nizhi...@apache.org >
>> >>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> +1.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Nikita, please, go ahead.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Tue, 16 July 2019, 11:45 Nikita Amelchev <
>> >>>>>>>  nsamelc...@gmail.com
>> >>>>>>>>>>>> :
>> >>>>>>>>>>>>>>>>> Hello, Igniters.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> I suggest adding some useful metrics about the
>> >>>>>>> partition map
>> >>>>>>>>>>> exchange
>> >>>>>>>>>>>>>>>>> (PME). For now, the duration of PME stages
>> >>>> is available
>> >>>>>>> only in
>> >>>>>>>>>>> log
>> >>>>>>>>>>>>>> files
>> >>>>>>>>>>>>>>>>> and cannot be obtained using JMX or other
>> >> external
>> >>>>>>> tools. [1]
>> >>>>>>>>>>>>>>>>> I made the list of local node metrics that
>> >> help to
>> >>>>>>> understand
>> >>>>>>>>>>> the
>> >>>>>>>>>>>>>>>>> actual status of current PME:
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> 1. initialVersion. Topology version that
>> >> initiates
>> >>>>> the
>> >>>>>>>>>>> exchange.
>> >>>>>>>>>>>>>>>>> 2. initTime. Time PME was started.
>> >>>>>>>>>>>>>>>>> 3. initEvent. Event that triggered PME.
>> >>>>>>>>>>>>>>>>> 4. partitionReleaseTime. Time when a node has
>> >>>>> finished
>> >>>>>>> waiting
>> >>>>>>>>>>> for
>> >>>>>>>>>>>>>> all
>> >>>>>>>>>>>>>>>>> updates and transactions on a previous
>> >> topology.
>> >>>>>>>>>>>>>>>>> 5. sendSingleMessageTime. Time when a node
>> >> sent a
>> >>>>>>> single
>> >>>>>>>>>>> message.
>> >>>>>>>>>>>>>>>>> 6. recieveFullMessageTime. Time when a node
>> >>>> received
>> >>>>> a
>> >>>>>>> full
>> >>>>>>>>>>> message.
>> >>>>>>>>>>>>>>>>> 7. finishTime. Time PME was ended.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> When a new PME starts, all these metrics are reset.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> These metrics help to understand:
>> >>>>>>>>>>>>>>>>> - how long PME was (current or previous).
>> >>>>>>>>>>>>>>>>> - how long the wait for all updates
>> >> to complete was.
>> >>>>>>>>>>>>>>>>> - what node blocks PME (didn't send a single
>> >>>> message)
>> >>>>>>>>>>>>>>>>> - what triggered PME.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Thoughts?
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> [1]
>> >>>>>  https://issues.apache.org/jira/browse/IGNITE-11961
>> >>>>>>>>>>>>>>>>> --
>> >>>>>>>>>>>>>>>>> Best wishes,
>> >>>>>>>>>>>>>>>>> Amelchev Nikita
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> --
>> >>>>>>>>>>> Best wishes,
>> >>>>>>>>>>> Amelchev Nikita
>> >>>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> --
>> >>>>>>>>> Best wishes,
>> >>>>>>>>> Amelchev Nikita
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>> Best wishes,
>> >>>>>>> Amelchev Nikita
>> >>>>>>>
>> >>
>> >>
>> >> --
>> >> Best wishes,
>> >> Amelchev Nikita
>> >>
>>
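
As a footnote to the "monitoring tool can count the duration by regular checks" point quoted above: a rough sketch of what the monitoring side would do with periodic samples of the boolean metric. The class and its methods are hypothetical and assume a fixed check interval; the estimate is only as precise as that interval, and a blocking period that starts and ends between two checks is not seen at all.

```java
/** Hypothetical monitoring-side estimator; assumes samples arrive at a fixed interval. */
final class PolledBlockingEstimator {
    /** Interval between two checks of the boolean metric. */
    private final long intervalMillis;

    /** Estimated length of the blocking period currently in progress. */
    private long currentEstimateMillis;

    /** Estimated total blocked time over all samples seen so far. */
    private long totalEstimateMillis;

    PolledBlockingEstimator(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    /** Feeds one periodic sample of the "blocked right now" metric. */
    void onSample(boolean blocked) {
        if (blocked) {
            // Each "true" sample is counted as one full interval of blocking.
            currentEstimateMillis += intervalMillis;
            totalEstimateMillis += intervalMillis;
        }
        else
            currentEstimateMillis = 0; // The blocking period, if any, has ended.
    }

    long currentEstimateMillis() {
        return currentEstimateMillis;
    }

    long totalEstimateMillis() {
        return totalEstimateMillis;
    }
}
```

With a 1000 ms interval, the samples false, true, true give a current estimate of 2000 ms; a following false resets the current estimate to 0 while the total stays at 2000 ms.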


-- 
Zhenya Stanilovsky
