Re: Re[2]: Cache operations performance metrics

Andrey Gura Fri, 20 Dec 2019 06:21:46 -0800

> but between having something and having nothing I choose something

We already have "something": the put, get, etc. metrics. As I said earlier,
they are relatively useless. And the same metrics with histograms don't
add any value.


> I found one grid machine with very different IO usage from the others; «dig deeper» 
> highlighted a cache whose put operations differed greatly from other nodes, and 
> a final «dig deeper» helped to find a code bug

I believe the same could be noticed using PK index stats.

> if the new one would be more useful, why not?

If some particular value is relatively useless, then a histogram of the
same value will still be relatively useless :) That's my point. Stop adding
dozens of metrics and start thinking about their benefits and meaning.
Discuss it with the community.
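
To illustrate the second point from earlier in the thread (measure business
operations in user context, not cache API operations), here is a minimal
sketch of a user-side histogram. It is an assumption-laden illustration, not
an Ignite API: the class name `BusinessOpHistogram`, the bucket bounds and the
`transferFunds` call in the usage note are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.function.Supplier;

/** Hypothetical user-side histogram for one business operation (not part of Ignite). */
final class BusinessOpHistogram {
    /** Upper bounds of the buckets in milliseconds, in ascending order. */
    private final long[] boundsMs;

    /** One counter per bucket plus a final overflow bucket. */
    private final AtomicLongArray counts;

    BusinessOpHistogram(long... boundsMs) {
        this.boundsMs = boundsMs.clone();
        this.counts = new AtomicLongArray(boundsMs.length + 1);
    }

    /** Puts a duration into the first bucket whose bound is >= the duration. */
    void record(long durationMs) {
        int i = 0;
        while (i < boundsMs.length && durationMs > boundsMs[i])
            i++;
        counts.incrementAndGet(i);
    }

    /** Runs the whole business operation and records its wall-clock time. */
    <T> T time(Supplier<T> op) {
        long start = System.nanoTime();
        try {
            return op.get();
        }
        finally {
            record((System.nanoTime() - start) / 1_000_000);
        }
    }

    /** Number of recorded durations in the given bucket. */
    long count(int bucket) {
        return counts.get(bucket);
    }

    /** Total number of recorded durations. */
    long total() {
        long sum = 0;
        for (int i = 0; i < counts.length(); i++)
            sum += counts.get(i);
        return sum;
    }
}
```

A user would wrap the whole operation, for example
`histogram.time(() -> transferFunds(from, to, amount))`, so the measured time
belongs to one known business operation rather than to a mix of unrelated
transactions touching the same cache.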


On Fri, Dec 20, 2019 at 4:59 PM Zhenya Stanilovsky
<arzamas...@mail.ru.invalid> wrote:
>
>
> >> Did it become slower or faster?
> >
> >Very good question! Say the user sees the "cache put time" metric become 2x
> >bigger. Did it become slower or faster? Or did we just put values into the
> >cache that are 4x bigger in size? Or were we putting values locally all
> >the time before, and now we put them on remote nodes? Or our operations
> >execute in a transaction, and then the time depends on the transaction type,
> >the actions in the transaction and other transactions, and actually says
> >nothing about the real cache operation. We have more questions than answers.
>
> Andrey, I hope I understand your point of view here, but between having 
> something and having nothing I choose something; it is sometimes really helpful. 
> A real-life case: I found one grid machine with very different IO usage from 
> the others; «dig deeper» highlighted a cache whose put operations differed 
> greatly from other nodes, and a final «dig deeper» helped to find a code bug. 
> But to be clear, the old mechanism works OK for me here. If the new one would 
> be more useful, why not?
>
> >> On the other hand - if `PuTime` increased - then we know for sure that all 
> >> operations executing `put` became slower.
> >
> >Of course not :) See above.
> >
> >On Fri, Dec 20, 2019 at 3:20 PM Николай Ижиков < nizhi...@apache.org > wrote:
> >>
> >> > It will also be visible in other metrics
> >>
> >> How will it be visible?
> >>
> >> For example, the user sees the «checkpoint time» metric become 2x bigger.
> >> How does it relate to business operations? Did it become slower or faster?
> >> What does it mean for application performance?
> >>
> >> On the other hand - if `PuTime` increased - then we know for sure that all 
> >> operations executing `put` became slower.
> >>
> >> *Why* it became slower is the essence of the «go deeper» investigation.
> >>
> >> > On Dec 20, 2019, at 15:07, Andrey Gura < ag...@apache.org > wrote:
> >> >
> >> >> If a cache has some percentage of relatively slow transactions, this is 
> >> >> a trigger for a deeper investigation.
> >> >
> >> > It will also be visible in other metrics. So cache operation metrics
> >> > are still useless because they are transitive values.
> >> >
> >> >>> 1. Measure some important internals (WAL operations, checkpoint time, 
> >> >>> etc.) because they can indicate real problems.
> >> >
> >> >> We already implement it.
> >> >
> >> > I'm not saying that it isn't implemented. It is just an example of things
> >> > that should be measured. All other metrics depend on internals.
> >> >
> >> >>> 2. Measure business operations in user context, not cache API 
> >> >>> operations.
> >> >
> >> >> Why do you think these approaches should exclude one another?
> >> >
> >> > Because one of them is useless.
> >> >
> >> > On Fri, Dec 20, 2019 at 1:43 PM Николай Ижиков < nizhi...@apache.org > 
> >> > wrote:
> >> >>
> >> >> Hello, Andrey.
> >> >>
> >> >>> Where is the sense in this value? I explained why these metrics are 
> >> >>> relatively useless.
> >> >>
> >> >> I don’t agree with you.
> >> >> I believe they are not useless for a user.
> >> >> And I try to explain why I think so.
> >> >>
> >> >>> But the user can't distinguish one transaction from another, so his 
> >> >>> knowledge definitely doesn't help.
> >> >>
> >> >> Users shouldn’t have to distinguish.
> >> >> If a cache has some percentage of relatively slow transactions, this is 
> >> >> a trigger for a deeper investigation.
> >> >>
> >> >>> 1. Measure some important internals (WAL operations, checkpoint time, 
> >> >>> etc.) because they can indicate real problems.
> >> >>
> >> >> We already implement it.
> >> >> What metrics are missing for internal processes?
> >> >>
> >> >>> 2. Measure business operations in user context, not cache API 
> >> >>> operations.
> >> >>
> >> >> Why do you think these approaches should exclude one another?
> >> >> Users definitely should measure whole business transaction performance.
> >> >>
> >> >> I think we should provide a way to measure the part of the business 
> >> >> transaction that relates to Ignite.
> >> >>
> >> >>
> >> >>> On Dec 20, 2019, at 13:02, Andrey Gura < ag...@apache.org > wrote:
> >> >>>
> >> >>>> The goal of the proposed metrics is to measure overall cache operation 
> >> >>>> behavior.
> >> >>>> They provide some kind of statistics (histograms) for it.
> >> >>>
> >> >>> Nikolay, reformulating doesn't make metrics more meaningful. Seriously 
> >> >>> :)
> >> >>>
> >> >>>> Yes, metrics will evaluate API call performance
> >> >>>
> >> >>> So what? Where is the sense in this value? I explained why these metrics
> >> >>> are relatively useless.
> >> >>>
> >> >>>> These are metrics of client-side operation performance.
> >> >>>
> >> >>> Again. It's just a number without any sense.
> >> >>>
> >> >>>> I think a specific user knows what his transactions are.
> >> >>>
> >> >>> Maybe. But the user can't distinguish one transaction from another, so
> >> >>> his knowledge definitely doesn't help.
> >> >>>
> >> >>>> From these metrics he can answer the question «If my transaction 
> >> >>>> includes cacheXXX, how long does it usually take?»
> >> >>>
> >> >>> Actually not. The same caches can be involved in dozens of
> >> >>> transactions, and there is no way to understand which transactions are
> >> >>> slow or fast. It is useless.
> >> >>>
> >> >>>> I disagree here.
> >> >>>> If you have a better approach to measure cache operations performance 
> >> >>>> - please, share your vision.
> >> >>>
> >> >>> I already wrote about a better approach. Two main points:
> >> >>>
> >> >>> 1. Measure some important internals (WAL operations, checkpoint time,
> >> >>> etc.) because they can indicate real problems.
> >> >>> 2. Measure business operations in user context, not cache API 
> >> >>> operations.
> >> >>>
> >> >>> So what do we have? We have useless metrics that are duplicated by
> >> >>> useless histograms.
> >> >>>
> >> >>> We should reconsider our approach to metrics and performance measurement.
> >> >>> It is a hard and long task. There is no need to commit tons of useless
> >> >>> metrics that just decrease performance.
> >> >>>
> >> >>> Sorry for some sarcasm, but I really believe in my opinion. The metrics
> >> >>> problem has existed for a very long time, and the existing metrics have
> >> >>> been discussed many times. No one can explain these metrics to users
> >> >>> because that requires too much additional knowledge about internals. And
> >> >>> the metric value itself depends on many aspects of internals, which makes
> >> >>> it impossible to interpret. It's a good time to remove them (in AI 3.0,
> >> >>> due to backward compatibility).
> >> >>>
> >> >>> On Thu, Dec 19, 2019 at 9:09 PM Николай Ижиков < 
> >> >>> nizhikov....@gmail.com > wrote:
> >> >>>>
> >> >>>> Hello, Andrey.
> >> >>>>
> >> >>>> The goal of the proposed metrics is to measure overall cache operation 
> >> >>>> behavior.
> >> >>>> They provide some kind of statistics (histograms) for it.
> >> >>>> For more fine-grained analysis one will use tracing or other «go 
> >> >>>> deeper» tools.
> >> >>>>
> >> >>>>>> Measured for API calls on the caller node side
> >> >>>>> Values will be the same only when the node is remote relative to 
> >> >>>>> the data
> >> >>>>
> >> >>>> Yes, the metrics will evaluate API call performance.
> >> >>>> I think this is the most valuable information from a user's point of 
> >> >>>> view.
> >> >>>>
> >> >>>> A regular user wants to know how fast his cache operations perform.
> >> >>>> And these metrics provide the answer.
> >> >>>>
> >> >>>>> For a regular data node (server node), timing will depend on the 
> >> >>>>> answers to these questions:
> >> >>>>
> >> >>>> I think these answers are always available.
> >> >>>> I can barely imagine a scenario where one monitors a «black box» 
> >> >>>> cluster and doesn't know them.
> >> >>>> Even so, all the answers are provided through the system views we 
> >> >>>> brought to Ignite :)
> >> >>>>
> >> >>>>> What is transaction commit or rollback time?
> >> >>>>
> >> >>>> These are metrics of client-side operation performance.
> >> >>>>
> >> >>>> I think a specific user knows what his transactions are.
> >> >>>> From these metrics he can answer the question «If my transaction 
> >> >>>> includes cacheXXX, how long does it usually take?»
> >> >>>> I think it’s very valuable knowledge.
> >> >>>>
> >> >>>>> It will be implemented for most types of messages.
> >> >>>>
> >> >>>> Good, let’s do it?
> >> >>>>
> >> >>>>> So, from my point of view, commits for get/put/remove and 
> >> >>>>> commit/rollback should be reverted.
> >> >>>>
> >> >>>> I disagree here.
> >> >>>> If you have a better approach to measure cache operations performance 
> >> >>>> - please, share your vision.
> >> >>>>
> >> >>>>> On Dec 19, 2019, at 16:03, Andrey Gura < ag...@apache.org > 
> >> >>>>> wrote:
> >> >>>>>
> >> >>>>> From my point of view, Ignite should provide meaningful metrics for
> >> >>>>> internal components that could be useful for monitoring and analysis.
> >> >>>>> All suggested options are meaningless in a sense. Below I'll try to
> >> >>>>> explain why.
> >> >>>>>
> >> >>>>>> * `get`, `put`, `remove` time histograms. Measured for API calls on 
> >> >>>>>> the caller node side.
> >> >>>>>> Implemented in [1], commit [2].
> >> >>>>>
> >> >>>>> All cache operations in Ignite are distributed. So each value measured
> >> >>>>> for some cache operation will vary depending on where the operation is
> >> >>>>> actually performed. Values will be the same only when the node
> >> >>>>> is remote relative to the data (e.g. a client node).
> >> >>>>>
> >> >>>>> For a regular data node (server node), timing will depend on the 
> >> >>>>> answers to these questions:
> >> >>>>>
> >> >>>>> - is the node primary for the particular key or not? (for all operations)
> >> >>>>> - how many backups are configured for the cache? (for put and remove)
> >> >>>>> - what write synchronization mode is configured for the particular cache?
> >> >>>>> (for put and remove)
> >> >>>>> - is readFromBackup enabled for the cache? (for get)
> >> >>>>>
> >> >>>>> Neither Ignite users nor Ignite developers can make any decision based
> >> >>>>> on these metrics.
> >> >>>>>
> >> >>>>>> * `commit`, `rollback` time histograms. Measured for API calls on 
> >> >>>>>> the caller node side [3].
> >> >>>>>
> >> >>>>> What is transaction commit or rollback time? How is it calculated in
> >> >>>>> Ignite now? What actions are included in a transaction? What actions
> >> >>>>> not related to the cache are executed during transactions?
> >> >>>>>
> >> >>>>> There is no sense in the time of a transaction commit or rollback
> >> >>>>> because there is no way to understand which transaction was
> >> >>>>> performed in a particular period of time. Usually there are a lot of
> >> >>>>> transactions and we can't distinguish them from each other.
> >> >>>>>
> >> >>>>> Moreover, a transaction is usually treated as a business operation. So
> >> >>>>> the only way to measure performance properly is to measure business
> >> >>>>> operation time. That is, the user should create their own metrics set
> >> >>>>> for some business API.
> >> >>>>>
> >> >>>>> Further, what about cross-cache transactions? At the moment, tx
> >> >>>>> commit/rollback time will be added to the corresponding metrics of each
> >> >>>>> cache involved in the transaction. The *same time* for *each cache*.
> >> >>>>> Absolutely meaningless.
> >> >>>>>
> >> >>>>> Again, neither Ignite users nor Ignite developers can make any decision
> >> >>>>> based on these metrics. But users can create their own metrics set.
> >> >>>>>
> >> >>>>>> * histograms that measure the time of processing `get`, `put`, 
> >> >>>>>> `remove`, `commit`, `rollback` messages on affinity nodes(primary 
> >> >>>>>> and backups).
> >> >>>>>> Ticket doesn't exist for it.
> >> >>>>>
> >> >>>>> It will be implemented for most types of messages.
> >> >>>>>
> >> >>>>> Metrics, application monitoring, performance analysis and measurement
> >> >>>>> are a little harder than they sound. Therefore, we must approach this
> >> >>>>> issue more carefully.
> >> >>>>> Blindly adding new types of metrics will not only fail to improve the
> >> >>>>> situation, but will also worsen the overall performance of the system
> >> >>>>> because metric calculation is always on the hot path.
> >> >>>>>
> >> >>>>> So, from my point of view, commits for get/put/remove and
> >> >>>>> commit/rollback should be reverted.
> >> >>>>>
> >> >>>>> On Mon, Dec 16, 2019 at 5:39 PM Nikita Amelchev < 
> >> >>>>> nsamelc...@gmail.com > wrote:
> >> >>>>>>
> >> >>>>>> I think these metrics are useful.
> >> >>>>>>
> >> >>>>>> I have prepared PR [1] for commit and rollback histograms. [2]
> >> >>>>>> Nikolay, could you take a look, please?
> >> >>>>>>
> >> >>>>>> If you do not mind, I will try to add affinity-nodes cache metrics:
> >> >>>>>>>> * histograms that measure the time of processing `get`, `put`, 
> >> >>>>>>>> `remove`, `commit`, `rollback` messages on affinity nodes(primary 
> >> >>>>>>>> and backups). Ticket doesn't exist for it.
> >> >>>>>>
> >> >>>>>> I have filed a ticket for it. [3]
> >> >>>>>>
> >> >>>>>> [1]  https://github.com/apache/ignite/pull/7141
> >> >>>>>> [2]  https://issues.apache.org/jira/browse/IGNITE-12450
> >> >>>>>> [3]  https://issues.apache.org/jira/browse/IGNITE-12453
> >> >>>>>>
> >> >>>>>> Mon, Dec 16, 2019 at 11:07, Alexei Scherbakov < 
> >> >>>>>> alexey.scherbak...@gmail.com >:
> >> >>>>>>>
> >> >>>>>>> I think they are very useful.
> >> >>>>>>>
> >> >>>>>>> Mon, Dec 16, 2019 at 10:51, Николай Ижиков < nizhi...@apache.org 
> >> >>>>>>> >:
> >> >>>>>>>
> >> >>>>>>>> Hello, Alexei.
> >> >>>>>>>>
> >> >>>>>>>> Thanks for the link to the ticket; I labeled it with the IEP-35 
> >> >>>>>>>> label.
> >> >>>>>>>> What do you think about the proposed metrics set?
> >> >>>>>>>>
> >> >>>>>>>>> On Dec 16, 2019, at 10:29, Alexei Scherbakov <
> >> >>>>>>>>> alexey.scherbak...@gmail.com > wrote:
> >> >>>>>>>>>
> >> >>>>>>>>> Nikolay,
> >> >>>>>>>>>
> >> >>>>>>>>> What about batch operations?
> >> >>>>>>>>>
> >> >>>>>>>>> For message processing, the ticket does exist and even has an
> >> >>>>>>>>> implementation from before the new metrics API [1]
> >> >>>>>>>>>
> >> >>>>>>>>> [1]  https://issues.apache.org/jira/browse/IGNITE-10418
> >> >>>>>>>>>
> >> >>>>>>>>> Mon, Dec 16, 2019 at 10:12, Николай Ижиков < 
> >> >>>>>>>>> nizhi...@apache.org >:
> >> >>>>>>>>>
> >> >>>>>>>>>> Hello, Igniters.
> >> >>>>>>>>>>
> >> >>>>>>>>>> I want to give the user an answer to the following question: 
> >> >>>>>>>>>> "How do cache
> >> >>>>>>>>>> API operations perform?"
> >> >>>>>>>>>> It seems we need to implement metrics for basic cache API 
> >> >>>>>>>>>> operations
> >> >>>>>>>>>> like get, put, remove for it.
> >> >>>>>>>>>>
> >> >>>>>>>>>> I think we should provide the following metrics:
> >> >>>>>>>>>>
> >> >>>>>>>>>> * `get`, `put`, `remove` time histograms. Measured for API 
> >> >>>>>>>>>> calls on the
> >> >>>>>>>>>> caller node side.
> >> >>>>>>>>>> Implemented in [1], commit [2].
> >> >>>>>>>>>>
> >> >>>>>>>>>> * `commit`, `rollback` time histograms. Measured for API calls 
> >> >>>>>>>>>> on the
> >> >>>>>>>>>> caller node side [3].
> >> >>>>>>>>>>
> >> >>>>>>>>>> * histograms that measure the time of processing `get`, `put`, 
> >> >>>>>>>>>> `remove`,
> >> >>>>>>>>>> `commit`, `rollback` messages on affinity nodes(primary and 
> >> >>>>>>>>>> backups).
> >> >>>>>>>>>> Ticket doesn't exist for it.
> >> >>>>>>>>>>
> >> >>>>>>>>>> What do you think?
> >> >>>>>>>>>>
> >> >>>>>>>>>> [1]  https://issues.apache.org/jira/browse/IGNITE-12219
> >> >>>>>>>>>> [2]  https://github.com/apache/ignite/commit/e66bbef97b2cef73a533ce8a506ec479852cb364
> >> >>>>>>>>>> [3]  https://issues.apache.org/jira/browse/IGNITE-12450
> >> >>>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> --
> >> >>>>>>>>>
> >> >>>>>>>>> Best regards,
> >> >>>>>>>>> Alexei Scherbakov
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>
> >> >>>>>>> --
> >> >>>>>>>
> >> >>>>>>> Best regards,
> >> >>>>>>> Alexei Scherbakov
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> --
> >> >>>>>> Best wishes,
> >> >>>>>> Amelchev Nikita
> >> >>>>
> >> >>
> >>
>
>
>
>
