> but between to have something and have nothing i choose - something

We already have "something": put, get, etc. metrics. As I said earlier,
they are relatively useless, and the same metrics with histograms don't
add any value.

> i found 1 grid machine with very different io usage than others, «dig deeper»
> highlight cache with very different from other nodes cache put operations and
> final «dig deeper» help to found code bug

I believe the same could be noticed using PK index stats.

> if new one would be more useful - why not ?

If some particular value is relatively useless, then a histogram of the
same value will still be relatively useless :)

That's my point. Stop adding dozens of metrics and start thinking about
their benefits and meaning. Discuss it with the community.

On Fri, Dec 20, 2019 at 4:59 PM Zhenya Stanilovsky
<arzamas...@mail.ru.invalid> wrote:
>
> >> Is it become slower or faster?
> >
> >Very correct question! User saw "cache put time" metric becomes x2
> >bigger. Does it become slower or faster? Or we just put into the cache
> >values that 4x bigger in size? Or all time before we put values
> >locally and now we put values on remote nodes. Or our operations
> >execute in transaction and then time will depend on transaction type,
> >actions in transaction and other transaction and actually will nothing
> >talk about real cache operation. We have more questions than answers.
>
> Andrey, i hope i understand your point of view here, but between to have
> something and have nothing i choose - something, it sometimes really helpful.
> From real life case: i found 1 grid machine with very different io usage than
> others, «dig deeper» highlight cache with very different from other nodes
> cache put operations and final «dig deeper» help to found code bug, but to be
> clear - old mechanism works ok for me here, if new one would be more useful -
> why not ?
>
> >> On the other hand - if `PuTime` increased - then we know for sure, all
> >> operation executing `put` becomes slower.
> >
> >Of course not :) See above.
> >
> >On Fri, Dec 20, 2019 at 3:20 PM Николай Ижиков < nizhi...@apache.org > wrote:
> >>
> >> > It also will be visible on other metrics
> >>
> >> How will it be visible?
> >>
> >> For example, the user saw «checkpoint time» metric becomes x2 bigger.
> >> How it relates to business operations? Is it become slower or faster?
> >> What does it mean for an application performance?
> >>
> >> On the other hand - if `PuTime` increased - then we know for sure, all
> >> operation executing `put` becomes slower.
> >>
> >> *Why* it’s become slower - is the essence of «go deeper» investigation.
> >>
> >> > On Dec 20, 2019, at 15:07, Andrey Gura < ag...@apache.org > wrote:
> >> >
> >> >> If a cache has some percent of the relatively slow transaction this is
> >> >> a trigger to make a deeper investigation.
> >> >
> >> > It also will be visible on other metrics. So cache operations metrics
> >> > still useless because it transitive values.
> >> >
> >> >>> 1. Measure some important internals (WAL operations, checkpoint time,
> >> >>> etc) because it can talk about real problems.
> >> >
> >> >> We already implement it.
> >> >
> >> > I don't talk that it isn't implemented. It is just example of things
> >> > that should be measured. All other metrics depends on internals.
> >> >
> >> >>> 2. Measure business operations in user context, not cache API
> >> >>> operations.
> >> >
> >> >> Why do you think these approaches should exclude one another?
> >> >
> >> > Because one of them is useless.
> >> >
> >> > On Fri, Dec 20, 2019 at 1:43 PM Николай Ижиков < nizhi...@apache.org >
> >> > wrote:
> >> >>
> >> >> Hello, Andrey.
> >> >>
> >> >>> Where the sense in this value? I explained why this metrics are
> >> >>> relatively useless.
> >> >>
> >> >> I don’t agree with you.
> >> >> I believe they are not useless for a user.
> >> >> And I try to explain why I think so.
> >> >>
> >> >>> But user can't distinguish one transaction from another, so his
> >> >>> knowledge doesn't make sense definitely.
> >> >>
> >> >> Users shouldn’t distinguish.
> >> >> If a cache has some percent of the relatively slow transaction this is
> >> >> a trigger to make a deeper investigation.
> >> >>
> >> >>> 1. Measure some important internals (WAL operations, checkpoint time,
> >> >>> etc) because it can talk about real problems.
> >> >>
> >> >> We already implement it.
> >> >> What metrics are missing for internal processes?
> >> >>
> >> >>> 2. Measure business operations in user context, not cache API
> >> >>> operations.
> >> >>
> >> >> Why do you think these approaches should exclude one another?
> >> >> Users definitely should measure whole business transaction performance.
> >> >>
> >> >> I think we should provide a way to measure part of the business
> >> >> transaction that relates to the Ignite.
> >> >>
> >> >>
> >> >>> On Dec 20, 2019, at 13:02, Andrey Gura < ag...@apache.org > wrote:
> >> >>>
> >> >>>> The goal of the proposed metrics is to measure whole cache operations
> >> >>>> behavior.
> >> >>>> It provides some kind of statistics(histograms) for it.
> >> >>>
> >> >>> Nikolay, reformulating doesn't make metrics more meaningful. Seriously
> >> >>> :)
> >> >>>
> >> >>>> Yes, metrics will evaluate API call performance
> >> >>>
> >> >>> And what? Where the sense in this value? I explained why this metrics
> >> >>> are relatively useless.
> >> >>>
> >> >>>> These are metrics of client-side operation performance.
> >> >>>
> >> >>> Again. It's just a number without any sense.
> >> >>>
> >> >>>> I think a specific user has knowledge - what are his transactions.
> >> >>>
> >> >>> May be. But user can't distinguish one transaction from another, so
> >> >>> his knowledge doesn't make sense definitely.
> >> >>>
> >> >>>> From these metrics it can answer on the question «If my transaction
> >> >>>> includes cacheXXX, how long it usually takes?»
> >> >>>
> >> >>> Actually not.
> >> >>> The same caches can be involved in a dozen of
> >> >>> transactions and there are no ways to understand what transactions are
> >> >>> slow or fast. It is useless.
> >> >>>
> >> >>>> I disagree here.
> >> >>>> If you have a better approach to measure cache operations performance
> >> >>>> - please, share your vision.
> >> >>>
> >> >>> I already wrote about better approach. Two main points:
> >> >>>
> >> >>> 1. Measure some important internals (WAL operations, checkpoint time,
> >> >>> etc) because it can talk about real problems.
> >> >>> 2. Measure business operations in user context, not cache API
> >> >>> operations.
> >> >>>
> >> >>> So what we have? We have useless metrics that are doubled by useless
> >> >>> histograms.
> >> >>>
> >> >>> We should reconsider approach to metrics and performance measuring. It
> >> >>> is hard and long task. There are no need to commit tons of useless
> >> >>> metrics that just decrease performance.
> >> >>>
> >> >>> Sorry for some sarcasm but I really believe in my opinion. Metrics
> >> >>> problem exists very very long time and existing metrics discussed many
> >> >>> times. No one can explain this metrics to users because it requires
> >> >>> too many additional knowledge about internals. And metric value
> >> >>> itself depends on many aspects of internals. It leads to impossibility
> >> >>> of interpretation. And it's good time to remove it (in AI 3.0 due to a
> >> >>> backward compatibility).
> >> >>>
> >> >>> On Thu, Dec 19, 2019 at 9:09 PM Николай Ижиков <
> >> >>> nizhikov....@gmail.com > wrote:
> >> >>>>
> >> >>>> Hello, Andrey.
> >> >>>>
> >> >>>> The goal of the proposed metrics is to measure whole cache operations
> >> >>>> behavior.
> >> >>>> It provides some kind of statistics(histograms) for it.
> >> >>>> For more fine-grained analysis one will be use tracing or other «go
> >> >>>> deeper» tools.
> >> >>>>
> >> >>>>>> Measured for API calls on the caller node side
> >> >>>>> Values will the same only for cases when node is remote relative to
> >> >>>>> data
> >> >>>>
> >> >>>> Yes, metrics will evaluate API call performance.
> >> >>>> I think this is the most valuable information from a user's point of
> >> >>>> view.
> >> >>>>
> >> >>>> Regular user wants to know how fast his cache operation performs.
> >> >>>> And these metrics provide the answer.
> >> >>>>
> >> >>>>> For regular data node (server node) timing will depend on answers
> >> >>>>> for question:
> >> >>>>
> >> >>>> I think these answers are always available.
> >> >>>> I barely can imagine a scenario when one monitor «black box» cluster
> >> >>>> and don’t know it.
> >> >>>> Even so, all answers are provided through system view we brought to
> >> >>>> the Ignite :)
> >> >>>>
> >> >>>>> What is transaction commit or rollback time?
> >> >>>>
> >> >>>> These are metrics of client-side operation performance.
> >> >>>>
> >> >>>> I think a specific user has knowledge - what are his transactions.
> >> >>>> From these metrics it can answer on the question «If my transaction
> >> >>>> includes cacheXXX, how long it usually takes?»
> >> >>>> I think it’s very valuable knowledge.
> >> >>>>
> >> >>>>> It will be implemented for most types of messages.
> >> >>>>
> >> >>>> Good, let’s do it?
> >> >>>>
> >> >>>>> So, from my point of view, commits for get/put/remove and
> >> >>>>> commit/rollback should be reverted.
> >> >>>>
> >> >>>> I disagree here.
> >> >>>> If you have a better approach to measure cache operations performance
> >> >>>> - please, share your vision.
> >> >>>>
> >> >>>>> On Dec 19, 2019, at 16:03, Andrey Gura < ag...@apache.org >
> >> >>>>> wrote:
> >> >>>>>
> >> >>>>> From my point of view, Ignite should provide meaningful metrics for
> >> >>>>> internal components that could be useful for monitoring and analysis.
> >> >>>>> All suggested options are meaningless in a sense.
> >> >>>>> Below I'll try explain why.
> >> >>>>>
> >> >>>>>> * `get`, `put`, `remove` time histograms. Measured for API calls on
> >> >>>>>> the caller node side.
> >> >>>>>> Implemented in [1], commit [2].
> >> >>>>>
> >> >>>>> All cache operations in Ignite are distributed. So each value
> >> >>>>> measured for some cache operation will vary depending on where
> >> >>>>> actually operation is performed. Values will the same only for cases
> >> >>>>> when node is remote relative to data (e.g. client node).
> >> >>>>>
> >> >>>>> For regular data node (server node) timing will depend on answers
> >> >>>>> for question:
> >> >>>>>
> >> >>>>> - is node primary for particular key or not? (for all operations)
> >> >>>>> - how many backups configured for the cache? (for put and remove)
> >> >>>>> - what write synchronization mode is configured for particular cache?
> >> >>>>>   (for put and remove)
> >> >>>>> - is readFromBackup enabled for the cache? (for get)
> >> >>>>>
> >> >>>>> Both Ignite users and Ignite developers can't make any decision based
> >> >>>>> on this metrics.
> >> >>>>>
> >> >>>>>> * `commit`, `rollback` time histograms. Measured for API calls on
> >> >>>>>> the caller node side [3].
> >> >>>>>
> >> >>>>> What is transaction commit or rollback time? How it calculates in
> >> >>>>> Ignite now? What actions included into transaction? What actions not
> >> >>>>> related with cache executed during transactions?
> >> >>>>>
> >> >>>>> There is no any sense in time of transaction commit or rollback
> >> >>>>> because there are no any way to understand what transaction was
> >> >>>>> performed in particular period of time. Usually a lot of transactions
> >> >>>>> and we can't to distinguish from each other.
> >> >>>>>
> >> >>>>> Moreover, transaction usually treats as business operation. So only
> >> >>>>> way to measure performance properly is measure business operation
> >> >>>>> time.
> >> >>>>> That is user should create own metrics set for some business
> >> >>>>> API.
> >> >>>>>
> >> >>>>> Further. What about cross cache transactions? At the moment tx
> >> >>>>> commit/rollback time will be added to corresponding metrics per each
> >> >>>>> cache involved in the transaction. The *same time* for *each cache*.
> >> >>>>> Absolutely meaningless.
> >> >>>>>
> >> >>>>> Again, both Ignite users and Ignite developers can't make any
> >> >>>>> decision based on this metrics. But users can create own metrics set.
> >> >>>>>
> >> >>>>>> * histograms that measure the time of processing `get`, `put`,
> >> >>>>>> `remove`, `commit`, `rollback` messages on affinity nodes(primary
> >> >>>>>> and backups).
> >> >>>>>> Ticket doesn't exist for it.
> >> >>>>>
> >> >>>>> It will be implemented for most types of messages.
> >> >>>>>
> >> >>>>> Metrics, application monitoring, performance analysis and measurement
> >> >>>>> are a little harder than it sounds. Therefore, we must approach
> >> >>>>> this issue more carefully.
> >> >>>>> Blindly adding new types of metrics will not only not improve the
> >> >>>>> situation, but will also worsen the overall performance of the system
> >> >>>>> because metric calculation always on the hot path.
> >> >>>>>
> >> >>>>> So, from my point of view, commits for get/put/remove and
> >> >>>>> commit/rollback should be reverted.
> >> >>>>>
> >> >>>>> On Mon, Dec 16, 2019 at 5:39 PM Nikita Amelchev <
> >> >>>>> nsamelc...@gmail.com > wrote:
> >> >>>>>>
> >> >>>>>> I think these metrics are useful.
> >> >>>>>>
> >> >>>>>> I have prepared PR [1] for commit and rollback histograms. [2]
> >> >>>>>> Nikolay, could you take a look, please?
> >> >>>>>>
> >> >>>>>> If you do not mind, I will try to add affinity-nodes cache metrics:
> >> >>>>>>>> * histograms that measure the time of processing `get`, `put`,
> >> >>>>>>>> `remove`, `commit`, `rollback` messages on affinity nodes(primary
> >> >>>>>>>> and backups).
> >> >>>>>>>> Ticket doesn't exist for it.
> >> >>>>>>
> >> >>>>>> I have filed a ticket for it. [3]
> >> >>>>>>
> >> >>>>>> [1] https://github.com/apache/ignite/pull/7141
> >> >>>>>> [2] https://issues.apache.org/jira/browse/IGNITE-12450
> >> >>>>>> [3] https://issues.apache.org/jira/browse/IGNITE-12453
> >> >>>>>>
> >> >>>>>> On Mon, Dec 16, 2019 at 11:07, Alexei Scherbakov <
> >> >>>>>> alexey.scherbak...@gmail.com > wrote:
> >> >>>>>>>
> >> >>>>>>> I think they are very useful.
> >> >>>>>>>
> >> >>>>>>> On Mon, Dec 16, 2019 at 10:51, Николай Ижиков < nizhi...@apache.org
> >> >>>>>>> > wrote:
> >> >>>>>>>
> >> >>>>>>>> Hello, Alexei.
> >> >>>>>>>>
> >> >>>>>>>> Thanks for the link on the ticket, labeled it with the IEP-35
> >> >>>>>>>> label.
> >> >>>>>>>> What do you think about proposed metrics set?
> >> >>>>>>>>
> >> >>>>>>>>> On Dec 16, 2019, at 10:29, Alexei Scherbakov <
> >> >>>>>>>>> alexey.scherbak...@gmail.com > wrote:
> >> >>>>>>>>>
> >> >>>>>>>>> Nikolay,
> >> >>>>>>>>>
> >> >>>>>>>>> What about batch operations?
> >> >>>>>>>>>
> >> >>>>>>>>> For messages processing the ticket does exist and even has an
> >> >>>>>>>>> implementation from before new metrics API times [1]
> >> >>>>>>>>>
> >> >>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-10418
> >> >>>>>>>>>
> >> >>>>>>>>> On Mon, Dec 16, 2019 at 10:12, Николай Ижиков <
> >> >>>>>>>>> nizhi...@apache.org > wrote:
> >> >>>>>>>>>
> >> >>>>>>>>>> Hello, Igniters.
> >> >>>>>>>>>>
> >> >>>>>>>>>> I want to provide the user answers to the following question:
> >> >>>>>>>>>> "How cache API operations perform?"
> >> >>>>>>>>>> It seems, we need to implements metrics for basic cache API
> >> >>>>>>>>>> operations like get, put, remove for it.
> >> >>>>>>>>>>
> >> >>>>>>>>>> I think we should provide the following metrics:
> >> >>>>>>>>>>
> >> >>>>>>>>>> * `get`, `put`, `remove` time histograms. Measured for API
> >> >>>>>>>>>> calls on the caller node side.
> >> >>>>>>>>>> Implemented in [1], commit [2].
> >> >>>>>>>>>>
> >> >>>>>>>>>> * `commit`, `rollback` time histograms. Measured for API calls
> >> >>>>>>>>>> on the caller node side [3].
> >> >>>>>>>>>>
> >> >>>>>>>>>> * histograms that measure the time of processing `get`, `put`,
> >> >>>>>>>>>> `remove`, `commit`, `rollback` messages on affinity
> >> >>>>>>>>>> nodes(primary and backups).
> >> >>>>>>>>>> Ticket doesn't exist for it.
> >> >>>>>>>>>>
> >> >>>>>>>>>> What do you think?
> >> >>>>>>>>>>
> >> >>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-12219
> >> >>>>>>>>>> [2] https://github.com/apache/ignite/commit/e66bbef97b2cef73a533ce8a506ec479852cb364
> >> >>>>>>>>>> [3] https://issues.apache.org/jira/browse/IGNITE-12450
> >> >>>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> --
> >> >>>>>>>>>
> >> >>>>>>>>> Best regards,
> >> >>>>>>>>> Alexei Scherbakov
> >> >>>>>>>>
> >> >>>>>>>
> >> >>>>>>> --
> >> >>>>>>>
> >> >>>>>>> Best regards,
> >> >>>>>>> Alexei Scherbakov
> >> >>>>>>
> >> >>>>>> --
> >> >>>>>> Best wishes,
> >> >>>>>> Amelchev Nikita