+1 to Anton's proposal.
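
To make the two metrics concrete, here is a minimal sketch in Java. It is not Ignite's actual implementation: the class name BlockingTracker and all of its method names are hypothetical, and timestamps are passed in explicitly so the behaviour is easy to check. It accumulates total blocked time when blocking ends (as Nikita proposes in the quoted thread) and exposes both the boolean flag and the current blocking duration, since the flag can be derived from the duration.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical tracker for cache-operation blocking; illustration only, not an Ignite API. */
public class BlockingTracker {
    /** Sum of all finished blocking periods, in milliseconds. */
    private final AtomicLong totalBlockedMs = new AtomicLong();

    /** Start of the current blocking period, or -1 when operations are not blocked. */
    private volatile long blockStartMs = -1;

    /** Marks the start of a blocking period. Assumes a single updating thread. */
    public void onBlockStart(long nowMs) {
        if (blockStartMs < 0)
            blockStartMs = nowMs;
    }

    /** Marks the end of a blocking period and adds its length to the total. */
    public void onBlockEnd(long nowMs) {
        long start = blockStartMs;
        if (start >= 0) {
            totalBlockedMs.addAndGet(nowMs - start);
            blockStartMs = -1;
        }
    }

    /** "Are we blocked right now": the boolean metric. */
    public boolean isBlocked() {
        return blockStartMs >= 0;
    }

    /** Duration of the current blocking period, or 0 when not blocked. */
    public long currentBlockingDurationMs(long nowMs) {
        long start = blockStartMs;
        return start < 0 ? 0 : nowMs - start;
    }

    /** Total blocked time; grows only when a blocking period ends. */
    public long totalBlockedMs() {
        return totalBlockedMs.get();
    }
}
```

With this shape, a monitoring system can graph totalBlockedMs() for SLA accounting and alert on isBlocked() or on a non-zero current duration.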
>Wednesday, 24 July 2019, 8:44 +03:00 from Anton Vinogradov <a...@apache.org>:
>
>Folks,
>
>It looks like we're trying to implement "extended debug" instead of
>"monitoring".
>A real admin should not care which phase of PME is in progress and so on.
>The metrics of interest are
>- total blocked time (will be used for real SLA counting)
>- are we blocked right now (shows we have an SLA degradation right now)
>The duration of the current blocking period can easily be derived by any
>modern monitoring tool through regular checks.
>The initial "true" will mean "period start"; precision will be a result of
>the check frequency.
>Anyway, I'm OK with presenting the current metric as a long, where the long
>is a duration. I see no reason for it, but OK :)
>
>All the other features you mentioned are useful for improving code or
>deployment and can (should) be taken from logs at the analysis phase.
>
>On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov <ivan.glu...@gmail.com> wrote:
>
>> Folks, let me step in.
>>
>> Nikita, thanks for your suggestions!
>>
>> > 1. initialVersion. Topology version that initiates the exchange.
>> > 2. initTime. Time PME was started.
>> > 3. initEvent. Event that triggered PME.
>> > 4. partitionReleaseTime. Time when a node has finished waiting for all
>> > updates and translations on a previous topology.
>> > 5. sendSingleMessageTime. Time when a node sent a single message.
>> > 6. recieveFullMessageTime. Time when a node received a full message.
>> > 7. finishTime. Time PME was ended.
>> >
>> > When a new PME starts, all these metrics are reset.
>> Every metric from Nikita's list looks useful and simple to implement.
>> I think that it would be better to change the format of metrics 4, 5, 6 and
>> 7 a bit: we can keep only the difference between the time of the previous
>> event and the time of the corresponding event. Such metrics would be easier
>> to perceive: they answer specific questions such as
>> "how much time did partition release take?"
>> or "how much time did awaiting of distributed phase end take?".
>> Also, if the results of 4, 5, 6, 7 are exported to a monitoring system,
>> graphs will show how the times of different stages change from one PME to
>> another.
>>
>> > When PME causes no blocking, it's a good PME and I see no reason to have
>> > monitoring related to it
>> Agree with Anton here. These metrics should be measured only for a true
>> distributed exchange. Saving results for client leave/join PMEs will
>> just complicate monitoring.
>>
>> > I agree with total blocking duration metric but
>> > I still don't understand why the instant value indicating that operations
>> > are blocked should be boolean.
>> > Duration since blocking started looks more appropriate and useful.
>> > It gives more information while the semantics stay the same.
>> Totally agree with Pavel here. Both "accumulated block time" and
>> "current PME block time" metrics are useful. Growth of the accumulated
>> metric over a specific period of time (should be easy to check via a
>> monitoring system graph) will show for how long business operations were
>> blocked in total, and a non-zero current metric will show that we are
>> experiencing issues right now. The boolean metric "are we blocked right now"
>> is not needed, as it can obviously be inferred from "current PME block
>> time".
>>
>> Best Regards,
>> Ivan Rakov
>>
>> On 23.07.2019 16:02, Pavel Kovalenko wrote:
>> > Nikita,
>> >
>> > I agree with total blocking duration metric but
>> > I still don't understand why the instant value indicating that operations
>> > are blocked should be boolean.
>> > Duration since blocking started looks more appropriate and useful.
>> > It gives more information while the semantics stay the same.
>> >
>> >
>> >
>> > Tue, 23 Jul 2019 at 11:42, Nikita Amelchev <nsamelc...@gmail.com>:
>> >
>> >> Folks,
>> >>
>> >> All previous suggestions have some disadvantages.
>> >> There can be several exchanges between two metric updates, and a fast
>> >> exchange can overwrite a previous long exchange.
>> >>
>> >> We can introduce a metric of total blocking duration that will
>> >> accumulate at the end of the exchange. So, users will get actual
>> >> information about how long operations were blocked. The cluster metric
>> >> will be the maximum of the local node metrics. And we need a boolean
>> >> metric that will indicate realtime status. It is needed because the
>> >> duration metric is updated only at the end of the exchange.
>> >>
>> >> So I propose to change the current metric, which is not released yet, to
>> >> the totalCacheOperationsBlockingDuration metric and to add the
>> >> isCacheOperationsBlocked metric.
>> >>
>> >> WDYT?
>> >>
>> >> Mon, 22 Jul 2019 at 09:27, Anton Vinogradov <a...@apache.org>:
>> >>> Nikolay,
>> >>>
>> >>> Still see no reason to replace boolean with long.
>> >>>
>> >>> On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov <nizhi...@apache.org>
>> >>> wrote:
>> >>>> Anton.
>> >>>>
>> >>>> 1. The value is exported based on SPI settings, not at the moment it
>> >>>> changed.
>> >>>>
>> >>>> 2. Clock synchronisation - if we export start time, we should also
>> >>>> export the node local timestamp.
>> >>>>
>> >>>> Mon, 22 Jul 2019, 8:33 Anton Vinogradov <a...@apache.org>:
>> >>>>
>> >>>>> Folks,
>> >>>>>
>> >>>>> What's the reason for duration counting?
>> >>>>> AFAIU, it's a monitoring system feature to count the durations.
>> >>>>> Since the monitoring system checks metrics periodically, it will know
>> >>>>> the duration from its own log.
>> >>>>>
>> >>>>> On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko <jokse...@gmail.com>
>> >>>>> wrote:
>> >>>>>
>> >>>>>> Nikita,
>> >>>>>>
>> >>>>>> Yes, I mean duration not timestamp. For the metric name, I suggest
>> >>>>>> "cacheOperationsBlockingDuration", I think it more clearly represents
>> >>>>>> what is blocked during PME.
>> >>>>>> We can also combine both the timestamp "cacheOperationsBlockingStartTs"
>> >>>>>> and the duration to have better correlation of when cache operations
>> >>>>>> were blocked and how much time it took.
>> >>>>>> For instant view (like in JMX bean) a calculated value as you mentioned
>> >>>>>> can be used.
>> >>>>>> For metrics that are exported to some backend (IEP-35) a counter can be
>> >>>>>> used.
>> >>>>>> The counter is incremented by blocking time after blocking has ended.
>> >>>>>>
>> >>>>>> Fri, 19 Jul 2019 at 19:10, Nikita Amelchev <nsamelc...@gmail.com>:
>> >>>>>>> Pavel,
>> >>>>>>>
>> >>>>>>> The main purpose of this metric is
>> >>>>>>>>> how much time we wait for resuming cache operations
>> >>>>>>> Seems I misunderstood you. Do you mean timestamp or duration here?
>> >>>>>>>>> What do you think if we change the boolean value of metric to a long
>> >>>>>>>>> value that represents time in milliseconds when operations were
>> >>>>>>>>> blocked?
>> >>>>>>> This time can be calculated as (currentTime -
>> >>>>>>> timeSinceOperationsBlocked) in case of timestamp.
>> >>>>>>>
>> >>>>>>> Duration will be more understandable. It'll be something like
>> >>>>>>> getCurrentBlockingPmeDuration. But I haven't come up with a better
>> >>>>>>> name yet.
>> >>>>>>>
>> >>>>>>> Fri, 19 Jul 2019 at 18:30, Pavel Kovalenko <jokse...@gmail.com>:
>> >>>>>>>> Nikita,
>> >>>>>>>>
>> >>>>>>>> I think getCurrentPmeDuration doesn't show useful information. The
>> >>>>>>>> main PME side effect for end-users is blocking cache operations. Not
>> >>>>>>>> all PME time blocks them.
>> >>>>>>>> What information does the "timeSinceOperationsBlocked" timestamp give
>> >>>>>>>> to an end-user? For what analysis can it be used and how?
>> >>>>>>>>
>> >>>>>>>> Fri, 19 Jul 2019 at 17:48, Nikita Amelchev <nsamelc...@gmail.com>:
>> >>>>>>>>> Hi Pavel,
>> >>>>>>>>>
>> >>>>>>>>> This time already can be obtained from the getCurrentPmeDuration and
>> >>>>>>>>> new isOperationsBlockedByPme metrics.
>> >>>>>>>>>
>> >>>>>>>>> As an alternative solution, I can rework the recently added
>> >>>>>>>>> getCurrentPmeDuration metric (not released yet). It seems useless
>> >>>>>>>>> for users in case of non-blocking PME.
>> >>>>>>>>> Let's name it timeSinceOperationsBlocked. It'll be the timestamp when
>> >>>>>>>>> blocking started (minimal value of cluster nodes) and 0 if blocking
>> >>>>>>>>> ends (there is no running PME).
>> >>>>>>>>>
>> >>>>>>>>> WDYT?
>> >>>>>>>>>
>> >>>>>>>>> Fri, 19 Jul 2019 at 15:56, Pavel Kovalenko <jokse...@gmail.com>:
>> >>>>>>>>>> Hi Nikita,
>> >>>>>>>>>>
>> >>>>>>>>>> Thank you for working on this. What do you think if we change the
>> >>>>>>>>>> boolean value of metric to a long value that represents time in
>> >>>>>>>>>> milliseconds when operations were blocked?
>> >>>>>>>>>> Since we have not only JMX and now metrics are periodically exported
>> >>>>>>>>>> to some backend, it can give a clearer picture of how much time we
>> >>>>>>>>>> wait for resuming cache operations instead of an instant boolean
>> >>>>>>>>>> indicator.
>> >>>>>>>>>>
>> >>>>>>>>>> Fri, 19 Jul 2019 at 14:41, Nikita Amelchev <nsamelc...@gmail.com>:
>> >>>>>>>>>>> Anton, Nikolay,
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thanks for the support.
>> >>>>>>>>>>>
>> >>>>>>>>>>> For now, we have the getCurrentPmeDuration() metric that does not
>> >>>>>>>>>>> show influence on the cluster correctly. PME can be without blocking
>> >>>>>>>>>>> operations. For example, client node join/leave events.
>> >>>>>>>>>>>
>> >>>>>>>>>>> I suggest adding a new metric - isOperationsBlockedByPme().
>> >>>>>>>>>>> Together, these metrics will show the influence of the PME on
>> >>>>>>>>>>> cluster and user operations.
>> >>>>>>>>>>> I have prepared a PR for this (Bot visa is green). [1] Can anyone
>> >>>>>>>>>>> take a look?
>> >>>>>>>>>>>
>> >>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-11961
>> >>>>>>>>>>>
>> >>>>>>>>>>> Tue, 16 Jul 2019 at 14:58, Nikolay Izhikov <nizhi...@apache.org>:
>> >>>>>>>>>>>> I think the administrator of an Ignite cluster should be able to
>> >>>>>>>>>>>> monitor all Ignite processes, including non-blocking PME.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
>> >>>>>>>>>>>>> BTW,
>> >>>>>>>>>>>>> Found PME metric - getCurrentPmeDuration().
>> >>>>>>>>>>>>> Seems it shows exactly PME time and is not so useful because of
>> >>>>>>>>>>>>> this.
>> >>>>>>>>>>>>> The goal is to show exactly the blocking period.
>> >>>>>>>>>>>>> When PME causes no blocking, it's a good PME and I see no reason
>> >>>>>>>>>>>>> to have monitoring related to it :)
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov
>> >>>>>>>>>>>>> <nizhi...@apache.org> wrote:
>> >>>>>>>>>>>>>> Anton.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Why do we need to postpone implementation of these metrics?
>> >>>>>>>>>>>>>> For now, implementation of a new metric is very simple.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> I think we can implement these metrics as a single contribution.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
>> >>>>>>>>>>>>>>> Nikita,
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Looks like all we need now is one simple metric: are operations
>> >>>>>>>>>>>>>>> blocked?
>> >>>>>>>>>>>>>>> Just a true or false.
>> >>>>>>>>>>>>>>> Let's start from this.
>> >>>>>>>>>>>>>>> All other metrics can be extracted from logs now and can be
>> >>>>>>>>>>>>>>> implemented later.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov
>> >>>>>>>>>>>>>>> <nizhi...@apache.org> wrote:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> +1.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Nikita, please, go ahead.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Tue, 16 Jul 2019, 11:45 Nikita Amelchev <nsamelc...@gmail.com>:
>> >>>>>>>>>>>>>>>>> Hello, Igniters.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> I suggest adding some useful metrics about the partition map
>> >>>>>>>>>>>>>>>>> exchange (PME). For now, the duration of PME stages is available
>> >>>>>>>>>>>>>>>>> only in log files and cannot be obtained using JMX or other
>> >>>>>>>>>>>>>>>>> external tools. [1]
>> >>>>>>>>>>>>>>>>> I made the list of local node metrics that help to understand
>> >>>>>>>>>>>>>>>>> the actual status of the current PME:
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> 1. initialVersion. Topology version that initiates the exchange.
>> >>>>>>>>>>>>>>>>> 2. initTime. Time PME was started.
>> >>>>>>>>>>>>>>>>> 3. initEvent. Event that triggered PME.
>> >>>>>>>>>>>>>>>>> 4. partitionReleaseTime. Time when a node has finished waiting
>> >>>>>>>>>>>>>>>>> for all updates and translations on a previous topology.
>> >>>>>>>>>>>>>>>>> 5. sendSingleMessageTime. Time when a node sent a single message.
>> >>>>>>>>>>>>>>>>> 6. recieveFullMessageTime. Time when a node received a full
>> >>>>>>>>>>>>>>>>> message.
>> >>>>>>>>>>>>>>>>> 7. finishTime. Time PME was ended.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> When a new PME starts, all these metrics are reset.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> These metrics help to understand:
>> >>>>>>>>>>>>>>>>> - how long PME was (current or previous).
>> >>>>>>>>>>>>>>>>> - how long the wait for all updates to complete was.
>> >>>>>>>>>>>>>>>>> - what node blocks PME (didn't send a single message)
>> >>>>>>>>>>>>>>>>> - what triggered PME.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Thoughts?
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-11961
>> >>>>>>>>>>>>>>>>> --
>> >>>>>>>>>>>>>>>>> Best wishes,
>> >>>>>>>>>>>>>>>>> Amelchev Nikita
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> --
>> >>>>>>>>>>> Best wishes,
>> >>>>>>>>>>> Amelchev Nikita
>> >>>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> --
>> >>>>>>>>> Best wishes,
>> >>>>>>>>> Amelchev Nikita
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>> Best wishes,
>> >>>>>>> Amelchev Nikita
>> >>>>>>>
>> >>
>> >>
>> >> --
>> >> Best wishes,
>> >> Amelchev Nikita
>> >>
>>

--
Zhenya Stanilovsky
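
The stage timestamps from the original proposal, reported as per-stage durations along the lines Ivan suggests in the quoted thread (difference between each event and the previous one), could be sketched as follows. This is only an illustration: the class PmeStageDurations, the method toDurations and the stage keys are hypothetical names, not Ignite APIs, and the input map is assumed to be in stage order.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that turns absolute PME stage timestamps into durations. */
public class PmeStageDurations {
    /**
     * @param initTime        time the PME started, in milliseconds
     * @param stageTimestamps stage name to completion time, in stage order
     * @return stage name to time spent since the previous stage, in milliseconds
     */
    public static Map<String, Long> toDurations(long initTime, Map<String, Long> stageTimestamps) {
        Map<String, Long> durations = new LinkedHashMap<>();
        long prev = initTime;

        for (Map.Entry<String, Long> e : stageTimestamps.entrySet()) {
            // Each duration is measured from the previous event, not from PME start.
            durations.put(e.getKey(), e.getValue() - prev);
            prev = e.getValue();
        }

        return durations;
    }

    public static void main(String[] args) {
        Map<String, Long> ts = new LinkedHashMap<>();
        ts.put("partitionRelease", 1200L);
        ts.put("sendSingleMessage", 1250L);
        ts.put("receiveFullMessage", 1900L);
        ts.put("finish", 2000L);

        System.out.println(toDurations(1000L, ts));
    }
}
```

Exported this way, each value answers one question ("how long did partition release take?") and can be graphed from one PME to the next.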