Sorry, forgot. [1] - org.apache.ignite.internal.processors.cache.CacheGroupsMetricsRebalanceTest#testCacheGroupRebalance
03.07.2020, 17:20, "ткаленко кирилл" <tkalkir...@yandex.ru>:
> Hi, Stan!
>
> I don't understand you yet.
>
> Now you can use the metrics as it was done in the test [1]. Or can you tell me where to do this, for example when rebalancing completes for all groups?
>
> See what is now available and added in the logs:
> 1) Which group is rebalanced and which type of rebalance.
> Starting rebalance routine [grp0, topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, fullPartitions=[4, 7], histPartitions=[], rebalanceId=1]
>
> 2) Completion of rebalancing from one of the suppliers.
> Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB, topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3, rebalanceId=1]
>
> 3) Completion of the entire rebalance.
> Completed rebalance chain: [rebalanceId=1, partitions=116, entries=400, duration=41ms, bytesRcvd=40,4 KB]
>
> These messages have a common parameter rebalanceId=1.
>
> 03.07.2020, 16:48, "Stanislav Lukyanov" <stanlukya...@gmail.com>:
>>> On 3 Jul 2020, at 09:51, ткаленко кирилл <tkalkir...@yandex.ru> wrote:
>>>
>>> To calculate the average value, you can use the existing metrics "RebalancingStartTime", "RebalancingLastCancelledTime", "RebalancingEndTime", "RebalancingPartitionsLeft", "RebalancingReceivedKeys" and "RebalancingReceivedBytes".
>>
>> You can calculate it, and I believe that this is the first thing anyone would do when reading these logs and metrics.
>> If it's an essential thing, then maybe it should be available out of the box?
>>
>>> This also works correctly with the historical rebalance.
>>> Now we can see the rebalance type for each group and for each supplier in the logs. I don't think we should duplicate this information.
>>>
>>> [2020-07-03 09:49:31,481][INFO ][sys-#160%rebalancing.RebalanceStatisticsTest2%][root] Starting rebalance routine [ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=0], supplier=a8be67b8-8ec7-4175-aa04-a59577100000, fullPartitions=[0, 2, 4, 6, 8], histPartitions=[], rebalanceId=1]
>>
>> I'm talking about adding info on how much data has been transferred during rebalance.
>> When rebalance completes I'd like to know how much data has been transferred, whether it was historical or full, and what the average rebalance speed was.
>>
>> There are two reasons for having all that.
>>
>> First, it helps to analyze issues by searching the logs and looking for anomalies.
>>
>> Second, it allows automating alerts: e.g. if you know your typical historical rebalance speed, you can trigger an alert if it drops below that.
>>
>>> 03.07.2020, 02:49, "Stanislav Lukyanov" <stanlukya...@gmail.com>:
>>>> Kirill,
>>>>
>>>> I've looked through the patch.
>>>> Looks good, but it feels like the first thing someone will try to do, given bytesRcvd and duration, is to divide one by the other to get an average speed.
>>>> Do you think it's reasonable to also add it to the logs? Maybe even to the metrics?
>>>>
>>>> Also, this works with historical rebalance, right? Can we specify how much data was transferred via historical or full rebalance from each supplier?
>>>> Maybe even estimate the transfer speed in entries and bytes for each rebalance type?
>>>>
>>>> Thanks,
>>>> Stan
>>>>
>>>>> On 29 Jun 2020, at 11:50, Ivan Rakov <ivan.glu...@gmail.com> wrote:
>>>>>
>>>>> +1 to Alex G.
>>>>>
>>>>> From my experience, the most interesting cases with Ignite rebalancing happen exactly in production. Given that we already have detailed rebalancing logging, adding info about rebalance performance looks like a reasonable improvement. With the new logs we'll be able to detect and investigate situations when rebalance is slow due to uneven supplier distribution or network issues.
>>>>> The option to disable the feature at runtime shouldn't be used often, but it will keep us on the safe side in case something goes wrong.
>>>>> The format described in https://issues.apache.org/jira/browse/IGNITE-12080 looks good to me.
>>>>>
>>>>> On Tue, Jun 23, 2020 at 7:01 PM ткаленко кирилл <tkalkir...@yandex.ru> wrote:
>>>>>
>>>>>> Hello, Alexey!
>>>>>>
>>>>>> Currently there is no way to disable / enable it, but it seems that the logs will not be overloaded, since Alexei Scherbakov's suggestion seems reasonable and compact. Of course, we could add disabling / enabling of statistics collection via JMX, for example.
>>>>>>
>>>>>> 23.06.2020, 18:47, "Alexey Goncharuk" <alexey.goncha...@gmail.com>:
>>>>>>> Hello Maxim, folks,
>>>>>>>
>>>>>>> Wed, 6 May 2020 at 21:01, Maxim Muzafarov <mmu...@apache.org>:
>>>>>>>
>>>>>>>> We won't do performance analysis on the production environment. Each time we need performance analysis it will be done on a test environment with verbose logging enabled. Thus I suggest moving these changes to a separate `profiling` module and extending the logging much more without any size limitations. The same as these [2] [3] activities do.
>>>>>>>
>>>>>>> I strongly disagree with this statement. I am not sure who is meant here by 'we', but I see strong momentum in increasing observability tooling that helps people understand what exactly happens in the production environment [1]. Not everybody can afford two identical environments for testing. We should make sure users have enough information to understand the root cause after an incident has happened, and not force them to reproduce it, let alone make them add another module to the classpath and restart the nodes.
>>>>>>> I think having this functionality in the core module with the ability to disable/enable it is the right approach. Having the information printed to the log is ok; having it in an event that can be sent to a monitoring/tracing subsystem is even better.
>>>>>>>
>>>>>>> Kirill, can we enable and disable this feature at runtime to avoid the very same node restarts?
>>>>>>>
>>>>>>> [1] https://www.honeycomb.io/blog/yes-i-test-in-production-and-so-do-you/
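
For illustration, a minimal sketch of the average-speed calculation discussed above. It assumes the cache group metrics named in the thread ("RebalancingStartTime", "RebalancingEndTime", "RebalancingReceivedBytes") are exported over JMX; the ObjectName pattern, the CacheGroupMetrics bean name and the TYPICAL_MB_PER_SEC threshold are placeholders for illustration, not the actual Ignite bean layout.

    // Sketch only: reads the three metrics named in the thread over JMX and derives
    // the average rebalance speed. ObjectName pattern and bean name are assumptions;
    // the real names depend on the Ignite version and the configured metric exporter.
    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    public class RebalanceSpeedSample {
        /** Hypothetical alert threshold: typical historical rebalance speed, MB/s. */
        private static final double TYPICAL_MB_PER_SEC = 50.0;

        public static void main(String[] args) throws Exception {
            MBeanServer srv = ManagementFactory.getPlatformMBeanServer();

            // Placeholder ObjectName pattern for the "grp0" cache group metrics registry.
            ObjectName pattern = new ObjectName(
                "org.apache:group=\"cacheGroups.grp0\",name=CacheGroupMetrics,*");

            for (ObjectName name : srv.queryNames(pattern, null)) {
                long start = (Long) srv.getAttribute(name, "RebalancingStartTime");
                long end = (Long) srv.getAttribute(name, "RebalancingEndTime");
                long bytes = (Long) srv.getAttribute(name, "RebalancingReceivedBytes");

                // Both timestamps must be set; otherwise rebalance hasn't finished yet.
                if (start > 0 && end > start) {
                    double mbPerSec = (bytes / (1024.0 * 1024.0)) / ((end - start) / 1000.0);

                    System.out.printf("grp0 average rebalance speed: %.2f MB/s%n", mbPerSec);

                    // The alerting idea from the thread: warn if speed drops below the usual level.
                    if (mbPerSec < TYPICAL_MB_PER_SEC)
                        System.out.println("WARNING: rebalance slower than typical for grp0");
                }
            }
        }
    }

Dividing RebalancingReceivedBytes by the wall-clock rebalance duration in this way is the number Stan suggests printing out of the box.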
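Likewise, a minimal sketch of the runtime enable/disable switch Kirill and Alexey discuss. The StatisticsToggleMXBean interface, the class and the ObjectName below are hypothetical and only show one possible shape of such a JMX switch; they are not part of the patch under review.

    // Sketch only: a JMX switch that rebalance statistics code could consult before logging.
    import java.lang.management.ManagementFactory;
    import java.util.concurrent.atomic.AtomicBoolean;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    public class RebalanceStatisticsToggle {
        /** Hypothetical MXBean interface for the runtime switch. */
        public interface StatisticsToggleMXBean {
            boolean isStatisticsEnabled();
            void setStatisticsEnabled(boolean on);
        }

        /** Illustrative implementation guarding rebalance statistics output. */
        public static class StatisticsToggle implements StatisticsToggleMXBean {
            /** Flag the (hypothetical) statistics code path would check before logging. */
            private final AtomicBoolean enabled = new AtomicBoolean(true);

            @Override public boolean isStatisticsEnabled() {
                return enabled.get();
            }

            @Override public void setStatisticsEnabled(boolean on) {
                enabled.set(on);
            }
        }

        /** Registers the switch so tools like JConsole or jmxterm can flip it at runtime. */
        public static StatisticsToggle register() throws Exception {
            StatisticsToggle toggle = new StatisticsToggle();
            MBeanServer srv = ManagementFactory.getPlatformMBeanServer();

            // Hypothetical ObjectName; a real implementation would follow Ignite's MBean naming.
            srv.registerMBean(toggle, new ObjectName(
                "org.apache.ignite:group=Rebalance,name=StatisticsToggle"));

            return toggle;
        }
    }

Flipping the attribute through JMX avoids the node restarts Alexey mentions, while keeping the default behavior (statistics on) unchanged.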