Hi, Stan! I'm not sure I understand you yet.

You can already use the metrics the way it was done in the test [1]. Or can you tell me where this should be done instead, for example when rebalancing completes for all groups?

Here is what is now available and added in the logs:

1) Which group is being rebalanced and which type of rebalance is used:

Starting rebalance routine [grp0, topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, fullPartitions=[4, 7], histPartitions=[], rebalanceId=1]

2) Completion of rebalancing from one of the suppliers:

Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB, topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3, rebalanceId=1]

3) Completion of the entire rebalance:

Completed rebalance chain: [rebalanceId=1, partitions=116, entries=400, duration=41ms, bytesRcvd=40,4 KB]

These messages share a common parameter, rebalanceId=1.
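For context, here is a minimal sketch of the calculation under discussion: deriving the average rebalance speed from the existing metrics Kirill names in the quoted message below ("RebalancingStartTime", "RebalancingEndTime", "RebalancingReceivedBytes"), read over JMX. The ObjectName is a placeholder, not the real bean name; the actual MBean group and name depend on the Ignite version and cache configuration.

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class AvgRebalanceSpeed {
    public static void main(String[] args) throws Exception {
        MBeanServer srv = ManagementFactory.getPlatformMBeanServer();

        // Placeholder ObjectName: look up the actual cache metrics MBean for your setup.
        ObjectName cacheMetrics = new ObjectName("org.apache:group=myCache,name=CacheMetrics");

        long startMs = (Long) srv.getAttribute(cacheMetrics, "RebalancingStartTime");
        long endMs = (Long) srv.getAttribute(cacheMetrics, "RebalancingEndTime");
        long bytesRcvd = (Long) srv.getAttribute(cacheMetrics, "RebalancingReceivedBytes");

        long durationMs = endMs - startMs;

        if (durationMs > 0) {
            // Average throughput over the whole rebalance, in MB/s.
            double mbPerSec = (bytesRcvd / (1024.0 * 1024.0)) / (durationMs / 1000.0);
            System.out.printf("Average rebalance speed: %.2f MB/s%n", mbPerSec);
        }
    }
}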
03.07.2020, 16:48, "Stanislav Lukyanov" <stanlukya...@gmail.com>:
>> On 3 Jul 2020, at 09:51, ткаленко кирилл <tkalkir...@yandex.ru> wrote:
>>
>> To calculate the average value, you can use the existing metrics
>> "RebalancingStartTime", "RebalancingLastCancelledTime",
>> "RebalancingEndTime", "RebalancingPartitionsLeft", "RebalancingReceivedKeys"
>> and "RebalancingReceivedBytes".
>
> You can calculate it, and I believe that this is the first thing anyone would
> do when reading these logs and metrics.
> If that's an essential thing, then maybe it should be available out of the box?
>
>> This also works correctly with the historical rebalance.
>> We can now see the rebalance type for each group and for each supplier in the
>> logs. I don't think we should duplicate this information.
>>
>> [2020-07-03 09:49:31,481][INFO
>> ][sys-#160%rebalancing.RebalanceStatisticsTest2%][root] Starting rebalance
>> routine [ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=3,
>> minorTopVer=0], supplier=a8be67b8-8ec7-4175-aa04-a59577100000,
>> fullPartitions=[0, 2, 4, 6, 8], histPartitions=[], rebalanceId=1]
>
> I'm talking about adding info on how much data has been transferred during
> rebalance.
> When rebalance completes, I'd like to know how much data was transferred,
> whether it was historical or full, and what the average rebalance speed was.
>
> There are two reasons for having all that.
>
> First, it helps to analyze issues by searching the logs and looking for
> anomalies.
>
> Second, it allows alerts to be automated: e.g. if you know your typical
> historical rebalance speed, you can trigger an alert if it drops below that.
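As an illustration of the alerting idea Stan describes above, a rough sketch of such a check; the baseline value, threshold factor, and alert() hook are assumptions for illustration, not part of the proposed patch:

public class RebalanceSpeedAlert {
    /** Assumed typical historical rebalance speed for this cluster, MB/s. */
    private static final double BASELINE_MB_PER_SEC = 50.0;

    /** Alert when the observed speed drops below half the baseline. */
    private static final double ALERT_FACTOR = 0.5;

    static void check(double observedMbPerSec) {
        if (observedMbPerSec < BASELINE_MB_PER_SEC * ALERT_FACTOR)
            alert("Rebalance speed dropped to " + observedMbPerSec + " MB/s" +
                " (baseline " + BASELINE_MB_PER_SEC + " MB/s)");
    }

    // Stand-in for a real alerting integration (pager, monitoring system, etc.).
    static void alert(String msg) {
        System.err.println("ALERT: " + msg);
    }
}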
>> 03.07.2020, 02:49, "Stanislav Lukyanov" <stanlukya...@gmail.com>:
>>> Kirill,
>>>
>>> I've looked through the patch.
>>> Looks good, but it feels like the first thing someone will try to do given
>>> bytesRcvd and duration is to divide one by the other to get an average speed.
>>> Do you think it's reasonable to also add it to the logs? Maybe even to the
>>> metrics?
>>>
>>> Also, this works with historical rebalance, right? Can we specify how much
>>> data was transferred via historical or full rebalance from each supplier?
>>> Maybe even estimate the transfer speed in entries and bytes for each
>>> rebalance type?
>>>
>>> Thanks,
>>> Stan
>>>
>>>> On 29 Jun 2020, at 11:50, Ivan Rakov <ivan.glu...@gmail.com> wrote:
>>>>
>>>> +1 to Alex G.
>>>>
>>>> From my experience, the most interesting cases with Ignite rebalancing
>>>> happen exactly in production. Given that we already have detailed
>>>> rebalancing logging, adding info about rebalance performance looks like a
>>>> reasonable improvement. With the new logs we'll be able to detect and
>>>> investigate situations when rebalance is slow due to uneven supplier
>>>> distribution or network issues.
>>>> An option to disable the feature at runtime shouldn't be needed often, but
>>>> it will keep us on the safe side in case something goes wrong.
>>>> The format described in https://issues.apache.org/jira/browse/IGNITE-12080
>>>> looks good to me.
>>>>
>>>> On Tue, Jun 23, 2020 at 7:01 PM ткаленко кирилл <tkalkir...@yandex.ru>
>>>> wrote:
>>>>
>>>>> Hello, Alexey!
>>>>>
>>>>> Currently there is no way to disable/enable it, but it seems that the
>>>>> logs will not be overloaded, since Alexei Scherbakov's suggestion seems
>>>>> reasonable and compact. Of course, disabling/enabling statistics
>>>>> collection could be added, via JMX for example.
>>>>>
>>>>> 23.06.2020, 18:47, "Alexey Goncharuk" <alexey.goncha...@gmail.com>:
>>>>>> Hello Maxim, folks,
>>>>>>
>>>>>> Wed, 6 May 2020 at 21:01, Maxim Muzafarov <mmu...@apache.org>:
>>>>>>
>>>>>>> We won't do performance analysis on the production environment. Each
>>>>>>> time we need performance analysis, it will be done on a test
>>>>>>> environment with verbose logging enabled. Thus I suggest moving these
>>>>>>> changes to a separate `profiling` module and extending the logging much
>>>>>>> more without any size limitations, the same as these [2] [3]
>>>>>>> activities do.
>>>>>>
>>>>>> I strongly disagree with this statement. I am not sure who is meant here
>>>>>> by 'we', but I see strong momentum in increasing observability tooling
>>>>>> that helps people understand what exactly happens in the production
>>>>>> environment [1]. Not everybody can afford two identical environments for
>>>>>> testing. We should make sure users have enough information to understand
>>>>>> the root cause after an incident has happened, and not force them to
>>>>>> reproduce it, let alone make them add another module to the classpath
>>>>>> and restart the nodes.
>>>>>> I think having this functionality in the core module with the ability to
>>>>>> disable/enable it is the right approach. Having the information printed
>>>>>> to the log is ok; having it in an event that can be sent to a
>>>>>> monitoring/tracing subsystem is even better.
>>>>>>
>>>>>> Kirill, can we enable and disable this feature at runtime to avoid the
>>>>>> very same node restarts?
>>>>>>
>>>>>> [1]
>>>>>> https://www.honeycomb.io/blog/yes-i-test-in-production-and-so-do-you/
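To make the runtime enable/disable question above concrete: a sketch of what a JMX-based toggle (the "via JMX" option Kirill mentions) could look like. The bean interface, class, and ObjectName are hypothetical, not an existing Ignite API; the MXBean naming convention makes JMX expose the attribute automatically.

import java.lang.management.ManagementFactory;
import javax.management.ObjectName;

public interface RebalanceStatisticsMXBean {
    boolean isStatisticsEnabled();
    void setStatisticsEnabled(boolean enabled);
}

class RebalanceStatisticsToggle implements RebalanceStatisticsMXBean {
    // Volatile so that a change made through JMX is visible to rebalance threads.
    private volatile boolean enabled = true;

    @Override public boolean isStatisticsEnabled() { return enabled; }
    @Override public void setStatisticsEnabled(boolean enabled) { this.enabled = enabled; }
}

class ToggleRegistration {
    public static void main(String[] args) throws Exception {
        // Hypothetical ObjectName; registers the toggle on the platform MBean server
        // so it can be flipped from a JMX console without restarting the node.
        ManagementFactory.getPlatformMBeanServer().registerMBean(
            new RebalanceStatisticsToggle(),
            new ObjectName("org.apache:group=Rebalance,name=StatisticsToggle"));
    }
}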