Sorry, forgot. [1] - org.apache.ignite.internal.processors.cache.CacheGroupsMetricsRebalanceTest#testCacheGroupRebalance
03.07.2020, 17:20, "ткаленко кирилл" <tkalkir...@yandex.ru>:
> Hi, Stan!
>
> I don't understand you yet.
>
> Now you can use the metrics as it was done in the test [1]. Or can you tell me where to do this, for example when rebalancing completes for all groups?
>
> See what is now available and added in the logs:
> 1) Which group is rebalanced and which type of rebalance.
> Starting rebalance routine [grp0, topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, fullPartitions=[4, 7], histPartitions=[], rebalanceId=1]
>
> 2) Completion of rebalancing from one of the suppliers.
> Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB, topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3, rebalanceId=1]
>
> 3) Completion of the entire rebalance.
> Completed rebalance chain: [rebalanceId=1, partitions=116, entries=400, duration=41ms, bytesRcvd=40,4 KB]
>
> These messages have a common parameter rebalanceId=1.
>
> 03.07.2020, 16:48, "Stanislav Lukyanov" <stanlukya...@gmail.com>:
>>> On 3 Jul 2020, at 09:51, ткаленко кирилл <tkalkir...@yandex.ru> wrote:
>>>
>>> To calculate the average value, you can use the existing metrics "RebalancingStartTime", "RebalancingLastCancelledTime", "RebalancingEndTime", "RebalancingPartitionsLeft", "RebalancingReceivedKeys" and "RebalancingReceivedBytes".
>>
>> You can calculate it, and I believe that this is the first thing anyone would do when reading these logs and metrics.
>> If it's an essential thing, then maybe it should be available out of the box?
>>
>>> This also works correctly with the historical rebalance.
>>> Now we can see the rebalance type for each group and for each supplier in the logs. I don't think we should duplicate this information.
>>>
>>> [2020-07-03 09:49:31,481][INFO ][sys-#160%rebalancing.RebalanceStatisticsTest2%][root] Starting rebalance routine [ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=0], supplier=a8be67b8-8ec7-4175-aa04-a59577100000, fullPartitions=[0, 2, 4, 6, 8], histPartitions=[], rebalanceId=1]
>>
>> I'm talking about adding info on how much data has been transferred during rebalance.
>> When rebalance completes I'd like to know how much data has been transferred, whether it was historical or full, and what the average rebalance speed was.
>>
>> There are two reasons for having all that.
>>
>> First, it helps to analyze issues by searching the logs and looking for anomalies.
>>
>> Second, it allows automating alerts: e.g. if you know your typical historical rebalance speed, you can trigger an alert if it drops below that.
>>
>>> 03.07.2020, 02:49, "Stanislav Lukyanov" <stanlukya...@gmail.com>:
>>>> Kirill,
>>>>
>>>> I've looked through the patch.
>>>> Looks good, but it feels like the first thing someone will try to do, given bytesRcvd and duration, is to divide one by the other to get an average speed.
>>>> Do you think it's reasonable to also add it to the logs? Maybe even to the metrics?
>>>>
>>>> Also, this works with historical rebalance, right? Can we specify how much data was transferred via historical or full rebalance from each supplier?
>>>> Maybe even estimate the transfer speed in entries and bytes for each rebalance type?
>>>>
>>>> Thanks,
>>>> Stan
>>>>
>>>>> On 29 Jun 2020, at 11:50, Ivan Rakov <ivan.glu...@gmail.com> wrote:
>>>>>
>>>>> +1 to Alex G.
>>>>>
>>>>> From my experience, the most interesting cases with Ignite rebalancing happen exactly in production. Given that we already have detailed rebalancing logging, adding info about rebalance performance looks like a reasonable improvement. With the new logs we'll be able to detect and investigate situations when rebalance is slow due to uneven supplier distribution or network issues.
>>>>> The option to disable the feature at runtime shouldn't be used often, but it will keep us on the safe side in case something goes wrong.
>>>>> The format described in https://issues.apache.org/jira/browse/IGNITE-12080 looks good to me.
>>>>>
>>>>> On Tue, Jun 23, 2020 at 7:01 PM ткаленко кирилл <tkalkir...@yandex.ru> wrote:
>>>>>
>>>>>> Hello, Alexey!
>>>>>>
>>>>>> Currently there is no way to disable / enable it, but it seems that the logs will not be overloaded, since Alexei Scherbakov's suggestion seems reasonable and compact. Of course, we could add disabling / enabling of statistics collection via JMX, for example.
>>>>>>
>>>>>> 23.06.2020, 18:47, "Alexey Goncharuk" <alexey.goncha...@gmail.com>:
>>>>>>> Hello Maxim, folks,
>>>>>>>
>>>>>>> Wed, 6 May 2020 at 21:01, Maxim Muzafarov <mmu...@apache.org>:
>>>>>>>
>>>>>>>> We won't do performance analysis on the production environment. Each time we need performance analysis it will be done on a test environment with verbose logging enabled. Thus I suggest moving these changes to a separate `profiling` module and extending the logging much more without any size limitations. The same as these [2] [3] activities do.
>>>>>>>
>>>>>>> I strongly disagree with this statement. I am not sure who is meant here by 'we', but I see strong momentum in increasing observability tooling that helps people understand what exactly happens in the production environment [1]. Not everybody can afford two identical environments for testing. We should make sure users have enough information to understand the root cause after an incident has happened, and not force them to reproduce it, let alone make them add another module to the classpath and restart the nodes.
>>>>>>> I think having this functionality in the core module with the ability to disable/enable it is the right approach. Having the information printed to the log is ok; having it in an event that can be sent to a monitoring/tracing subsystem is even better.
>>>>>>>
>>>>>>> Kirill, can we enable and disable this feature at runtime to avoid the very same node restarts?
>>>>>>>
>>>>>>> [1] https://www.honeycomb.io/blog/yes-i-test-in-production-and-so-do-you/
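
For illustration, a minimal sketch of the average-speed calculation discussed above. It assumes the cache group metrics named in the thread ("RebalancingStartTime", "RebalancingEndTime", "RebalancingReceivedBytes") are exported over JMX; the ObjectName pattern, the CacheGroupMetrics bean name and the TYPICAL_MB_PER_SEC threshold are placeholders for illustration, not the actual Ignite bean layout.

    // Sketch only: reads the three metrics named in the thread over JMX and derives
    // the average rebalance speed. ObjectName pattern and bean name are assumptions;
    // the real names depend on the Ignite version and the configured metric exporter.
    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    public class RebalanceSpeedSample {
        /** Hypothetical alert threshold: typical historical rebalance speed, MB/s. */
        private static final double TYPICAL_MB_PER_SEC = 50.0;

        public static void main(String[] args) throws Exception {
            MBeanServer srv = ManagementFactory.getPlatformMBeanServer();

            // Placeholder ObjectName pattern for the "grp0" cache group metrics registry.
            ObjectName pattern = new ObjectName(
                "org.apache:group=\"cacheGroups.grp0\",name=CacheGroupMetrics,*");

            for (ObjectName name : srv.queryNames(pattern, null)) {
                long start = (Long) srv.getAttribute(name, "RebalancingStartTime");
                long end = (Long) srv.getAttribute(name, "RebalancingEndTime");
                long bytes = (Long) srv.getAttribute(name, "RebalancingReceivedBytes");

                // Both timestamps must be set; otherwise rebalance hasn't finished yet.
                if (start > 0 && end > start) {
                    double mbPerSec = (bytes / (1024.0 * 1024.0)) / ((end - start) / 1000.0);

                    System.out.printf("grp0 average rebalance speed: %.2f MB/s%n", mbPerSec);

                    // The alerting idea from the thread: warn if speed drops below the usual level.
                    if (mbPerSec < TYPICAL_MB_PER_SEC)
                        System.out.println("WARNING: rebalance slower than typical for grp0");
                }
            }
        }
    }

Dividing RebalancingReceivedBytes by the wall-clock rebalance duration in this way is the number Stan suggests printing out of the box.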
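Likewise, a minimal sketch of the runtime enable/disable switch Kirill and Alexey discuss. The StatisticsToggleMXBean interface, the class and the ObjectName below are hypothetical and only show one possible shape of such a JMX switch; they are not part of the patch under review.

    // Sketch only: a JMX switch that rebalance statistics code could consult before logging.
    import java.lang.management.ManagementFactory;
    import java.util.concurrent.atomic.AtomicBoolean;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    public class RebalanceStatisticsToggle {
        /** Hypothetical MXBean interface for the runtime switch. */
        public interface StatisticsToggleMXBean {
            boolean isStatisticsEnabled();
            void setStatisticsEnabled(boolean on);
        }

        /** Illustrative implementation guarding rebalance statistics output. */
        public static class StatisticsToggle implements StatisticsToggleMXBean {
            /** Flag the (hypothetical) statistics code path would check before logging. */
            private final AtomicBoolean enabled = new AtomicBoolean(true);

            @Override public boolean isStatisticsEnabled() {
                return enabled.get();
            }

            @Override public void setStatisticsEnabled(boolean on) {
                enabled.set(on);
            }
        }

        /** Registers the switch so tools like JConsole or jmxterm can flip it at runtime. */
        public static StatisticsToggle register() throws Exception {
            StatisticsToggle toggle = new StatisticsToggle();
            MBeanServer srv = ManagementFactory.getPlatformMBeanServer();

            // Hypothetical ObjectName; a real implementation would follow Ignite's MBean naming.
            srv.registerMBean(toggle, new ObjectName(
                "org.apache.ignite:group=Rebalance,name=StatisticsToggle"));

            return toggle;
        }
    }

Flipping the attribute through JMX avoids the node restarts Alexey mentions, while keeping the default behavior (statistics on) unchanged.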