Hi, Stan!

I don't quite understand you yet.

For now, you can use the metrics the same way it is done in the test [1]. Or could 
you tell me where you would like this to be done, for example when rebalancing 
completes for all groups?
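
In the meantime, here is a rough sketch of how the average speed could be derived 
from the existing rebalancing metrics over plain JMX. This is not part of the patch, 
and the MBean ObjectName below is only a guess and depends on the Ignite version and 
instance name:

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class RebalanceAvgSpeed {
    public static void main(String[] args) throws Exception {
        MBeanServer srv = ManagementFactory.getPlatformMBeanServer();

        // Hypothetical object name; look up the actual cache group bean of your node.
        ObjectName grp = new ObjectName("org.apache:group=\"Cache groups\",name=\"grp0\"");

        // Metric names are the ones already exposed for cache groups.
        long start = (Long)srv.getAttribute(grp, "RebalancingStartTime");
        long end = (Long)srv.getAttribute(grp, "RebalancingEndTime");
        long bytes = (Long)srv.getAttribute(grp, "RebalancingReceivedBytes");

        if (start > 0 && end > start)
            System.out.printf("Average rebalance speed: %.1f KB/s%n",
                (bytes / 1024.0) / ((end - start) / 1000.0));
    }
}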

Here is what is now available in the logs:
1) Which group is being rebalanced and which type of rebalance is used.
Starting rebalance routine [grp0, topVer=AffinityTopologyVersion [topVer=4, 
minorTopVer=0], supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, 
fullPartitions=[4, 7], histPartitions=[], rebalanceId=1]

2) Completion of rebalancing from one of the suppliers.
Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, 
partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB, 
topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3, 
rebalanceId=1]

3) Completion of the entire rebalance.
Completed rebalance chain: [rebalanceId=1, partitions=116, entries=400, 
duration=41ms, bytesRcvd=40,4 KB]

These messages have a common parameter rebalanceId=1.
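
As a rough back-of-the-envelope check using the numbers from the examples above, the 
average speed can already be derived from these values: 5,9 KB / 8 ms is roughly 
0.7 MB/s for that supplier, and 40,4 KB / 41 ms is just under 1 MB/s for the whole 
chain.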

03.07.2020, 16:48, "Stanislav Lukyanov" <stanlukya...@gmail.com>:
>>  On 3 Jul 2020, at 09:51, ткаленко кирилл <tkalkir...@yandex.ru> wrote:
>>
>>  To calculate the average value, you can use the existing metrics 
>> "RebalancingStartTime", "RebalancingLastCancelledTime", 
>> "RebalancingEndTime", "RebalancingPartitionsLeft", "RebalancingReceivedKeys" 
>> and "RebalancingReceivedBytes".
>
> You can calculate it, and I believe that this is the first thing anyone would 
> do when reading these logs and metrics.
> If that's an essential thing then maybe it should be available out of the box?
>
>>  This also works correctly with historical rebalance.
>>  Now we can see the rebalance type for each group and for each supplier in the logs.
>> I don't think we should duplicate this information.
>>
>>  [2020-07-03 09:49:31,481][INFO 
>> ][sys-#160%rebalancing.RebalanceStatisticsTest2%][root] Starting rebalance 
>> routine [ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=3, 
>> minorTopVer=0], supplier=a8be67b8-8ec7-4175-aa04-a59577100000, 
>> fullPartitions=[0, 2, 4, 6, 8], histPartitions=[], rebalanceId=1]
>
> I'm talking about adding info on how much data has been transferred during 
> rebalance.
> When rebalance completes I'd like to know how much data has been transferred, 
> was it historical or full, what was the average rebalance speed.
>
> There are two reasons for having all that.
>
> First, it helps to analyze the issues by searching the logs and looking for 
> anomalies.
>
> Second, this makes it possible to automate alerts: e.g. if you know your typical 
> historical rebalance speed, you can trigger an alert if it drops below that.
>
>>  03.07.2020, 02:49, "Stanislav Lukyanov" <stanlukya...@gmail.com>:
>>>  Kirill,
>>>
>>>  I've looked through the patch.
>>>  Looks good, but it feels like the first thing someone will try to do given 
>>> bytesRcvd and duration is to divide one by another to get an average speed.
>>>  Do you think it's reasonable to also add it to the logs? Maybe even to the 
>>> metrics?
>>>
>>>  Also, this works with historical rebalance, right? Can we specify how much 
>>> data was transferred via historical or full rebalance from each supplier?
>>>  Maybe even estimate transfer speed in entries and bytes for each rebalance 
>>> type?
>>>
>>>  Thanks,
>>>  Stan
>>>
>>>>   On 29 Jun 2020, at 11:50, Ivan Rakov <ivan.glu...@gmail.com> wrote:
>>>>
>>>>   +1 to Alex G.
>>>>
>>>>   From my experience, the most interesting cases with Ignite rebalancing
>>>>   happen exactly in production. Given that we already have
>>>>   detailed rebalancing logging, adding info about rebalance performance 
>>>> looks
>>>>   like a reasonable improvement. With new logs we'll be able to detect and
>>>>   investigate situations when rebalance is slow due to uneven suppliers
>>>>   distribution or network issues.
>>>>   The option to disable the feature at runtime shouldn't be used often, but it
>>>>   will keep us on the safe side in case something goes wrong.
>>>>   The format described in
>>>>   https://issues.apache.org/jira/browse/IGNITE-12080 looks
>>>>   good to me.
>>>>
>>>>   On Tue, Jun 23, 2020 at 7:01 PM ткаленко кирилл <tkalkir...@yandex.ru>
>>>>   wrote:
>>>>
>>>>>   Hello, Alexey!
>>>>>
>>>>>   Currently there is no way to disable/enable it, but it seems that the
>>>>>   logs will not be overloaded, since Alexei Scherbakov's suggestion seems
>>>>>   reasonable and compact. Of course, disabling/enabling statistics
>>>>>   collection could be added via JMX, for example.
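
For what it's worth, a purely hypothetical sketch of what such a runtime toggle might 
look like (none of these names exist in the patch, they only illustrate the idea):

public interface RebalanceStatisticsMXBean {
    /** @return Whether rebalance statistics are currently collected and logged. */
    boolean isStatisticsEnabled();

    /** Enables or disables collection of rebalance statistics at runtime. */
    void setStatisticsEnabled(boolean enabled);
}

Registered as a regular MXBean, such a switch could be flipped from any JMX console 
without a node restart.
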
>>>>>
>>>>>   23.06.2020, 18:47, "Alexey Goncharuk" <alexey.goncha...@gmail.com>:
>>>>>>   Hello Maxim, folks,
>>>>>>
>>>>>>   Wed, 6 May 2020 at 21:01, Maxim Muzafarov <mmu...@apache.org>:
>>>>>>
>>>>>>>   We won't do performance analysis on the production environment. Each
>>>>>>>   time we need performance analysis it will be done on a test
>>>>>>>   environment with verbose logging enabled. Thus I suggest moving these
>>>>>>>   changes to a separate `profiling` module and extending the logging much
>>>>>>>   more without any size limitations. The same as these [2] [3]
>>>>>>>   activities do.
>>>>>>
>>>>>>   I strongly disagree with this statement. I am not sure who is meant 
>>>>>> here
>>>>>>   by 'we', but I see a strong momentum in increasing observability 
>>>>>> tooling
>>>>>>   that helps people to understand what exactly happens in the production
>>>>>>   environment [1]. Not everybody can afford two identical environments 
>>>>>> for
>>>>>>   testing. We should make sure users have enough information to 
>>>>>> understand
>>>>>>   the root cause after the incident happened, and not force them to
>>>>>>   reproduce
>>>>>>   it, let alone make them add another module to the classpath and restart
>>>>>>   the
>>>>>>   nodes.
>>>>>>   I think having this functionality in the core module with the ability 
>>>>>> to
>>>>>>   disable/enable it is the right approach. Having the information printed
>>>>>>   to the log is ok, having it in an event that can be sent to a 
>>>>>> monitoring/tracing
>>>>>>   subsystem is even better.
>>>>>>
>>>>>>   Kirill, can we enable and disable this feature at runtime to avoid
>>>>>>   restarting the very same nodes?
>>>>>>
>>>>>>   [1]
>>>>>>   https://www.honeycomb.io/blog/yes-i-test-in-production-and-so-do-you/
