Hi Vladislav,

Thanks for raising this topic.
The currently present IGNITE_PDS_WAL_REBALANCE_THRESHOLD property (default:
500_000) is controversial. With the default of 1024 partitions, a partition
becomes eligible for WAL delta rebalancing only once it holds 500,000
entries, so a uniformly distributed cache needs on the order of half a
billion entries before historical rebalance kicks in at all. In fact, it's
effectively disabled for most production cases, which makes rebalancing of
persistent caches unreasonably long.

I think your approach [1] makes much more sense than the current heuristic;
let's move forward with the proposed solution.

That said, there are other corner cases, e.g. this one:
- The configured WAL archive size is big (>100 GB)
- The cache has small partitions (e.g. 1000 entries)
- Updates are infrequent (e.g. ~100 in the whole WAL history of any node)
- Another cache has very frequent updates that occupy >99% of the WAL
In such a scenario we may need to iterate over >100 GB of WAL in order to
fetch the <1% of updates we actually need. Even though the amount of network
traffic is still optimized, it would be more effective to transfer the
~1000-entry partitions fully instead of reading >100 GB of WAL.

I want to highlight that your heuristic definitely improves the situation,
but because of such corner cases we should keep a fallback lever to restrict
or limit historical rebalance, as before. It would probably be handy to keep
the IGNITE_PDS_WAL_REBALANCE_THRESHOLD property with a low default value
(1000, 500 or even 0) and apply your heuristic only to partitions above that
size, along the lines of the sketch below.
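
To make that concrete, here is a minimal sketch of the combined decision;
the class, method and parameter names are mine for illustration, not the
actual Ignite internals:

    // Low absolute threshold as the fallback lever, with the N < M
    // heuristic from [1] applied only to partitions above it.
    class RebalanceDecisionSketch {
        // Read the same way Ignite reads its system properties.
        static final long THRESHOLD =
            Long.getLong("IGNITE_PDS_WAL_REBALANCE_THRESHOLD", 1000L);

        /**
         * @param partSize M, the number of rows in the partition.
         * @param walUpdates N, the updates in the WAL delta to iterate.
         * @param historyAvailable Whether suitable WAL history exists.
         */
        static boolean useHistoricalRebalance(long partSize, long walUpdates,
            boolean historyAvailable) {
            if (!historyAvailable)
                return false; // No WAL history -> full rebalance is the only option.

            if (partSize <= THRESHOLD)
                return false; // Small partition: transferring it fully is cheaper.

            return walUpdates < partSize; // Heuristic [1]: WAL delta only when N < M.
        }
    }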

Regarding [2]: it looks like an improvement that can mitigate some corner
cases (including the one I have described). I'm OK with it as long as it
takes the reordering of data updates on backup nodes into account. We don't
track skipped updates for atomic caches, so detecting the absence of updates
between two checkpoint markers with the same partition counter can yield a
false positive (see the sketch below).
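
For illustration, a sketch of the skip condition; CheckpointMarker and
partitionCounter() are hypothetical names I made up, not the actual Ignite
API:

    class CheckpointSkipSketch {
        /** Hypothetical read-only view of a checkpoint marker in the WAL. */
        interface CheckpointMarker {
            long partitionCounter(int grpId, int partId);
        }

        static boolean canSkipInterval(CheckpointMarker cp, CheckpointMarker next,
            int grpId, int partId) {
            // Safe for transactional caches: equal update counters imply that
            // no updates for this partition landed between the two checkpoints.
            // NOT safe for atomic caches: updates may be applied out of order
            // on backup nodes and skipped updates are not tracked, so the
            // counter can match at both markers while updates exist in between.
            return next.partitionCounter(grpId, partId)
                == cp.partitionCounter(grpId, partId);
        }
    }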

--
Best Regards,
Ivan Rakov

On Tue, Jul 14, 2020 at 3:03 PM Vladislav Pyatkov <vldpyat...@gmail.com>
wrote:

> Hi guys,
>
> I want to implement a more honest heuristic for historical rebalance.
> Currently, the cluster chooses whether to perform historical rebalance
> based only on partition size. This threshold is better known by the name
> of the IGNITE_PDS_WAL_REBALANCE_THRESHOLD property.
> It prevents historical rebalance when a partition is too small, but not
> when the WAL contains more updates than the partition itself holds: in
> that case historical rebalance can still be chosen.
> There is a ticket for implementing a fairer heuristic [1].
>
> My idea for the implementation is to estimate the size of the data that
> will be transferred over the network. In other words, if recovering a
> partition on another node requires rebalancing a part of the WAL that
> contains N updates, while the partition holds M rows in total, historical
> rebalance should be chosen when N < M (provided the WAL history is
> present as well).
>
> This approach is easy to implement, because the coordinator node knows
> the partition sizes and the counter intervals. But even so, the cluster
> can still find only a few updates in a very long WAL history. I assume it
> is possible to work around this if the historical rebalance iterator
> skips checkpoints that contain no updates for the particular cache. A
> checkpoint can be skipped if the counters for the cache (maybe even for
> specific partitions) did not change between it and the next one.
>
> There is a ticket for improving the historical rebalance iterator [2].
>
> I want to hear the community's view on the thoughts above.
> Maybe someone has another opinion?
>
> [1]: https://issues.apache.org/jira/browse/IGNITE-13253
> [2]: https://issues.apache.org/jira/browse/IGNITE-13254
>
> --
> Vladislav Pyatkov
>
