Re: [DISCUSS] FLIP-462: Support Custom Data Distribution for Input Stream of Lookup Join

Zhanghao Chen Tue, 11 Jun 2024 23:58:54 -0700

Thanks for the clarification, that makes sense. +1 for the proposal.

Best,
Zhanghao Chen
________________________________
From: weijie guo <[email protected]>
Sent: Wednesday, June 12, 2024 14:20
To: [email protected] <[email protected]>
Subject: Re: [DISCUSS] FLIP-462: Support Custom Data Distribution for Input 
Stream of Lookup Join


Hi Zhanghao,

Thanks for the reply!

> Could you give a more concrete example in production on when a custom
partitioning strategy will outperform partitioning by key

The key point here is partitioning logic cannot be fully expressed with all
or part of the join key. That is, even if we know which fields are used to
calculate buckets, still have to face the following problem:

1. The mapping from the bucket field to the bucket id is not necessarily
done via hashcode, and even if it is, the hash algorithm may be different
from the one used in Flink. The planner can't know how to do this mapping.
2. In order to get the bucket id, we have to mod the bucket number, but
planner has no notion of bucket number.



Best regards,

Weijie


Zhanghao Chen <[email protected]> 于2024年6月12日周三 13:55写道：

> Thanks for driving this, Weijie. Usually, the data distribution of the
> external system is closely related to the keys, e.g. computing the bucket
> index by key hashcode % bucket num, so I'm not sure about how much
> difference there are between partitioning by key and a custom partitioning
> strategy. Could you give a more concrete example in production on when a
> custom partitioning strategy will outperform partitioning by key? Since
> you've mentioned Paimon in doc, maybe an example on Paimon.
>
> Best,
> Zhanghao Chen
> ________________________________
> From: weijie guo <[email protected]>
> Sent: Friday, June 7, 2024 9:59
> To: dev <[email protected]>
> Subject: [DISCUSS] FLIP-462: Support Custom Data Distribution for Input
> Stream of Lookup Join
>
> Hi devs,
>
>
> I'd like to start a discussion about FLIP-462[1]: Support Custom Data
> Distribution for Input Stream of Lookup Join.
>
>
> Lookup Join is an important feature in Flink, It is typically used to
> enrich a table with data that is queried from an external system.
> If we interact with the external systems for each incoming record, we
> incur significant network IO and RPC overhead.
>
> Therefore, most connectors introduce caching to reduce the per-record
> level query overhead. However, because the data distribution of Lookup
> Join's input stream is arbitrary, the cache hit rate is sometimes
> unsatisfactory.
>
>
> We want to introduce a mechanism for the connector to tell the Flink
> planner its desired input stream data distribution or partitioning
> strategy. This can significantly reduce the amount of cached data and
> improve performance of Lookup Join.
>
>
> You can find more details in this FLIP[1]. Looking forward to hearing
> from you, thanks!
>
>
> Best regards,
>
> Weijie
>
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-462+Support+Custom+Data+Distribution+for+Input+Stream+of+Lookup+Join
>

Re: [DISCUSS] FLIP-462: Support Custom Data Distribution for Input Stream of Lookup Join

Reply via email to