Re: [DISCUSS] FLIP-462: Support Custom Data Distribution for Input Stream of Lookup Join

Xintong Song Thu, 06 Jun 2024 22:53:12 -0700

+1 for this proposal.

This FLIP will make it possible for each lookup join parallel task to only
access and cache a subset of the data. This will significantly improve the
performance and reduce the overhead when using Paimon for the dimension
table. And it's general enough to also be leveraged by other connectors.
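
To make the caching argument concrete, here is a small self-contained
sketch. It does not use any Flink API; the KeyPartitioner interface and all
other names are hypothetical, made up for illustration only. It compares how
many distinct keys the lookup tasks would end up caching in total when
records are spread round-robin versus when they are routed by join key.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class LookupShuffleSketch {

    /** Hypothetical stand-in for a connector-supplied partitioner; NOT Flink's API. */
    interface KeyPartitioner {
        int partition(int recordIndex, int joinKey, int numTasks);
    }

    /** Arbitrary distribution: the task depends on record position, not on the key. */
    static final KeyPartitioner ROUND_ROBIN = (i, key, n) -> i % n;

    /** Key-based distribution: the same join key always reaches the same task. */
    static final KeyPartitioner BY_KEY = (i, key, n) -> Math.floorMod(key, n);

    /** Returns, per task, the set of distinct keys it would look up and cache. */
    static List<Set<Integer>> keysSeenPerTask(int[] keys, int numTasks, KeyPartitioner p) {
        List<Set<Integer>> seen = new ArrayList<>();
        for (int t = 0; t < numTasks; t++) {
            seen.add(new HashSet<>());
        }
        for (int i = 0; i < keys.length; i++) {
            seen.get(p.partition(i, keys[i], numTasks)).add(keys[i]);
        }
        return seen;
    }

    /** Sums the per-task cache sizes (an unbounded cache is assumed). */
    static int totalCachedEntries(List<Set<Integer>> seen) {
        int total = 0;
        for (Set<Integer> s : seen) {
            total += s.size();
        }
        return total;
    }

    /** 32 records over 8 distinct keys; each key appears 4 times in a row. */
    static int[] demoStream() {
        int[] keys = new int[32];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = (i / 4) % 8;
        }
        return keys;
    }

    public static void main(String[] args) {
        int[] keys = demoStream();
        int tasks = 4;
        System.out.println("round-robin: "
                + totalCachedEntries(keysSeenPerTask(keys, tasks, ROUND_ROBIN)));
        System.out.println("by key:      "
                + totalCachedEntries(keysSeenPerTask(keys, tasks, BY_KEY)));
    }
}
```

In this toy stream, round-robin routing makes every one of the 4 tasks see
all 8 keys (32 cached entries in total), while key-based routing leaves each
task with 2 keys (8 in total). Real workloads and cache policies will of
course differ.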


Best,

Xintong



On Fri, Jun 7, 2024 at 10:01 AM weijie guo <guoweijieres...@gmail.com>
wrote:

> Hi devs,
>
>
> I'd like to start a discussion about FLIP-462[1]: Support Custom Data
> Distribution for Input Stream of Lookup Join.
>
>
> Lookup Join is an important feature in Flink. It is typically used to
> enrich a table with data that is queried from an external system.
> If we interact with the external systems for each incoming record, we
> incur significant network IO and RPC overhead.
>
> Therefore, most connectors introduce caching to reduce the per-record
> query overhead. However, because the data distribution of Lookup
> Join's input stream is arbitrary, the cache hit rate is sometimes
> unsatisfactory.
>
>
> We want to introduce a mechanism for the connector to tell the Flink
> planner its desired input stream data distribution or partitioning
> strategy. This can significantly reduce the amount of cached data and
> improve the performance of Lookup Join.
>
>
> You can find more details in this FLIP[1]. Looking forward to hearing
> from you, thanks!
>
>
> Best regards,
>
> Weijie
>
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-462+Support+Custom+Data+Distribution+for+Input+Stream+of+Lookup+Join
>
