Re: [DISCUSS] FLIP-434: Support optimizations for pre-partitioned data sources

Jane Chan Tue, 12 Mar 2024 00:56:25 -0700

Hi Jeyhun,

Thank you for leading the discussion. I'm generally +1 with this proposal,
along with some questions. Please see my comments below.

1. Concerning the `sourcePartitions()` method, the partition information
returned during the optimization phase may not be the same as the partition
information during runtime execution. For long-running jobs, partitions may
be continuously created. Is this FLIP equipped to handle scenarios?

2. Regarding the `RemoveRedundantShuffleRule` optimization rule, I
understand that it is also necessary to verify whether the hash key within
the Exchange node is consistent with the partition key defined in the table
source that implements `SupportsPartitioning`.

3. Could you elaborate on the desired physical plan and integration with
`CompiledPlan` to enhance the overall functionality?

Best,
Jane

On Tue, Mar 12, 2024 at 11:11 AM Jim Hughes <jhug...@confluent.io.invalid>
wrote:

> Hi Jeyhun,
>
> I like the idea!  Given FLIP-376[1], I wonder if it'd make sense to
> generalize FLIP-434 to be about "pre-divided" data to cover "buckets" and
> "partitions" (and maybe even situations where a data source is partitioned
> and bucketed).
>
> Separate from that, the page mentions TPC-H Q1 as an example.  For a join,
> any two tables joined on the same bucket key should provide a concrete
> example of a join.  Systems like Kafka Streams/ksqlDB call this
> "co-partitioning"; for those systems, it is a requirement placed on the
> input sources.  For Flink, with FLIP-434, the proposed planner rule
> could remove the shuffle.
>
> Definitely a fun idea; I look forward to hearing more!
>
> Cheers,
>
> Jim
>
>
> 1.
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-376%3A+Add+DISTRIBUTED+BY+clause
> 2.
>
> https://docs.ksqldb.io/en/latest/developer-guide/joins/partition-data/#co-partitioning-requirements
>
> On Sun, Mar 10, 2024 at 3:38 PM Jeyhun Karimov <je.kari...@gmail.com>
> wrote:
>
> > Hi devs,
> >
> > I’d like to start a discussion on FLIP-434: Support optimizations for
> > pre-partitioned data sources [1].
> >
> > The FLIP introduces taking advantage of pre-partitioned data sources for
> > SQL/Table API (it is already supported as experimental feature in
> > DataStream API [2]).
> >
> >
> > Please find more details in the FLIP wiki document [1].
> > Looking forward to your feedback.
> >
> > Regards,
> > Jeyhun
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-434%3A+Support+optimizations+for+pre-partitioned+data+sources
> > [2]
> >
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/experimental/
> >
>

Re: [DISCUSS] FLIP-434: Support optimizations for pre-partitioned data sources

Reply via email to