Re: [DISCUSS] FLIP-248: Introduce dynamic partition pruning

Jark Wu Mon, 11 Jul 2022 22:16:03 -0700

I agree with Jingsong. DPP is a special case of Dynamic Filter Pushdown in
which the join key contains partition fields. Extending this FLIP to general
filter pushdown would enable more optimizations, and they could all share the
same interface.


For example, the Trino Hive connector leverages dynamic filtering to support:
- dynamic partition pruning for partitioned tables
- dynamic bucket pruning for bucketed tables
- dynamic filters pushed into the ORC and Parquet readers to perform stripe
  or row-group pruning and save on disk I/O.
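
To make the shared mechanism concrete, here is a minimal Java sketch. It is
not taken from the FLIP or from Trino. It assumes that a dynamic filter is
simply the set of join-key values collected from the dimension side at
runtime, and that each split carries min/max statistics for the filtered
column. All names (DynamicFilterSketch, Split, prune) and all values are
hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch: one runtime filter drives both partition and min/max pruning. */
class DynamicFilterSketch {

    /** A split with [min, max] statistics for the filtered column (assumed layout). */
    static final class Split {
        final String path;
        final long min;
        final long max;

        Split(String path, long min, long max) {
            this.path = path;
            this.min = min;
            this.max = max;
        }
    }

    /**
     * Keeps only the splits whose [min, max] range contains at least one join-key value.
     * A partition is the special case where min == max == the partition value.
     */
    static List<String> prune(List<Split> splits, TreeSet<Long> joinKeys) {
        List<String> kept = new ArrayList<>();
        for (Split s : splits) {
            // Smallest collected key that is >= the split's minimum, or null if none.
            Long candidate = joinKeys.ceiling(s.min);
            if (candidate != null && candidate <= s.max) {
                kept.add(s.path);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative join keys produced at runtime by the dimension side.
        TreeSet<Long> keys = new TreeSet<>(Arrays.asList(20L, 21L));
        List<Split> splits = Arrays.asList(
                new Split("split-a", 1L, 10L),   // range misses every key
                new Split("split-b", 15L, 30L),  // range covers both keys
                new Split("split-c", 21L, 21L)); // partition-like: min == max
        System.out.println(prune(splits, keys)); // [split-b, split-c]
    }
}
```

The same check covers partition pruning and stripe or row-group pruning; only
the granularity of the statistics differs.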

Therefore, +1 to extending this FLIP to Dynamic Filter Pushdown (or Dynamic
Filtering), just like Trino [1]. The interfaces should also be adapted for that.
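
One possible shape for such an adapted interface, purely as an assumption for
discussion: the planner proposes candidate join-key fields and the source
answers with the subset worth filtering on (partition fields, plus a sorted
column as Jingsong suggested). FilterableSource, acceptedFilterFields and
LakeSourceSketch are hypothetical names, not existing Flink API, and the
column names are only illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical ability interface; not an existing Flink API. */
interface FilterableSource {
    /** Returns the subset of candidate fields on which runtime filtering is worthwhile. */
    List<String> acceptedFilterFields(List<String> candidateFields);
}

/** Toy source that accepts its partition fields and the column its files are sorted by. */
class LakeSourceSketch implements FilterableSource {
    private final Set<String> worthwhile = new LinkedHashSet<>();

    LakeSourceSketch(List<String> partitionFields, String sortColumn) {
        worthwhile.addAll(partitionFields);
        worthwhile.add(sortColumn);
    }

    @Override
    public List<String> acceptedFilterFields(List<String> candidateFields) {
        List<String> accepted = new ArrayList<>();
        for (String field : candidateFields) {
            // Only the source knows which fields prune well, so it makes the decision.
            if (worthwhile.contains(field)) {
                accepted.add(field);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        FilterableSource source =
                new LakeSourceSketch(Arrays.asList("sr_returned_date_sk"), "sr_item_sk");
        // Prints [sr_returned_date_sk, sr_item_sk]
        System.out.println(source.acceptedFilterFields(
                Arrays.asList("sr_returned_date_sk", "sr_item_sk", "sr_customer_sk")));
    }
}
```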

Besides, maybe the FLIP should also show the EXPLAIN result, since that is
also an API.

Best,
Jark

[1]: https://trino.io/docs/current/admin/dynamic-filtering.html


On Tue, 12 Jul 2022 at 09:59, Jingsong Li <jingsongl...@gmail.com> wrote:

> Thanks Godfrey for driving.
>
> I like this FLIP.
>
> We can extend this capability to more than just partitions.
> Here are some inputs from Lake Storage.
>
> The format of the splits generated by Lake Storage is roughly as follows:
> Split {
>    Path filePath;
>    Statistics[] fieldStats;
> }
>
> Stats contain the min and max of each column.
>
> If the storage is sorted by a column, split filtering on that column
> will be very effective, so it is worth pushing the RuntimeFilter down
> for this column as well as for the partition fields.
> Only the source knows this, so I suggest that the source
> return which fields are worth pushing down.
>
> My overall point is:
> This FLIP can be extended to support Source Runtime Filter push-down
> for all fields, not just dynamic partition pruning.
>
> What do you think?
>
> Best,
> Jingsong
>
> On Fri, Jul 8, 2022 at 10:12 PM godfrey he <godfre...@gmail.com> wrote:
> >
> > Hi all,
> >
> > I would like to open a discussion on FLIP-248: Introduce dynamic
> > partition pruning.
> >
> > Currently, Flink supports static partition pruning: the conditions in
> > the WHERE clause are analyzed in the optimization phase to determine
> > in advance which partitions can be safely skipped.
> > Another common scenario: the partition information is not available
> > in the optimization phase but only in the execution phase.
> > That's the problem this FLIP is trying to solve: dynamic partition
> > pruning, which could reduce the IO of partitioned table sources.
> >
> > The query pattern looks like:
> > select * from store_returns, date_dim where sr_returned_date_sk =
> > d_date_sk and d_year = 2000
> >
> > We will introduce a mechanism for detecting dynamic partition pruning
> > patterns in the optimization phase
> > and performing partition pruning at runtime by sending the dimension
> > table results to the SplitEnumerator
> > of the fact table via the existing coordinator mechanism.
> >
> > You can find more details in the FLIP-248 document [1].
> > Looking forward to any feedback.
> >
> > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-248%3A+Introduce+dynamic+partition+pruning
> > [2] POC: https://github.com/godfreyhe/flink/tree/FLIP-248
> >
> >
> > Best,
> > Godfrey
>
