Re: [DISCUSS] Column to Column filtering

[email protected] Wed, 18 Sep 2024 14:48:28 -0700

I'm curious to learn more about this feature. Is there a driving use case
that you're implementing it for? Are there common situations in which these
filters are helpful and selective?

My initial impression is that this kind of expression would have limited
utility at the table format level. Iceberg tracks column ranges for data
files and the primary use case for filtering is to skip data files at the
scan planning phase. For a column-to-column comparison, you would only be
able to eliminate data files whose column ranges rule out every row. That
is, if you're looking for rows where x < y, you can only eliminate a file
when min(x) >= max(y); a file where max(x) < min(y) is the opposite case,
where every row matches. To me, it seems unlikely that this would be generic
enough to be worth it, but if there are use cases where this happens and
speeds up queries, I think it may make sense.
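
To make the pruning condition concrete, here is a minimal sketch in plain
Python. It is not Iceberg's API: ColumnRange, can_skip_less_than and
can_skip_not_equal are hypothetical names, and it assumes the per-file bounds
are exact (not truncated) and that dates are stored as integers such as days
since epoch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnRange:
    """Hypothetical per-file min/max for one column (not an Iceberg class)."""
    lower: int
    upper: int


def can_skip_less_than(x: ColumnRange, y: ColumnRange) -> bool:
    """Return True when no row in the file can satisfy x < y."""
    # Every x value is >= x.lower and every y value is <= y.upper, so
    # x.lower >= y.upper implies x >= y for every row in the file.
    return x.lower >= y.upper


def can_skip_not_equal(x: ColumnRange, y: ColumnRange) -> bool:
    """Return True when no row in the file can satisfy x <> y."""
    # Only possible when both columns hold one and the same constant value.
    return x.lower == x.upper == y.lower == y.upper


# Predicate: travel_date < expected_date (i.e. expected_date > travel_date).
travel = ColumnRange(lower=300, upper=400)
expected = ColumnRange(lower=100, upper=200)
skip = can_skip_less_than(travel, expected)  # ranges disjoint, wrong order
```

Overlapping ranges, or disjoint ranges ordered the other way, cannot be
skipped, which is why the filter is only selective when the two columns are
well separated within a file.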

Ryan

On Tue, Sep 17, 2024 at 6:21 AM Baldwin, Jennifer
<[email protected]> wrote:

> I’m starting a thread to discuss a feature for comparisons using column
> references on both the left and right sides of an expression, wherever
> Iceberg supports comparing a column reference to literal value(s). The use
> case we want to support is filtering on date columns within a single table.
> For instance:
>
> select * from travel_table
> where expected_date > travel_date;
>
> select * from travel_table
> where payment_date <> due_date;
>
> The changes will impact row and scan file filtering.  Impacted jars are
> iceberg-api, iceberg-core, iceberg-orc and iceberg-parquet.
>
> Is this a feature the Iceberg community would be willing to accept?
>
> Here is a link to a draft PR with the current changes. Thanks.
>
> https://github.com/apache/iceberg/pull/11152
