Re: [EXTERNAL] Re: [DISCUSS] Column to Column filtering

Benny Chow Thu, 03 Oct 2024 23:41:11 -0700

Assuming the table contained smaller and better correlated files, I think a
workaround where you materialized the timestamp difference between two
columns could be effective for data file pruning.  So if a particular
planned departure date was associated with a lot of delays and the table
was partitioned by destination_cd and sorted by planned departure date,
materializing the diff between planned departure date and actual departure
date will result in a single field with min/max bounds that could be
filtered on.  You could then get data file pruning for a filter such as
"departed late, but by no more than an hour".
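
To make the pruning idea concrete, here is a rough sketch of how per-file
min/max bounds on a materialized delay column could be used to skip files.
It is not Iceberg's API: FileStats, may_contain, the file names, and the
delay values are all hypothetical and only illustrate the range check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file bounds for a materialized delay column (minutes)."""
    path: str
    min_delay_min: int
    max_delay_min: int


def may_contain(stats: FileStats, lower_exclusive: int, upper_inclusive: int) -> bool:
    """True if the file might hold a row with lower < delay <= upper.

    A file is skipped only when its [min, max] range cannot overlap the
    requested range; a True result still requires row-level filtering.
    """
    return (stats.max_delay_min > lower_exclusive
            and stats.min_delay_min <= upper_inclusive)


# Made-up bounds, as if the table were sorted by planned departure date.
files = [
    FileStats("a.parquet", -5, 0),     # early or on time only
    FileStats("b.parquet", 10, 45),    # entirely inside the range
    FileStats("c.parquet", 90, 240),   # every row more than an hour late
    FileStats("d.parquet", 30, 120),   # overlaps the range
]

# Filter: late departure, but not more than 60 minutes late.
to_scan = [f.path for f in files if may_contain(f, 0, 60)]
print(to_scan)
```

With these made-up bounds, a.parquet and c.parquet are skipped during
planning, while b.parquet and d.parquet still have to be read.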


On Mon, Sep 30, 2024 at 12:56 PM Baldwin, Jennifer
<jennifer.bald...@teradata.com.invalid> wrote:

> It has come to my attention that there was no attachment.  I have created
> a Google Doc instead.  Thanks.
>
>
>
>
> https://docs.google.com/document/d/1HZa3AyPPfgz9VOVA9rPhJJ8f3F-3tEel_53nIlvYlo0/edit?usp=sharing
>
>
>
> *From: *Baldwin, Jennifer <jennifer.bald...@teradata.com.INVALID>
> *Date: *Friday, September 27, 2024 at 12:54 PM
> *To: *dev@iceberg.apache.org <dev@iceberg.apache.org>
> *Cc: *jennifer.bald...@teradata.com.invalid
> <jennifer.bald...@teradata.com.INVALID>
> *Subject: *Re: [EXTERNAL] Re: [DISCUSS] Column to Column filtering
>
> Please see attached, I hope this provides you with more clarity on the use
> case we hope to support.  Let me know if you have any further questions.
>
>
>
> *From: *Russell Spitzer <russell.spit...@gmail.com>
> *Date: *Wednesday, September 18, 2024 at 6:15 PM
> *To: *dev@iceberg.apache.org <dev@iceberg.apache.org>
> *Cc: *jennifer.bald...@teradata.com.invalid
> <jennifer.bald...@teradata.com.invalid>
> *Subject: *[EXTERNAL] Re: [DISCUSS] Column to Column filtering
>
>
>
> I have similar concerns to Ryan, although I could see that if we were
> writing smaller and better correlated files, this could be a big help.
> Specifically with variant use cases this may be very useful. I would love
> to hear more about the use cases and rationale for adding this. Do you have
> any specific examples you can go into detail on?
>
>
>
> On Wed, Sep 18, 2024 at 4:48 PM rdb...@gmail.com <rdb...@gmail.com> wrote:
>
> I'm curious to learn more about this feature. Is there a driving use case
> that you're implementing it for? Are there common situations in which these
> filters are helpful and selective?
>
>
>
> My initial impression is that this kind of expression would have limited
> utility at the table format level. Iceberg tracks column ranges for data
> files and the primary use case for filtering is to skip data files at the
> scan planning phase. For a column-to-column comparison, you would only be
> able to eliminate data files that have non-overlapping ranges. That is, if
> you're looking for rows where x < y, you can only eliminate a file when
> min(x) >= max(y). To me, it seems unlikely that this would be generic enough
> to be worth it, but if there are use cases where this can happen and speed
> up queries I think it may make sense.
>
>
>
> Ryan
>
>
>
> On Tue, Sep 17, 2024 at 6:21 AM Baldwin, Jennifer
> <jennifer.bald...@teradata.com.invalid> wrote:
>
> I’m starting a thread to discuss a feature for comparisons using column
> references on the left and right sides of an expression wherever Iceberg
> supports column reference to literal value(s) comparisons.  The use case we
> want to support is filtering of date columns from a single table.  For
> instance:
>
>
>
> select * from travel_table
> where expected_date > travel_date;
>
> select * from travel_table
> where payment_date <> due_date;
>
>
>
>
>
> The changes will impact row and scan file filtering.  Impacted jars are
> iceberg-api, iceberg-core, iceberg-orc and iceberg-parquet.
>
>
>
> Is this a feature the Iceberg community would be willing to accept?
>
>
>
> Here is a link to a draft PR with the current changes.  Thanks.
>
> https://github.com/apache/iceberg/pull/11152
>
>
>
>
