Assuming the table contained smaller and better correlated files, I think a workaround where you materialize the difference between the two timestamp columns could be effective for data file pruning. For example, suppose a particular planned departure date was associated with a lot of delays, and the table was partitioned by destination_cd and sorted by planned departure date. Materializing the difference between the planned and actual departure timestamps would give you a single field with min/max bounds that can be filtered on. You could then get data file pruning for a filter like "departed late, but not more than an hour late".
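To make the idea concrete, here is a rough sketch in Python of how the two approaches would behave at planning time. It is not Iceberg code: the FileStats class, the two helper functions, the file names, and all of the bounds (in minutes) are made up for illustration. It only assumes that each data file carries min/max bounds per column, including the materialized delay column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file min/max bounds, in minutes (illustration only)."""
    path: str
    planned_min: int
    planned_max: int
    actual_min: int
    actual_max: int
    delay_min: int  # min(actual - planned), the materialized column
    delay_max: int  # max(actual - planned)


def may_match_late_raw(f: FileStats) -> bool:
    # Column-to-column check for "actual > planned" using only the two
    # columns' own bounds. The file can be skipped only when every actual
    # value is <= every planned value, i.e. max(actual) <= min(planned).
    return not (f.actual_max <= f.planned_min)


def may_match_delay(f: FileStats, lo: int, hi: int) -> bool:
    # Literal range check "lo < delay <= hi" on the materialized column.
    # The file may hold a match only if [delay_min, delay_max] meets (lo, hi].
    return f.delay_max > lo and f.delay_min <= hi


# Three files sorted on the same planned-departure range but with
# different delay profiles (all values are assumptions).
files = [
    FileStats("on_time.parquet", 600, 1200, 595, 1200, -5, 0),
    FileStats("slightly_late.parquet", 600, 1200, 610, 1250, 10, 50),
    FileStats("very_late.parquet", 600, 1200, 720, 1460, 120, 260),
]

# The raw column bounds overlap in every file, so nothing is pruned.
raw_kept = [f.path for f in files if may_match_late_raw(f)]

# "Late, but not more than an hour late": 0 < delay <= 60 minutes.
delay_kept = [f.path for f in files if may_match_delay(f, 0, 60)]

print(raw_kept)    # all three files
print(delay_kept)  # only slightly_late.parquet
```

With these made-up bounds the column-to-column check keeps every file, because the planned and actual ranges overlap, while the filter on the materialized delay keeps only one. How much this helps in practice depends on how tightly the delay values cluster within each file.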

On Mon, Sep 30, 2024 at 12:56 PM Baldwin, Jennifer <jennifer.bald...@teradata.com.invalid> wrote:

> It has come to my attention that there was no attachment. I have created
> a Google Doc instead. Thanks.
>
> https://docs.google.com/document/d/1HZa3AyPPfgz9VOVA9rPhJJ8f3F-3tEel_53nIlvYlo0/edit?usp=sharing
>
> *From: *Baldwin, Jennifer <jennifer.bald...@teradata.com.INVALID>
> *Date: *Friday, September 27, 2024 at 12:54 PM
> *To: *dev@iceberg.apache.org <dev@iceberg.apache.org>
> *Cc: *jennifer.bald...@teradata.com.invalid <jennifer.bald...@teradata.com.INVALID>
> *Subject: *Re: [EXTERNAL] Re: [DISCUSS] Column to Column filtering
>
> Please see attached; I hope this provides you with more clarity on the use
> case we hope to support. Let me know if you have any further questions.
>
> *From: *Russell Spitzer <russell.spit...@gmail.com>
> *Date: *Wednesday, September 18, 2024 at 6:15 PM
> *To: *dev@iceberg.apache.org <dev@iceberg.apache.org>
> *Cc: *jennifer.bald...@teradata.com.invalid <jennifer.bald...@teradata.com.invalid>
> *Subject: *[EXTERNAL] Re: [DISCUSS] Column to Column filtering
>
> I have similar concerns to Ryan, although I could see that if we were
> writing smaller and better correlated files this could be a big help.
> Specifically with variant use cases this may be very useful. I would love
> to hear more about the use cases and rationale for adding this. Do you have
> any specific examples you can go into detail on?
>
> On Wed, Sep 18, 2024 at 4:48 PM rdb...@gmail.com <rdb...@gmail.com> wrote:
>
> I'm curious to learn more about this feature. Is there a driving use case
> that you're implementing it for? Are there common situations in which these
> filters are helpful and selective?
>
> My initial impression is that this kind of expression would have limited
> utility at the table format level. Iceberg tracks column ranges for data
> files and the primary use case for filtering is to skip data files at the
> scan planning phase. For a column-to-column comparison, you would only be
> able to eliminate data files that have non-overlapping ranges. That is, if
> you're looking for rows where x < y, you can only eliminate a file when
> min(x) >= max(y). To me, it seems unlikely that this would be generic enough
> to be worth it, but if there are use cases where this can happen and speed
> up queries I think it may make sense.
>
> Ryan
>
> On Tue, Sep 17, 2024 at 6:21 AM Baldwin, Jennifer
> <jennifer.bald...@teradata.com.invalid> wrote:
>
> I’m starting a thread to discuss a feature for comparisons using column
> references on the left and right side of an expression wherever Iceberg
> supports column reference to literal value(s) comparisons. The use case we
> want to support is filtering of date columns from a single table. For
> instance:
>
> select * from travel_table
> where expected_date > travel_date;
>
> select * from travel_table
> where payment_date <> due_date;
>
> The changes will impact row and scan file filtering. Impacted jars are
> iceberg-api, iceberg-core, iceberg-orc and iceberg-parquet.
>
> Is this a feature the Iceberg community would be willing to accept?
>
> Here is a link to a Draft PR with current changes. Thanks.
>
> https://github.com/apache/iceberg/pull/11152