It has come to my attention that there was no attachment. I have created a Google Doc instead. Thanks.
https://docs.google.com/document/d/1HZa3AyPPfgz9VOVA9rPhJJ8f3F-3tEel_53nIlvYlo0/edit?usp=sharing

From: Baldwin, Jennifer <jennifer.bald...@teradata.com.INVALID>
Date: Friday, September 27, 2024 at 12:54 PM
To: dev@iceberg.apache.org <dev@iceberg.apache.org>
Cc: jennifer.bald...@teradata.com.invalid <jennifer.bald...@teradata.com.INVALID>
Subject: Re: [EXTERNAL] Re: [DISCUSS] Column to Column filtering

Please see attached. I hope this provides you with more clarity on the use case we hope to support. Let me know if you have any further questions.

From: Russell Spitzer <russell.spit...@gmail.com>
Date: Wednesday, September 18, 2024 at 6:15 PM
To: dev@iceberg.apache.org <dev@iceberg.apache.org>
Cc: jennifer.bald...@teradata.com.invalid <jennifer.bald...@teradata.com.invalid>
Subject: [EXTERNAL] Re: [DISCUSS] Column to Column filtering

I have similar concerns to Ryan, although I could see that if we were writing smaller and better-correlated files, this could be a big help. Specifically, with variant use cases this may be very useful.

I would love to hear more about the use cases and rationale for adding this. Do you have any specific examples you can go into detail on?

On Wed, Sep 18, 2024 at 4:48 PM rdb...@gmail.com <rdb...@gmail.com> wrote:

I'm curious to learn more about this feature. Is there a driving use case that you're implementing it for? Are there common situations in which these filters are helpful and selective?

My initial impression is that this kind of expression would have limited utility at the table format level. Iceberg tracks column ranges for data files, and the primary use case for filtering is to skip data files at the scan planning phase. For a column-to-column comparison, you would only be able to eliminate data files that have non-overlapping ranges. That is, if you're looking for rows where x < y, you can only eliminate a file when min(x) >= max(y), because only then is it certain that no row in the file satisfies the predicate. To me, it seems unlikely that this would be generic enough to be worth it, but if there are use cases where this can happen and speed up queries, I think it may make sense.

Ryan

On Tue, Sep 17, 2024 at 6:21 AM Baldwin, Jennifer <jennifer.bald...@teradata.com.invalid> wrote:

I'm starting a thread to discuss a feature for comparisons using column references on the left and right side of an expression, wherever Iceberg supports comparisons of a column reference to literal value(s). The use case we want to support is filtering on date columns from a single table. For instance:

    select * from travel_table where expected_date > travel_date;
    select * from travel_table where payment_date <> due_date;

The changes will impact row and scan file filtering. Impacted jars are iceberg-api, iceberg-core, iceberg-orc and iceberg-parquet.

Is this a feature the Iceberg community would be willing to accept?

Here is a link to a draft PR with the current changes. Thanks.

https://github.com/apache/iceberg/pull/11152
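
To make the file-skipping argument in Ryan's reply concrete, here is a minimal sketch in Python of a range check for a predicate x < y. It is not Iceberg code: `ColumnRange` and `can_skip_less_than` are hypothetical names made up for illustration, the per-file bounds are invented sample values, and nulls and missing statistics are ignored. The assumption is that a file may be skipped only when no row in it can possibly match, which for x < y means min(x) >= max(y).

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ColumnRange:
    """Hypothetical per-file lower/upper bounds for one column."""
    lower: date
    upper: date


def can_skip_less_than(x: ColumnRange, y: ColumnRange) -> bool:
    """Return True when no row in the file can satisfy x < y.

    Every x value is >= x.lower and every y value is <= y.upper,
    so x.lower >= y.upper guarantees x >= y for all rows.
    """
    return x.lower >= y.upper


# `expected_date > travel_date` is the same as `travel_date < expected_date`.
# Invented bounds for three data files:

# File A: all travel dates fall after all expected dates, so it can be skipped.
file_a = can_skip_less_than(
    ColumnRange(date(2024, 5, 1), date(2024, 5, 31)),   # travel_date
    ColumnRange(date(2024, 1, 1), date(2024, 4, 30)),   # expected_date
)

# File B: the ranges overlap, so some rows might match and it must be read.
file_b = can_skip_less_than(
    ColumnRange(date(2024, 1, 1), date(2024, 6, 30)),
    ColumnRange(date(2024, 3, 1), date(2024, 9, 30)),
)

# File C: every travel date precedes every expected date, so every row
# matches and the file must be read in full.
file_c = can_skip_less_than(
    ColumnRange(date(2024, 1, 1), date(2024, 1, 31)),
    ColumnRange(date(2024, 2, 1), date(2024, 2, 28)),
)

print(file_a, file_b, file_c)
```

Only file A is prunable here. Files like B, where the two date columns span overlapping ranges, are probably the common case for a table like travel_table, which is why the benefit depends on how well the files are clustered.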