I'm curious to learn more about this feature. Is there a driving use case that you're implementing it for? Are there common situations in which these filters are helpful and selective?
My initial impression is that this kind of expression would have limited utility at the table format level. Iceberg tracks column ranges for data files, and the primary use case for filtering is to skip data files at the scan planning phase. For a column-to-column comparison, you would only be able to eliminate data files that have non-overlapping ranges, and only when the ranges are ordered so that no row can match. That is, if you're looking for rows where x < y, you can only eliminate a file when min(x) >= max(y). To me, it seems unlikely that this would be generic enough to be worth it, but if there are use cases where this can happen and speed up queries, I think it may make sense.

Ryan

On Tue, Sep 17, 2024 at 6:21 AM Baldwin, Jennifer <jennifer.bald...@teradata.com.invalid> wrote:

> I’m starting a thread to discuss a feature for comparisons using column
> references on the left and right side of an expression wherever iceberg
> supports column reference to literal value(s) comparisons. The use case we
> want to support is filtering of date columns from a single table. For
> instance:
>
> select * from travel_table
> where expected_date > travel_date;
>
> select * from travel_table
> where payment_date <> due_date;
>
> The changes will impact row and scan file filtering. Impacted jars are
> iceberg-api, iceberg-core, iceberg-orc and iceberg-parquet.
>
> Is this a feature the Iceberg community would be willing to accept?
>
> Here is a link to a Draft PR with current changes, Thanks.
>
> https://github.com/apache/iceberg/pull/11152
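
To make the file-skipping argument in the reply concrete, here is a minimal sketch of when per-file column bounds can rule out a column-to-column predicate. It is an illustration only, not Iceberg's API: `ColumnRange`, `can_skip_less_than`, and `can_skip_not_equal` are hypothetical names, the bounds are assumed to be inclusive integers, and null handling is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnRange:
    """Inclusive lower and upper bound of one column within one data file (hypothetical type)."""
    lower: int
    upper: int


def can_skip_less_than(x: ColumnRange, y: ColumnRange) -> bool:
    """Return True if the bounds prove that no row in the file satisfies x < y."""
    # Every x is >= x.lower and every y is <= y.upper, so x.lower >= y.upper
    # implies x >= y for every row in the file.
    return x.lower >= y.upper


def can_skip_not_equal(x: ColumnRange, y: ColumnRange) -> bool:
    """Return True if the bounds prove that no row in the file satisfies x <> y."""
    # Bounds alone show x == y for every row only when both columns hold
    # the same single constant value.
    return x.lower == x.upper == y.lower == y.upper


# A file where x is in [10, 20] and y is in [1, 10] cannot contain a row with x < y.
skippable = can_skip_less_than(ColumnRange(10, 20), ColumnRange(1, 10))

# Overlapping ranges give no guarantee either way, so the file must be read.
must_read = not can_skip_less_than(ColumnRange(1, 10), ColumnRange(5, 15))
```

Under these assumptions, a file is skipped only when the two ranges are disjoint (or touch at one value) in the direction that makes the predicate impossible, which is why such filters are likely to be selective only for particular data layouts.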