Re: [EXTERNAL] Re: [DISCUSS] Column to Column filtering

Baldwin, Jennifer Mon, 30 Sep 2024 12:56:47 -0700

It has come to my attention that there was no attachment.  I have created a 
Google Doc instead.  Thanks.

https://docs.google.com/document/d/1HZa3AyPPfgz9VOVA9rPhJJ8f3F-3tEel_53nIlvYlo0/edit?usp=sharing

From: Baldwin, Jennifer <jennifer.bald...@teradata.com.INVALID>
Date: Friday, September 27, 2024 at 12:54 PM
To: dev@iceberg.apache.org <dev@iceberg.apache.org>
Cc: jennifer.bald...@teradata.com.invalid 
<jennifer.bald...@teradata.com.INVALID>
Subject: Re: [EXTERNAL] Re: [DISCUSS] Column to Column filtering
Please see the attached document. I hope it gives you more clarity on the use 
case we hope to support.  Let me know if you have any further questions.

From: Russell Spitzer <russell.spit...@gmail.com>
Date: Wednesday, September 18, 2024 at 6:15 PM
To: dev@iceberg.apache.org <dev@iceberg.apache.org>
Cc: jennifer.bald...@teradata.com.invalid 
<jennifer.bald...@teradata.com.invalid>
Subject: [EXTERNAL] Re: [DISCUSS] Column to Column filtering

I have similar concerns to Ryan, although I could see this being a big help if 
we were writing smaller, better-correlated files. It may be especially useful 
for variant use cases. I would love to hear more about the use cases and 
rationale for adding this. Do you have any specific examples you can go into 
detail on?

On Wed, Sep 18, 2024 at 4:48 PM rdb...@gmail.com <rdb...@gmail.com> wrote:
I'm curious to learn more about this feature. Is there a driving use case that 
you're implementing it for? Are there common situations in which these filters 
are helpful and selective?

My initial impression is that this kind of expression would have limited 
utility at the table format level. Iceberg tracks column ranges for data files 
and the primary use case for filtering is to skip data files at the scan 
planning phase. For a column-to-column comparison, you would only be able to 
eliminate data files that have non-overlapping ranges. That is, if you're 
looking for rows where x < y, you can only eliminate a file when min(x) >= 
max(y). To me, it seems unlikely that this would be generic enough to be worth 
it, but if there are use cases where this can happen and speed up queries I 
think it may make sense.
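
As a rough illustration of the range check described above, here is a minimal 
Python sketch. It is not Iceberg code: ColumnRange and can_skip_file_lt are 
hypothetical names, the bounds are plain integers, and nulls and NaNs are 
ignored. For x < y, a file can be skipped only when no row could match, which 
is guaranteed when the lower bound of x is at least the upper bound of y.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnRange:
    """Lower and upper bound of one column within a single data file (hypothetical type)."""
    lower: int
    upper: int


def can_skip_file_lt(x: ColumnRange, y: ColumnRange) -> bool:
    """Return True when no row in the file can satisfy x < y.

    Every x value is >= x.lower and every y value is <= y.upper, so
    x.lower >= y.upper guarantees x >= y for every row.
    """
    return x.lower >= y.upper


# x lies entirely above y: no row can match, so the file can be skipped.
print(can_skip_file_lt(ColumnRange(50, 90), ColumnRange(10, 40)))  # True
# Overlapping ranges: some row might match, so the file must be read.
print(can_skip_file_lt(ColumnRange(10, 60), ColumnRange(40, 90)))  # False
# x lies entirely below y: every row matches, so the file must be read.
print(can_skip_file_lt(ColumnRange(10, 40), ColumnRange(50, 90)))  # False
```

The second and third cases show why the check is only selective when the two 
columns have disjoint ranges within a file.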

Ryan

On Tue, Sep 17, 2024 at 6:21 AM Baldwin, Jennifer 
<jennifer.bald...@teradata.com.invalid> wrote:
I’m starting a thread to discuss a feature for comparisons that use column 
references on both the left and right sides of an expression, wherever Iceberg 
supports comparisons between a column reference and literal value(s).  The use 
case we want to support is filtering on date columns within a single table.  
For instance:

select * from travel_table
where expected_date > travel_date;

select * from travel_table
where payment_date <> due_date;

The changes will impact row and scan file filtering.  Impacted jars are 
iceberg-api, iceberg-core, iceberg-orc and iceberg-parquet.
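
To make the two levels of filtering concrete for the queries above, here is a 
minimal Python sketch. It is an assumption-laden illustration, not the code in 
the draft PR: file_stats, can_skip_gt, can_skip_not_eq and row_matches_gt are 
hypothetical names, the statistics are made up, and nulls are ignored.

```python
from datetime import date

# Hypothetical per-file statistics: column name -> (min, max). Nulls are ignored.
file_stats = {
    "expected_date": (date(2024, 1, 1), date(2024, 1, 31)),
    "travel_date": (date(2024, 2, 1), date(2024, 2, 28)),
    "payment_date": (date(2024, 3, 15), date(2024, 3, 15)),
    "due_date": (date(2024, 3, 15), date(2024, 3, 15)),
}


def can_skip_gt(stats, left, right):
    """left > right can never hold when max(left) <= min(right)."""
    return stats[left][1] <= stats[right][0]


def can_skip_not_eq(stats, left, right):
    """left <> right can never hold only when both columns hold one identical value."""
    return stats[left][0] == stats[left][1] == stats[right][0] == stats[right][1]


def row_matches_gt(row, left, right):
    """Row-level check, applied to rows of files that could not be skipped."""
    return row[left] > row[right]


# Scan file filtering: both example predicates rule this file out.
print(can_skip_gt(file_stats, "expected_date", "travel_date"))  # True
print(can_skip_not_eq(file_stats, "payment_date", "due_date"))  # True
```

Files that cannot be skipped this way would still need the row-level check on 
every row.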

Is this a feature the Iceberg community would be willing to accept?

Here is a link to a draft PR with the current changes. Thanks.
https://github.com/apache/iceberg/pull/11152
