Re: [EXTERNAL] Re: [DISCUSS] Column to Column filtering

Baldwin, Jennifer Mon, 14 Oct 2024 11:23:30 -0700

Thanks everyone for your input, the use case in the document is something that 
I made up to describe how the filtering could apply. It is not a real world use 
case.  I am being careful to not share our customer’s specific use case but it 
does use dates. 😊  I’ll see what I can come up with to provide more clarity.

From: rdb...@gmail.com <rdb...@gmail.com>
Date: Friday, October 4, 2024 at 6:32 PM
To: dev@iceberg.apache.org <dev@iceberg.apache.org>
Subject: Re: [EXTERNAL] Re: [DISCUSS] Column to Column filtering
I don't think that the doc shows cases where this filtering would eliminate 
data files. The example in the doc is airline flights, where a flight has both 
a scheduled departure time and an actual departure time. The query looks for 
cases where the actual departure is later than the scheduled departure. (And 
similar for arrival times.)

In order to evaluate whether a column-to-column filter would be helpful, I 
think we need a way to think about how the data would be laid out in files. You 
could have files that are written as the complete records are created -- in 
that case the records would be basically clustered by actual arrival time. Or 
you might recluster the data to organize it by flight number or by some 
departure time.

Assuming the first case, the arrival times of a data file would be clustered 
into a small period of time, but the flights could be of varying duration. So 
the scheduled departure times could be all over the place, from 8 hours ago to 
1 hour ago. Similarly, the actual departure times would have a wide range. As a 
result, it's likely that those ranges would overlap and a column-to-column 
filter would not be able to select files. For the arrival time query, there's a 
similar problem. Assuming most flights are on time, the arrival time and 
scheduled arrival time (upper bound) would be similar for at least one flight, 
so the ranges would likely overlap and the expression wouldn't be helpful again.

If I think about clustering the data by departure time, the same logic applies 
and I don't think there would be much value to the expression. If you were to 
cluster by other columns, then you'd get the same conclusion. Clustering by 
flight number, for example, would more randomly distribute flights and result 
in wider overall ranges that still overlap.

The only case in which I think this might work is if the data were clustered by 
a derived delay value -- but wouldn't it be easier to simply calculate that 
value as a column and filter by it directly?

I definitely see the value of column-to-column filters for rows, but in a table 
format it doesn't seem to me like it would result in much filtering. Maybe 
there's a different use case?

On Thu, Oct 3, 2024 at 11:41 PM Benny Chow 
<btc...@gmail.com<mailto:btc...@gmail.com>> wrote:
Assuming the table contained smaller and better correlated files, I think a 
workaround where you materialized the timestamp difference between two columns 
could be effective for data file pruning.  So if a particular planned departure 
date was associated with a lot of delays and the table was partitioned by 
destination_cd and sorted by planned departure date, materializing the diff 
between planned departure date and actual departure date will result in a 
single field with min/max bounds that could be filtered on.  You could then get 
data file pruning for a filter like late departure but not more than an hour 
late.

On Mon, Sep 30, 2024 at 12:56 PM Baldwin, Jennifer 
<jennifer.bald...@teradata.com.invalid> wrote:
It has come to my attention that there was no attachment.  I have created 
google doc instead.  Thanks.

https://docs.google.com/document/d/1HZa3AyPPfgz9VOVA9rPhJJ8f3F-3tEel_53nIlvYlo0/edit?usp=sharing

From: Baldwin, Jennifer <jennifer.bald...@teradata.com.INVALID>
Date: Friday, September 27, 2024 at 12:54 PM
To: dev@iceberg.apache.org<mailto:dev@iceberg.apache.org> 
<dev@iceberg.apache.org<mailto:dev@iceberg.apache.org>>
Cc: jennifer.bald...@teradata.com.invalid 
<jennifer.bald...@teradata.com.INVALID>
Subject: Re: [EXTERNAL] Re: [DISCUSS] Column to Column filtering
You don't often get email from jennifer.bald...@teradata.com.invalid. Learn why 
this is important<https://aka.ms/LearnAboutSenderIdentification>
Please see attached, I hope this provides you with more clarity on the use case 
we hope to support.  Let me know if you have any further questions

From: Russell Spitzer 
<russell.spit...@gmail.com<mailto:russell.spit...@gmail.com>>
Date: Wednesday, September 18, 2024 at 6:15 PM
To: dev@iceberg.apache.org<mailto:dev@iceberg.apache.org> 
<dev@iceberg.apache.org<mailto:dev@iceberg.apache.org>>
Cc: jennifer.bald...@teradata.com.invalid 
<jennifer.bald...@teradata.com.invalid>
Subject: [EXTERNAL] Re: [DISCUSS] Column to Column filtering
[CAUTION: External Email]

I have similar concerns to Ryan although I could see that if we were writing 
smaller and better correlated files that this could be a big help. Specifically 
with variant use cases this may be very useful. I would love to hear more about 
the use cases and rationale for adding this. Do you have any specific examples 
you can go into detail on?

On Wed, Sep 18, 2024 at 4:48 PM rdb...@gmail.com<mailto:rdb...@gmail.com> 
<rdb...@gmail.com<mailto:rdb...@gmail.com>> wrote:
I'm curious to learn more about this feature. Is there a driving use case that 
you're implementing it for? Are there common situations in which these filters 
are helpful and selective?

My initial impression is that this kind of expression would have limited 
utility at the table format level. Iceberg tracks column ranges for data files 
and the primary use case for filtering is to skip data files at the scan 
planning phase. For a column-to-column comparison, you would only be able to 
eliminate data files that have non-overlapping ranges. That is, if you're 
looking for rows where x < y, you can only eliminate a file when max(x) < 
min(y). To me, it seems unlikely that this would be generic enough to be worth 
it, but if there are use cases where this can happen and speed up queries I 
think it may make sense.

Ryan

On Tue, Sep 17, 2024 at 6:21 AM Baldwin, Jennifer 
<jennifer.bald...@teradata.com.invalid> wrote:
I’m starting a thread to discuss a feature for comparisons using column 
references on the left and right side of an expression wherever iceberg 
supports column reference to literal value(s) comparisons.  The use case we 
want to support is filtering of date columns from a single table.  For instance:

select * from travel_table
where expected_date > travel_date;

select * from travel_table
where payment_date <>  due_date;

The changes will impact row and scan file filtering.  Impacted jars are 
iceberg-api, iceberg-core, iceberg-orc and iceberg-parquet.

Is this a feature the Iceberg community would be willing to accept?

Here is a link to a Draft PR with current changes, Thanks.
https://github.com/apache/iceberg/pull/11152

Re: [EXTERNAL] Re: [DISCUSS] Column to Column filtering

Reply via email to