I don't think that the doc shows cases where this filtering would eliminate
data files. The example in the doc is airline flights, where a flight has
both a scheduled departure time and an actual departure time. The query
looks for cases where the actual departure is later than the scheduled
departure. (And similar for arrival times.)

In order to evaluate whether a column-to-column filter would be helpful, I
think we need a way to think about how the data would be laid out in files.
You could have files that are written as the complete records are created
-- in that case the records would be basically clustered by actual arrival
time. Or you might recluster the data to organize it by flight number or by
some departure time.

Assuming the first case, the arrival times of a data file would be
clustered into a small period of time, but the flights could be of varying
duration. So the scheduled departure times could be all over the place,
from 8 hours ago to 1 hour ago. Similarly, the actual departure times would
have a wide range. As a result, it's likely that those ranges would overlap
and a column-to-column filter would not be able to select files. For the
arrival time query, there's a similar problem. Assuming most flights are on
time, the arrival time and scheduled arrival time (upper bound) would be
similar for at least one flight, so the ranges would likely overlap and the
expression wouldn't be helpful again.

If I think about clustering the data by departure time, the same logic
applies and I don't think there would be much value to the expression. If
you were to cluster by other columns, then you'd get the same conclusion.
Clustering by flight number, for example, would more randomly distribute
flights and result in wider overall ranges that still overlap.

The only case in which I think this might work is if the data were
clustered by a derived delay value -- but wouldn't it be easier to simply
calculate that value as a column and filter by it directly?

I definitely see the value of column-to-column filters for rows, but in a
table format it doesn't seem to me like it would result in much filtering.
Maybe there's a different use case?

On Thu, Oct 3, 2024 at 11:41 PM Benny Chow <btc...@gmail.com> wrote:

> Assuming the table contained smaller and better correlated files, I think
> a workaround where you materialized the timestamp difference between two
> columns could be effective for data file pruning.  So if a particular
> planned departure date was associated with a lot of delays and the table
> was partitioned by destination_cd and sorted by planned departure date,
> materializing the diff between planned departure date and actual departure
> date will result in a single field with min/max bounds that could be
> filtered on.  You could then get data file pruning for a filter like late
> departure but not more than an hour late.
>
> On Mon, Sep 30, 2024 at 12:56 PM Baldwin, Jennifer
> <jennifer.bald...@teradata.com.invalid> wrote:
>
>> It has come to my attention that there was no attachment.  I have created
>> google doc instead.  Thanks.
>>
>>
>>
>>
>> https://docs.google.com/document/d/1HZa3AyPPfgz9VOVA9rPhJJ8f3F-3tEel_53nIlvYlo0/edit?usp=sharing
>>
>>
>>
>> *From: *Baldwin, Jennifer <jennifer.bald...@teradata.com.INVALID>
>> *Date: *Friday, September 27, 2024 at 12:54 PM
>> *To: *dev@iceberg.apache.org <dev@iceberg.apache.org>
>> *Cc: *jennifer.bald...@teradata.com.invalid
>> <jennifer.bald...@teradata.com.INVALID>
>> *Subject: *Re: [EXTERNAL] Re: [DISCUSS] Column to Column filtering
>>
>> You don't often get email from jennifer.bald...@teradata.com.invalid. Learn
>> why this is important <https://aka.ms/LearnAboutSenderIdentification>
>>
>> Please see attached, I hope this provides you with more clarity on the
>> use case we hope to support.  Let me know if you have any further questions
>>
>>
>>
>> *From: *Russell Spitzer <russell.spit...@gmail.com>
>> *Date: *Wednesday, September 18, 2024 at 6:15 PM
>> *To: *dev@iceberg.apache.org <dev@iceberg.apache.org>
>> *Cc: *jennifer.bald...@teradata.com.invalid
>> <jennifer.bald...@teradata.com.invalid>
>> *Subject: *[EXTERNAL] Re: [DISCUSS] Column to Column filtering
>>
>> [CAUTION: External Email]
>>
>>
>>
>> I have similar concerns to Ryan although I could see that if we were
>> writing smaller and better correlated files that this could be a big help.
>> Specifically with variant use cases this may be very useful. I would love
>> to hear more about the use cases and rationale for adding this. Do you have
>> any specific examples you can go into detail on?
>>
>>
>>
>> On Wed, Sep 18, 2024 at 4:48 PM rdb...@gmail.com <rdb...@gmail.com>
>> wrote:
>>
>> I'm curious to learn more about this feature. Is there a driving use case
>> that you're implementing it for? Are there common situations in which these
>> filters are helpful and selective?
>>
>>
>>
>> My initial impression is that this kind of expression would have limited
>> utility at the table format level. Iceberg tracks column ranges for data
>> files and the primary use case for filtering is to skip data files at the
>> scan planning phase. For a column-to-column comparison, you would only be
>> able to eliminate data files that have non-overlapping ranges. That is, if
>> you're looking for rows where x < y, you can only eliminate a file when
>> max(x) < min(y). To me, it seems unlikely that this would be generic enough
>> to be worth it, but if there are use cases where this can happen and speed
>> up queries I think it may make sense.
>>
>>
>>
>> Ryan
>>
>>
>>
>> On Tue, Sep 17, 2024 at 6:21 AM Baldwin, Jennifer
>> <jennifer.bald...@teradata.com.invalid> wrote:
>>
>> I’m starting a thread to discuss a feature for comparisons using column
>> references on the left and right side of an expression wherever iceberg
>> supports column reference to literal value(s) comparisons.  The use case we
>> want to support is filtering of date columns from a single table.  For
>> instance:
>>
>>
>>
>> select * from travel_table
>>
>> where expected_date > travel_date;
>>
>>
>>
>> select * from travel_table
>>
>> where payment_date <>  due_date;
>>
>>
>>
>>
>>
>> The changes will impact row and scan file filtering.  Impacted jars are
>> iceberg-api, iceberg-core, iceberg-orc and iceberg-parquet.
>>
>>
>>
>> Is this a feature the Iceberg community would be willing to accept?
>>
>>
>>
>> Here is a link to a Draft PR with current changes, Thanks.
>>
>> https://github.com/apache/iceberg/pull/11152
>>
>>
>>
>>

Reply via email to