Hi Renije,

> *1. File format filters*
>
> Do the filters include filter expressions from both the user query and
> the delete filter?

The current discussion is about the filters from the user query.

About the delete filter:
Based on the suggestions on the PR, I have moved the delete filter out of
the main API and created a `SupportsDeleteFilter` interface for it. This
would allow pushing the filter down to the Parquet vectorized readers in
Spark, which is currently the only place where this feature is implemented.
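
For illustration, a minimal sketch of how such a mix-in interface could look. The name `SupportsDeleteFilter` comes from the thread; the generics and method shape below are my assumptions, not the final API from the PR:

```java
// Hypothetical sketch only: the exact generics and method signature are
// assumptions based on the mailing-list description, not the merged API.
public interface SupportsDeleteFilter<F> {
  // Engines that materialize delete filters (currently only Spark's
  // Parquet vectorized read path) can push the filter into the file
  // format reader through this hook.
  void deleteFilter(F filter);
}
```

Reader builders that can apply delete filters would implement this interface, while other readers remain untouched, keeping the concern out of the main API.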


Renjie Liu <liurenjie2...@gmail.com> wrote (on Fri, 21 Mar 2025, 14:11):

> Hi, Peter:
>
> Thanks for the effort on this.
>
> *1. File format filters*
>
> Do the filters include filter expressions from both the user query and
> the delete filter?
>
> For filters from user query, I agree with you that we should keep the
> current behavior.
>
> For delete filters associated with data files, at first I thought file
> format readers should not care about them. But now I realize that maybe we
> need to push them to the file readers as well; this is useful when the
> `IS_DELETED` metadata column is not needed and we could use these filters
> (position deletes, etc.) to further prune data.
>
> But anyway, I agree that we could postpone it to a follow-up PR.
>
> 2. *Batch size configuration*
>
> I'm leaning toward option 2.
>
> 3. *Spark configuration*
>
> I'm leaning towards using different configuration objects.
>
>
>
> On Thu, Mar 20, 2025 at 10:23 PM Péter Váry <peter.vary.apa...@gmail.com>
> wrote:
>
>> Hi Team,
>> Thanks everyone for the reviews on
>> https://github.com/apache/iceberg/pull/12298!
>> I have addressed most of the comments, but a few questions remain which
>> might merit a wider audience:
>>
>>    1. We should decide on the expected filtering behavior when the
>>    filters are pushed down to the readers. Currently the filters are
>>    applied on a best-effort basis by the file format readers; some readers
>>    (Avro) just skip them altogether. There was a suggestion on the PR to
>>    enforce stricter requirements: the readers would either reject part of
>>    the filters or apply them fully.
>>    2. Batch sizes are currently parameters of the reader builders, so
>>    they can be set for non-vectorized readers too, which could be
>>    confusing.
>>    3. Currently the Spark batch reader uses separate configuration
>>    objects, ParquetBatchReadConf and OrcBatchReadConf, as requested by the
>>    reviewers of the Comet PR. There was a suggestion on the current PR to
>>    use a common configuration object instead.
>>
>> I would be interested in hearing your thoughts about these topics.
>>
>> My current take:
>>
>>    1. *File format filters*: I am leaning towards keeping the current
>>    lenient behavior, especially since Bloom filters cannot do full
>>    filtering and are often used as a way to filter out unwanted records.
>>    Another option would be to implement a secondary filtering step inside
>>    the file formats themselves, which I think would add extra complexity
>>    and possible code duplication. Whatever the decision here, I would
>>    suggest moving this out to a follow-up PR, as the current changeset is
>>    big enough as it is.
>>    2. *Batch size configuration*: Currently this is the only property
>>    which differs between the batch readers and the non-vectorized readers.
>>    I see 3 possible solutions:
>>       1. Create different builders for vectorized and non-vectorized
>>       reads - I don't think the current solution is confusing enough to be
>>       worth the extra class
>>       2. Put this into the reader configuration property set - This could
>>       work, but it would "hide" a configuration option which is valid for
>>       both the Parquet and ORC readers
>>       3. Keep things as they are now - I would choose this one, but I
>>       don't have a strong opinion here
>>    3. *Spark configuration*: TBH, I'm open to both solutions and happy to
>>    move in the direction the community decides on
>>
>> Thanks,
>> Peter
>>
>> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Fri, 14 Mar 2025,
>> 16:31):
>>
>>> Hi Peter
>>>
>>> Thanks for the update. I will do a new pass on the PR.
>>>
>>> Regards
>>> JB
>>>
>>> On Thu, Mar 13, 2025 at 1:16 PM Péter Váry <peter.vary.apa...@gmail.com>
>>> wrote:
>>> >
>>> > Hi Team,
>>> > I have rebased the File Format API proposal (
>>> https://github.com/apache/iceberg/pull/12298) to include the new
>>> changes needed for the Variant types. I would love to hear your feedback,
>>> especially Dan and Ryan, as you were the most active during our
>>> discussions. If I can help in any way to make the review easier, please let
>>> me know.
>>> > Thanks,
>>> > Peter
>>> >
>>> > Péter Váry <peter.vary.apa...@gmail.com> wrote (on Fri, 28 Feb 2025,
>>> 17:50):
>>> >>
>>> >> Hi everyone,
>>> >> Thanks for all of the actionable, relevant feedback on the PR (
>>> https://github.com/apache/iceberg/pull/12298).
>>> >> Updated the code to address most of them. Please check if you agree
>>> with the general approach.
>>> >> If there is a consensus about the general approach, I could separate
>>> the PR into smaller pieces so we can have an easier time reviewing and
>>> merging them step by step.
>>> >> Thanks,
>>> >> Peter
>>> >>
>>> >> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Thu, 20 Feb 2025,
>>> 14:14):
>>> >>>
>>> >>> Hi Peter
>>> >>>
>>> >>> sorry for the late reply on this.
>>> >>>
>>> >>> I did a pass on the proposal; it's very interesting and well written.
>>> >>> I like the DataFile API, and it's definitely worth discussing all together.
>>> >>>
>>> >>> Maybe we can schedule a specific meeting to discuss the DataFile
>>> API?
>>> >>>
>>> >>> Thoughts ?
>>> >>>
>>> >>> Regards
>>> >>> JB
>>> >>>
>>> >>> On Tue, Feb 11, 2025 at 5:46 PM Péter Váry <
>>> peter.vary.apa...@gmail.com> wrote:
>>> >>> >
>>> >>> > Hi Team,
>>> >>> >
>>> >>> > As mentioned earlier on our Community Sync I am exploring the
>>> possibility to define a FileFormat API for accessing different file
>>> formats. I have put together a proposal based on my findings.
>>> >>> >
>>> >>> > -------------------
>>> >>> > Iceberg currently supports 3 different file formats: Avro,
>>> Parquet, and ORC. With the introduction of the Iceberg V3 specification,
>>> many new features have been added to Iceberg. Some of these features,
>>> like new column types and default values, require changes at the file
>>> format level. The changes are added by individual developers with
>>> different focus on the different file formats. As a result, not all of
>>> the features are available for every supported file format.
>>> >>> > Also, there are emerging file formats like Vortex [1] or Lance [2]
>>> which, either by specialization or by applying newer research results,
>>> could provide better alternatives for certain use cases like random data
>>> access or storing ML models.
>>> >>> > -------------------
>>> >>> >
>>> >>> > Please check the detailed proposal [3] and the google document
>>> [4], and comment there or reply on the dev list if you have any suggestions.
>>> >>> >
>>> >>> > Thanks,
>>> >>> > Peter
>>> >>> >
>>> >>> > [1] - https://github.com/spiraldb/vortex
>>> >>> > [2] - https://lancedb.github.io/lance/
>>> >>> > [3] - https://github.com/apache/iceberg/issues/12225
>>> >>> > [4] -
>>> https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds
>>> >>> >
>>>
>>
