Re: [DISCUSS] FileFormat API proposal

Renjie Liu Fri, 21 Mar 2025 16:01:48 -0700

Hi, Peter:

Thanks for the effort on this.


*1. **File format filters*

Do the filters include both filter expressions from both user query and
delete filter?

For filters from user query, I agree with you that we should keep the
current behavior.

For delete filters associated with data files, at first I thought file
format readers should not care about this. But now I realized that maybe we
need to also push it to file reader, this is useful when `IS_DELETED`
metadata column is not necessary and we could use these filters (position
deletes, etc) to further prune data.

But anyway, I agree that we could postpone it in follow up pr.

2. *Batch size configuration*

I'm leaning toward option 2.

3. *Spark configuration*

I'm leaning towards using different configuration objects.



On Thu, Mar 20, 2025 at 10:23 PM Péter Váry <peter.vary.apa...@gmail.com>
wrote:

> Hi Team,
> Thanks everyone for the reviews on
> https://github.com/apache/iceberg/pull/12298!
> I have addressed most of comments, but a few questions still remain which
> might merit a bit wider audience:
>
>    1. We should decide on the expected filtering behavior when the
>    filters are pushed down to the readers. Currently the filters are applied
>    as best effort for the file format readers. Some readers (Avro) just skip
>    them altogether. There was a suggestion on the PR that we might enforce
>    more strict requirements and the readers either reject part of the filters,
>    or they could apply them fully.
>    2. Batch sizes are currently parameters for the reader builders which
>    could be set for non-vectorized readers too which could be confusing.
>    3. Currently the Spark batch reader uses different configuration
>    objects for ParquetBatchReadConf and OrcBatchReadConf as requested by the
>    reviewers of the Comet PR. There was a suggestion on the current PR to use
>    a common configuration instead.
>
> I would be interested in hearing your thoughts about these topics.
>
> My current take:
>
>    1. *File format filters*: I am leaning towards keeping the current
>    laninet behavior. Especially since Bloom filters are not able to do a full
>    filtering, and are often used as a way to filter out unwanted records.
>    Another option would be to implement a secondary filtering inside the file
>    formats themselves which I think would cause extra complexity, and possible
>    code duplication. Whatever the decision here, I would suggest moving this
>    out to a next PR as the current changeset is big enough as it is.
>    2. *Batch size configuration*: Currently this is the only property
>    which is different in the batch readers and the non-vectorized readers. I
>    see 3 possible solutions:
>       1. Create different builders for vectorized and non-vectorized
>       reads - I don't think the current solution is confusing enough to worth 
> the
>       extra class
>       2. We could put this to the reader configuration property set -
>       This could work, but "hide" the possible configuration mode which is 
> valid
>       for both Parquet and ORC readers
>       3. We could keep things as it is now - I would chose this one, but
>       I don't have a strong opinion here
>    3. *Spark configuration*: TBH, I'm open to bot solution and happy to
>    move to the direction the community decides on
>
> Thanks,
> Peter
>
> Jean-Baptiste Onofré <j...@nanthrax.net> ezt írta (időpont: 2025. márc.
> 14., P, 16:31):
>
>> Hi Peter
>>
>> Thanks for the update. I will do a new pass on the PR.
>>
>> Regards
>> JB
>>
>> On Thu, Mar 13, 2025 at 1:16 PM Péter Váry <peter.vary.apa...@gmail.com>
>> wrote:
>> >
>> > Hi Team,
>> > I have rebased the File Format API proposal (
>> https://github.com/apache/iceberg/pull/12298) to include the new changes
>> needed for the Variant types. I would love to hear your feedback,
>> especially Dan and Ryan, as you were the most active during our
>> discussions. If I can help in any way to make the review easier, please let
>> me know.
>> > Thanks,
>> > Peter
>> >
>> > Péter Váry <peter.vary.apa...@gmail.com> ezt írta (időpont: 2025.
>> febr. 28., P, 17:50):
>> >>
>> >> Hi everyone,
>> >> Thanks for all of the actionable, relevant feedback on the PR (
>> https://github.com/apache/iceberg/pull/12298).
>> >> Updated the code to address most of them. Please check if you agree
>> with the general approach.
>> >> If there is a consensus about the general approach, I could. separate
>> out the PR to smaller pieces so we can have an easier time to review and
>> merge those step-by-step.
>> >> Thanks,
>> >> Peter
>> >>
>> >> Jean-Baptiste Onofré <j...@nanthrax.net> ezt írta (időpont: 2025. febr.
>> 20., Cs, 14:14):
>> >>>
>> >>> Hi Peter
>> >>>
>> >>> sorry for the late reply on this.
>> >>>
>> >>> I did a pass on the proposal, it's very interesting and well written.
>> >>> I like the DataFile API and definitely worth to discuss all together.
>> >>>
>> >>> Maybe we can schedule a specific meeting to discuss about DataFile
>> API ?
>> >>>
>> >>> Thoughts ?
>> >>>
>> >>> Regards
>> >>> JB
>> >>>
>> >>> On Tue, Feb 11, 2025 at 5:46 PM Péter Váry <
>> peter.vary.apa...@gmail.com> wrote:
>> >>> >
>> >>> > Hi Team,
>> >>> >
>> >>> > As mentioned earlier on our Community Sync I am exploring the
>> possibility to define a FileFormat API for accessing different file
>> formats. I have put together a proposal based on my findings.
>> >>> >
>> >>> > -------------------
>> >>> > Iceberg currently supports 3 different file formats: Avro, Parquet,
>> ORC. With the introduction of Iceberg V3 specification many new features
>> are added to Iceberg. Some of these features like new column types, default
>> values require changes at the file format level. The changes are added by
>> individual developers with different focus on the different file formats.
>> As a result not all of the features are available for every supported file
>> format.
>> >>> > Also there are emerging file formats like Vortex [1] or Lance [2]
>> which either by specialization, or by applying newer research results could
>> provide better alternatives for certain use-cases like random access for
>> data, or storing ML models.
>> >>> > -------------------
>> >>> >
>> >>> > Please check the detailed proposal [3] and the google document [4],
>> and comment there or reply on the dev list if you have any suggestions.
>> >>> >
>> >>> > Thanks,
>> >>> > Peter
>> >>> >
>> >>> > [1] - https://github.com/spiraldb/vortex
>> >>> > [2] - https://lancedb.github.io/lance/
>> >>> > [3] - https://github.com/apache/iceberg/issues/12225
>> >>> > [4] -
>> https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds
>> >>> >
>>
>

Re: [DISCUSS] FileFormat API proposal

Reply via email to