Hi Renjie, The first one for the proposed new API is here: https://github.com/apache/iceberg/pull/12774 Thanks, Peter
On Wed, Apr 16, 2025, 05:40 Renjie Liu <liurenjie2...@gmail.com> wrote: > Hi, Peter: > > Thanks for the effort. I totally agree with splitting them into smaller > PRs to move forward. > > I'm quite interested in this topic, so please ping me on those split > PRs and I'll help review. > > On Mon, Apr 14, 2025 at 11:22 PM Jean-Baptiste Onofré <j...@nanthrax.net> > wrote: > >> Hi Peter >> >> Awesome! Thank you so much! >> I will do a new pass. >> >> Regards >> JB >> >> On Fri, Apr 11, 2025 at 3:48 PM Péter Váry <peter.vary.apa...@gmail.com> >> wrote: >> > >> > Hi JB, >> > >> > I separated out the proposed interfaces into a new PR: >> https://github.com/apache/iceberg/pull/12774. >> > Reviewers can check that out if they are only interested in how the new >> API would look. >> > >> > Thanks, >> > Peter >> > >> > Jean-Baptiste Onofré <j...@nanthrax.net> wrote (Apr 10, 2025, Thu 18:25): >> >> >> >> Hi Peter >> >> >> >> Thanks for the ping about the PR. >> >> >> >> Maybe, to facilitate the review and move forward faster, we should >> >> split the PR into smaller PRs: >> >> - one with the interfaces (ReadBuilder, AppenderBuilder, ObjectModel, >> >> AppenderBuilder, DataWriterBuilder, ...) >> >> - one for each file format provider (Parquet, Avro, ORC) >> >> >> >> Thoughts? I can help with the split if needed. >> >> >> >> Regards >> >> JB >> >> >> >> On Thu, Apr 10, 2025 at 5:16 AM Péter Váry < >> peter.vary.apa...@gmail.com> wrote: >> >> > >> >> > Since the 1.9.0 release candidate has been created, I would like to >> resurrect this PR: https://github.com/apache/iceberg/pull/12298 to >> ensure that we have as long a testing period as possible for it.
>> >> > >> >> > To recap, here is what the PR does after the review rounds: >> >> > >> >> > Created 3 interface classes which are implemented by the file >> formats: >> >> > >> >> > ReadBuilder - Builder for reading data from data files >> >> > AppenderBuilder - Builder for writing data to data files >> >> > ObjectModel - Provides ReadBuilders and AppenderBuilders for a >> specific data file format and object model pair >> >> > >> >> > Updated the Parquet, Avro and ORC implementations for these interfaces, >> and deprecated the old reader/writer APIs >> >> > Created interface classes which will be used by the actual >> readers/writers of the data files: >> >> > >> >> > AppenderBuilder - Builder for writing a file >> >> > DataWriterBuilder - Builder for generating a data file >> >> > PositionDeleteWriterBuilder - Builder for generating a position >> delete file >> >> > EqualityDeleteWriterBuilder - Builder for generating an equality >> delete file >> >> > No ReadBuilder here - the file format reader builder is reused >> >> > >> >> > Created a WriterBuilder class which implements the interfaces above >> (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) >> based on a provided file format specific AppenderBuilder >> >> > Created an ObjectModelRegistry which stores the available >> ObjectModels, and from which engines and users can request the readers (ReadBuilder) >> and writers >> (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder).
>> >> > Created the appropriate ObjectModels: >> >> > >> >> > GenericObjectModels - for reading and writing Iceberg Records >> >> > SparkObjectModels - for reading (vectorized and non-vectorized) and >> writing Spark InternalRow/ColumnarBatch objects >> >> > FlinkObjectModels - for reading and writing Flink RowData objects >> >> > An Arrow object model is also registered for vectorized reads of >> Parquet files into Arrow ColumnarBatch objects >> >> > >> >> > Updated the production code where the reading and writing happens to >> use the ObjectModelRegistry and the new reader/writer interfaces to access >> data files >> >> > Kept the testing code intact to ensure that the new API/code does not >> break anything >> >> > >> >> > The original change was not small, and it grew substantially during the >> review rounds. So if you have questions, or if I can do anything to make the >> review easier, don't hesitate to ask. I am happy to do anything to move >> this forward. >> >> > >> >> > Thanks, >> >> > Peter >> >> > >> >> > Péter Váry <peter.vary.apa...@gmail.com> wrote (Mar 26, 2025, Wed 14:54): >> >> >> >> >> >> Hi everyone, >> >> >> >> >> >> I have updated the File Format API PR ( >> https://github.com/apache/iceberg/pull/12298) based on the answers and >> review comments. >> >> >> >> >> >> I would like to merge this only after the 1.9.0 release so we have >> more time to find and solve any issues before this reaches users in a release. >> >> >> >> >> >> For this I have updated the deprecation comments accordingly. >> >> >> I would like to ask you to review the PR, so we can iron out any >> requested changes and be ready for the merge as soon as possible >> after the 1.9.0 release. >> >> >> >> >> >> Thanks, >> >> >> Peter >> >> >> >> >> >> Péter Váry <peter.vary.apa...@gmail.com> wrote (Mar 21, 2025, Fri 14:32): >> >> >>> >> >> >>> Hi Renjie, >> >> >>> >> >> >>> > 1.
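The builder/ObjectModel split recapped above can be sketched roughly as follows. This is a deliberately simplified, hypothetical shape for the discussion, not the actual signatures from PR #12298: records are plain Strings and "files" are in-memory lists here, while the real interfaces work with Iceberg schemas, expressions, and InputFile/OutputFile abstractions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of the format-level builder split described above.
public class FileFormatApiSketch {

  // Format-level builder creating readers for data files.
  interface ReadBuilder<D> {
    ReadBuilder<D> filter(Predicate<D> expr); // best-effort pushdown
    List<D> build();                          // stand-in for the reader
  }

  // Format-level builder creating appenders for data files.
  interface AppenderBuilder<D> {
    AppenderBuilder<D> add(D record);         // stand-in for writing a record
    List<D> build();                          // the written "file"
  }

  // Pairs a file format with an engine object model and hands out builders.
  interface ObjectModel<D> {
    String format();                          // e.g. "parquet"
    String name();                            // e.g. "generic", "spark", "flink"
    ReadBuilder<D> readBuilder(List<D> inputFile);
    AppenderBuilder<D> appenderBuilder(List<D> outputFile);
  }

  // Trivial in-memory model: reads and writes Strings, applying the filter eagerly.
  static class InMemoryModel implements ObjectModel<String> {
    @Override public String format() { return "in-memory"; }
    @Override public String name() { return "generic"; }

    @Override public ReadBuilder<String> readBuilder(List<String> inputFile) {
      return new ReadBuilder<String>() {
        private Predicate<String> filter = record -> true;
        @Override public ReadBuilder<String> filter(Predicate<String> expr) {
          this.filter = expr;
          return this;
        }
        @Override public List<String> build() {
          List<String> matches = new ArrayList<>();
          for (String record : inputFile) {
            if (filter.test(record)) {
              matches.add(record);
            }
          }
          return matches;
        }
      };
    }

    @Override public AppenderBuilder<String> appenderBuilder(List<String> outputFile) {
      return new AppenderBuilder<String>() {
        @Override public AppenderBuilder<String> add(String record) {
          outputFile.add(record);
          return this;
        }
        @Override public List<String> build() { return outputFile; }
      };
    }
  }

  public static void main(String[] args) {
    ObjectModel<String> model = new InMemoryModel();

    List<String> file = new ArrayList<>();
    model.appenderBuilder(file).add("a1").add("b2").add("a3").build();

    List<String> matches =
        model.readBuilder(file).filter(s -> s.startsWith("a")).build();
    System.out.println(matches); // rows surviving the pushed-down filter
  }
}
```

The point of the split is that a file format only has to provide the two builders per object model; all writer-level variants (data, position delete, equality delete) are layered on top of the single format-level AppenderBuilder.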
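The ObjectModelRegistry lookup pattern could look roughly like the sketch below. The class name comes from the thread, but the key shape, method names, and the Supplier-based factories are simplified stand-ins, not the PR's API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Simplified sketch of a registry keyed by (file format, object model name),
// from which engines request reader/writer builders. Builders are modeled as
// plain Supplier<String> factories to keep the example self-contained.
public class ObjectModelRegistrySketch {

  private static final Map<String, Supplier<String>> MODELS = new HashMap<>();

  private static String key(String format, String modelName) {
    return format + "/" + modelName;
  }

  // Called once per (format, object model) pair, e.g. at class initialization.
  public static void register(String format, String modelName, Supplier<String> factory) {
    MODELS.put(key(format, modelName), factory);
  }

  // Engines look up the builder factory matching their object model.
  public static Supplier<String> lookup(String format, String modelName) {
    Supplier<String> factory = MODELS.get(key(format, modelName));
    if (factory == null) {
      throw new IllegalArgumentException(
          "No object model " + modelName + " registered for format " + format);
    }
    return factory;
  }

  public static void main(String[] args) {
    // Registrations mirroring the ones listed above (descriptions are illustrative).
    register("parquet", "generic", () -> "parquet reader for Iceberg Records");
    register("parquet", "spark", () -> "parquet reader for Spark InternalRow");
    register("avro", "flink", () -> "avro reader for Flink RowData");

    System.out.println(lookup("parquet", "spark").get());
  }
}
```

A registry like this is what lets production code ask for "Parquet for Spark" or "Avro for Flink" without hard-wiring format-specific reader classes at the call sites.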
File format filters >> >> >>> > >> >> >>> > Do the filters include filter expressions from both the user >> query and the delete filter? >> >> >>> >> >> >>> The current discussion is about the filters from the user query. >> >> >>> >> >> >>> About the delete filter: >> >> >>> Based on the suggestions on the PR, I have moved the delete filter >> out of the main API. I created a `SupportsDeleteFilter` interface for it, >> which would allow pushing down the filter to Parquet vectorized readers >> in Spark, as this is the only place where we have currently implemented this >> feature. >> >> >>> >> >> >>> >> >> >>> Renjie Liu <liurenjie2...@gmail.com> wrote (Mar 21, 2025, Fri 14:11): >> >> >>>> >> >> >>>> Hi, Peter: >> >> >>>> >> >> >>>> Thanks for the effort on this. >> >> >>>> >> >> >>>> 1. File format filters >> >> >>>> >> >> >>>> Do the filters include filter expressions from both the user >> query and the delete filter? >> >> >>>> >> >> >>>> For filters from the user query, I agree with you that we should keep >> the current behavior. >> >> >>>> >> >> >>>> For delete filters associated with data files, at first I thought >> file format readers should not care about them. But now I realize that >> maybe we need to push them to the file reader as well; this is useful when >> the `IS_DELETED` metadata column is not necessary and we could use these >> filters (position deletes, etc.) to further prune data. >> >> >>>> >> >> >>>> But anyway, I agree that we could postpone it to a follow-up PR. >> >> >>>> >> >> >>>> 2. Batch size configuration >> >> >>>> >> >> >>>> I'm leaning toward option 2. >> >> >>>> >> >> >>>> 3. Spark configuration >> >> >>>> >> >> >>>> I'm leaning towards using different configuration objects. >> >> >>>> >> >> >>>> >> >> >>>> >> >> >>>> On Thu, Mar 20, 2025 at 10:23 PM Péter Váry < >> peter.vary.apa...@gmail.com> wrote: >> >> >>>>> >> >> >>>>> Hi Team, >> >> >>>>> Thanks everyone for the reviews on >> https://github.com/apache/iceberg/pull/12298!
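The `SupportsDeleteFilter` mix-in mentioned above could take a shape like the following. Only the interface name comes from the thread; the `deleteFilter` signature, the `IntPredicate` payload, and the instanceof-probing are illustrative assumptions.

```java
import java.util.function.IntPredicate;

// Sketch of an optional capability interface for reader builders. A builder
// that can apply delete filters internally implements it; callers probe with
// instanceof, so file formats without the capability need no changes at all.
public class SupportsDeleteFilterSketch {

  interface ReadBuilder {
    String build();
  }

  // Optional mix-in: the delete filter is pushed down only if supported.
  interface SupportsDeleteFilter {
    void deleteFilter(IntPredicate isLiveRow); // illustrative signature
  }

  // Stands in for a Spark vectorized Parquet reader builder.
  static class VectorizedParquetLikeBuilder implements ReadBuilder, SupportsDeleteFilter {
    private IntPredicate isLiveRow = pos -> true;

    @Override public void deleteFilter(IntPredicate isLiveRow) {
      this.isLiveRow = isLiveRow;
    }

    @Override public String build() {
      // Report whether row position 42 survives the pushed-down delete filter.
      return "vectorized reader, delete-filter aware: " + isLiveRow.test(42);
    }
  }

  static String plan(ReadBuilder builder, IntPredicate isLiveRow) {
    if (builder instanceof SupportsDeleteFilter) {
      ((SupportsDeleteFilter) builder).deleteFilter(isLiveRow); // pushdown
    }
    return builder.build();
  }

  public static void main(String[] args) {
    // Position 42 is deleted in this toy example.
    System.out.println(plan(new VectorizedParquetLikeBuilder(), pos -> pos != 42));
  }
}
```

Keeping the capability out of the core ReadBuilder matches the thread's point that Spark's vectorized Parquet path is currently the only implementation of this feature.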
>> >> >>>>> I have addressed most of the comments, but a few questions still >> remain which might merit a bit wider audience: >> >> >>>>> >> >> >>>>> We should decide on the expected filtering behavior when the >> filters are pushed down to the readers. Currently the filters are applied >> on a best-effort basis by the file format readers. Some readers (Avro) just skip >> them altogether. There was a suggestion on the PR that we might enforce >> stricter requirements, where the readers either reject part of the filters >> or apply them fully. >> >> >>>>> Batch sizes are currently parameters for the reader builders, >> which can be set for non-vectorized readers too, which could be confusing. >> >> >>>>> Currently the Spark batch reader uses different configuration >> objects, ParquetBatchReadConf and OrcBatchReadConf, as requested by the >> reviewers of the Comet PR. There was a suggestion on the current PR to use >> a common configuration instead. >> >> >>>>> >> >> >>>>> I would be interested in hearing your thoughts about these >> topics. >> >> >>>>> >> >> >>>>> My current take: >> >> >>>>> >> >> >>>>> File format filters: I am leaning towards keeping the current >> lenient behavior, especially since Bloom filters are not able to do full >> filtering and are often used as a way to filter out unwanted records. >> Another option would be to implement a secondary filtering step inside the file >> formats themselves, which I think would cause extra complexity and possible >> code duplication. Whatever the decision here, I would suggest moving this >> out to a next PR, as the current changeset is big enough as it is. >> >> >>>>> Batch size configuration: Currently this is the only property >> which differs between the batch readers and the non-vectorized readers.
I >> see 3 possible solutions: >> >> >>>>> >> >> >>>>> Create different builders for vectorized and non-vectorized >> reads - I don't think the current solution is confusing enough to be worth the >> extra class >> >> >>>>> We could put this into the reader configuration property set - >> This could work, but it would "hide" a configuration option which is valid >> for both Parquet and ORC readers >> >> >>>>> We could keep things as they are now - I would choose this one, but >> I don't have a strong opinion here >> >> >>>>> >> >> >>>>> Spark configuration: TBH, I'm open to both solutions and happy to >> move in the direction the community decides on >> >> >>>>> >> >> >>>>> Thanks, >> >> >>>>> Peter >> >> >>>>> >> >> >>>>> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (Mar 14, 2025, Fri 16:31): >> >> >>>>>> >> >> >>>>>> Hi Peter >> >> >>>>>> >> >> >>>>>> Thanks for the update. I will do a new pass on the PR. >> >> >>>>>> >> >> >>>>>> Regards >> >> >>>>>> JB >> >> >>>>>> >> >> >>>>>> On Thu, Mar 13, 2025 at 1:16 PM Péter Váry < >> peter.vary.apa...@gmail.com> wrote: >> >> >>>>>> > >> >> >>>>>> > Hi Team, >> >> >>>>>> > I have rebased the File Format API proposal ( >> https://github.com/apache/iceberg/pull/12298) to include the new changes >> needed for the Variant types. I would love to hear your feedback, >> especially Dan and Ryan, as you were the most active during our >> discussions. If I can help in any way to make the review easier, please let >> me know. >> >> >>>>>> > Thanks, >> >> >>>>>> > Peter >> >> >>>>>> > >> >> >>>>>> > Péter Váry <peter.vary.apa...@gmail.com> wrote (Feb 28, 2025, Fri 17:50): >> >> >>>>>> >> >> >> >>>>>> >> Hi everyone, >> >> >>>>>> >> Thanks for all of the actionable, relevant feedback on the >> PR (https://github.com/apache/iceberg/pull/12298). >> >> >>>>>> >> I have updated the code to address most of it. Please check if >> you agree with the general approach.
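The "lenient" filtering behavior discussed earlier in the thread rests on the engine re-applying the exact filter on top of whatever the reader returns, the same way Bloom filters and min/max statistics are used: they can prove a value is absent, never that it is present. A toy illustration of that contract (assumed semantics for the discussion, not PR code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Toy illustration of best-effort pushdown: the "reader" prunes row groups
// using min/max stats (it may keep false positives), and the engine applies
// the exact predicate afterwards. Correctness never depends on the reader
// honoring the filter, which is why a reader like Avro can skip it entirely.
public class LenientFilterSketch {

  record RowGroup(int min, int max, List<Integer> rows) {}

  // Best-effort pruning: skip a row group only if stats prove no row can match.
  static List<Integer> readWithPruning(List<RowGroup> groups, int wanted) {
    List<Integer> out = new ArrayList<>();
    for (RowGroup g : groups) {
      if (wanted < g.min() || wanted > g.max()) {
        continue; // provably no match: safe to skip
      }
      out.addAll(g.rows()); // may still contain non-matching rows
    }
    return out;
  }

  // The engine's exact filter, applied on top of whatever the reader returns.
  static List<Integer> exactFilter(List<Integer> rows, IntPredicate predicate) {
    List<Integer> out = new ArrayList<>();
    for (int row : rows) {
      if (predicate.test(row)) {
        out.add(row);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<RowGroup> groups = List.of(
        new RowGroup(1, 5, List.of(1, 3, 5)),
        new RowGroup(10, 20, List.of(10, 15, 20)));

    List<Integer> pruned = readWithPruning(groups, 15); // keeps second group whole
    List<Integer> result = exactFilter(pruned, r -> r == 15);
    System.out.println(pruned + " -> " + result);
  }
}
```

Under a stricter contract, the reader would instead have to report which filters it applied fully so the engine could skip the second pass; the lenient contract avoids that bookkeeping at the cost of always re-filtering.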
>> >> >>>>>> >> If there is a consensus about the general approach, I could >> separate the PR into smaller pieces so we can have an easier time >> reviewing and merging those step-by-step. >> >> >>>>>> >> Thanks, >> >> >>>>>> >> Peter >> >> >>>>>> >> >> >> >>>>>> >> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (Feb 20, 2025, Thu 14:14): >> >> >>>>>> >>> >> >> >>>>>> >>> Hi Peter >> >> >>>>>> >>> >> >> >>>>>> >>> Sorry for the late reply on this. >> >> >>>>>> >>> >> >> >>>>>> >>> I did a pass on the proposal; it's very interesting and >> well written. >> >> >>>>>> >>> I like the DataFile API, and it is definitely worth discussing all >> together. >> >> >>>>>> >>> >> >> >>>>>> >>> Maybe we can schedule a specific meeting to discuss the >> DataFile API? >> >> >>>>>> >>> >> >> >>>>>> >>> Thoughts? >> >> >>>>>> >>> >> >> >>>>>> >>> Regards >> >> >>>>>> >>> JB >> >> >>>>>> >>> >> >> >>>>>> >>> On Tue, Feb 11, 2025 at 5:46 PM Péter Váry < >> peter.vary.apa...@gmail.com> wrote: >> >> >>>>>> >>> > >> >> >>>>>> >>> > Hi Team, >> >> >>>>>> >>> > >> >> >>>>>> >>> > As mentioned earlier on our Community Sync, I am exploring >> the possibility of defining a FileFormat API for accessing different file >> formats. I have put together a proposal based on my findings. >> >> >>>>>> >>> > >> >> >>>>>> >>> > ------------------- >> >> >>>>>> >>> > Iceberg currently supports 3 different file formats: >> Avro, Parquet, ORC. With the introduction of the Iceberg V3 specification, many >> new features are being added to Iceberg. Some of these features, like new column >> types and default values, require changes at the file format level. The changes >> are added by individual developers with different focus on the different >> file formats. As a result, not all of the features are available for every >> supported file format.
>> >> >>>>>> >>> > Also, there are emerging file formats like Vortex [1] and >> Lance [2] which, either by specialization or by applying newer research >> results, could provide better alternatives for certain use cases like random >> access to data, or storing ML models. >> >> >>>>>> >>> > ------------------- >> >> >>>>>> >>> > >> >> >>>>>> >>> > Please check the detailed proposal [3] and the Google >> document [4], and comment there or reply on the dev list if you have any >> suggestions. >> >> >>>>>> >>> > >> >> >>>>>> >>> > Thanks, >> >> >>>>>> >>> > Peter >> >> >>>>>> >>> > >> >> >>>>>> >>> > [1] - https://github.com/spiraldb/vortex >> >> >>>>>> >>> > [2] - https://lancedb.github.io/lance/ >> >> >>>>>> >>> > [3] - https://github.com/apache/iceberg/issues/12225 >> >> >>>>>> >>> > [4] - >> https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds >> >> >>>>>> >>> > >> >