Hi, Peter: Thanks for the effort. I totally agree with splitting it into smaller PRs to move forward.
I'm quite interested in this topic, so please ping me on those split PRs and I'll help review.

On Mon, Apr 14, 2025 at 11:22 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> Hi Peter
>
> Awesome ! Thank you so much !
> I will do a new pass.
>
> Regards
> JB
>
> On Fri, Apr 11, 2025 at 3:48 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
> >
> > Hi JB,
> >
> > Separated out the proposed interfaces to a new PR: https://github.com/apache/iceberg/pull/12774.
> > Reviewers can check that out if they are only interested in what the new API would look like.
> >
> > Thanks,
> > Peter
> >
> > Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Thu, Apr 10, 2025, 18:25):
> >>
> >> Hi Peter
> >>
> >> Thanks for the ping about the PR.
> >>
> >> Maybe, to facilitate the review and move forward faster, we should
> >> split the PR into smaller PRs:
> >> - one with the interfaces (ReadBuilder, AppenderBuilder, ObjectModel,
> >> AppenderBuilder, DataWriterBuilder, ...)
> >> - one for each file provider (Parquet, Avro, ORC)
> >>
> >> Thoughts ? I can help on the split if needed.
> >>
> >> Regards
> >> JB
> >>
> >> On Thu, Apr 10, 2025 at 5:16 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
> >> >
> >> > Since the 1.9.0 release candidate has been created, I would like to resurrect this PR: https://github.com/apache/iceberg/pull/12298 to ensure that we have as long a testing period as possible for it.
> >> >
> >> > To recap, here is what the PR does after the review rounds:
> >> >
> >> > Created 3 interface classes which are implemented by the file formats:
> >> >
> >> > ReadBuilder - Builder for reading data from data files
> >> > AppenderBuilder - Builder for writing data to data files
> >> > ObjectModel - Provides ReadBuilders and AppenderBuilders for a specific data file format and object model pair
> >> >
> >> > Updated the Parquet, Avro, and ORC implementations for these interfaces, and deprecated the old reader/writer APIs
> >> > Created interface classes which will be used by the actual readers/writers of the data files:
> >> >
> >> > AppenderBuilder - Builder for writing a file
> >> > DataWriterBuilder - Builder for generating a data file
> >> > PositionDeleteWriterBuilder - Builder for generating a position delete file
> >> > EqualityDeleteWriterBuilder - Builder for generating an equality delete file
> >> > No ReadBuilder here - the file format reader builder is reused
> >> >
> >> > Created a WriterBuilder class which implements the interfaces above (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) based on a provided file format specific AppenderBuilder
> >> > Created an ObjectModelRegistry which stores the available ObjectModels, and from which engines and users can request the readers (ReadBuilder) and writers (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder).
> >> > Created the appropriate ObjectModels:
> >> >
> >> > GenericObjectModels - for reading and writing Iceberg Records
> >> > SparkObjectModels - for reading (vectorized and non-vectorized) and writing Spark InternalRow/ColumnarBatch objects
> >> > FlinkObjectModels - for reading and writing Flink RowData objects
> >> > An Arrow object model is also registered for vectorized reads of Parquet files into Arrow ColumnarBatch objects
> >> >
> >> > Updated the production code where the reading and writing happens to use the ObjectModelRegistry and the new reader/writer interfaces to access data files
> >> > Kept the testing code intact to ensure that the new API/code is not breaking anything
> >> >
> >> > The original change was not small, and grew substantially during the review rounds. So if you have questions, or if I can do anything to make the review easier, don't hesitate to ask. I am happy to do anything to move this forward.
> >> >
> >> > Thanks,
> >> > Peter
> >> >
> >> > Péter Váry <peter.vary.apa...@gmail.com> wrote (on Wed, Mar 26, 2025, 14:54):
> >> >>
> >> >> Hi everyone,
> >> >>
> >> >> I have updated the File Format API PR (https://github.com/apache/iceberg/pull/12298) based on the answers and review comments.
> >> >>
> >> >> I would like to merge this only after the 1.9.0 release, so we have more time to find and solve any issues before this goes into a release for the users.
> >> >>
> >> >> For this I have updated the deprecation comments accordingly.
> >> >> I would like to ask you to review the PR, so we can iron out any requested changes and be ready to merge as soon as possible after the 1.9.0 release.
> >> >>
> >> >> Thanks,
> >> >> Peter
> >> >>
> >> >> Péter Váry <peter.vary.apa...@gmail.com> wrote (on Fri, Mar 21, 2025, 14:32):
> >> >>>
> >> >>> Hi Renjie,
> >> >>>
> >> >>> > 1.
File format filters
> >> >>> >
> >> >>> > Do the filters include filter expressions from both the user query and the delete filter?
> >> >>>
> >> >>> The current discussion is about the filters from the user query.
> >> >>>
> >> >>> About the delete filter:
> >> >>> Based on the suggestions on the PR, I have moved the delete filter out of the main API. Created a `SupportsDeleteFilter` interface for it, which would allow pushing down the filter to the Parquet vectorized readers in Spark, as this is the only place where we currently implement this feature.
> >> >>>
> >> >>> Renjie Liu <liurenjie2...@gmail.com> wrote (on Fri, Mar 21, 2025, 14:11):
> >> >>>>
> >> >>>> Hi, Peter:
> >> >>>>
> >> >>>> Thanks for the effort on this.
> >> >>>>
> >> >>>> 1. File format filters
> >> >>>>
> >> >>>> Do the filters include filter expressions from both the user query and the delete filter?
> >> >>>>
> >> >>>> For filters from the user query, I agree with you that we should keep the current behavior.
> >> >>>>
> >> >>>> For delete filters associated with data files, at first I thought file format readers should not care about this. But now I realize that maybe we need to also push them to the file reader; this is useful when the `IS_DELETED` metadata column is not necessary and we could use these filters (position deletes, etc.) to further prune data.
> >> >>>>
> >> >>>> But anyway, I agree that we could postpone it to a follow-up PR.
> >> >>>>
> >> >>>> 2. Batch size configuration
> >> >>>>
> >> >>>> I'm leaning toward option 2.
> >> >>>>
> >> >>>> 3. Spark configuration
> >> >>>>
> >> >>>> I'm leaning towards using different configuration objects.
> >> >>>>
> >> >>>> On Thu, Mar 20, 2025 at 10:23 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
> >> >>>>>
> >> >>>>> Hi Team,
> >> >>>>> Thanks everyone for the reviews on https://github.com/apache/iceberg/pull/12298!
> >> >>>>> I have addressed most of the comments, but a few questions still remain which might merit a wider audience:
> >> >>>>>
> >> >>>>> We should decide on the expected filtering behavior when filters are pushed down to the readers. Currently the filters are applied on a best-effort basis by the file format readers. Some readers (Avro) just skip them altogether. There was a suggestion on the PR that we might enforce stricter requirements: the readers either reject part of the filters, or they apply them fully.
> >> >>>>> Batch sizes are currently parameters for the reader builders which can be set for non-vectorized readers too, which could be confusing.
> >> >>>>> Currently the Spark batch reader uses different configuration objects, ParquetBatchReadConf and OrcBatchReadConf, as requested by the reviewers of the Comet PR. There was a suggestion on the current PR to use a common configuration instead.
> >> >>>>>
> >> >>>>> I would be interested in hearing your thoughts on these topics.
> >> >>>>>
> >> >>>>> My current take:
> >> >>>>>
> >> >>>>> File format filters: I am leaning towards keeping the current lenient behavior, especially since Bloom filters are not able to do full filtering and are often used as a way to filter out unwanted records. Another option would be to implement a secondary filtering inside the file formats themselves, which I think would cause extra complexity and possible code duplication. Whatever the decision here, I would suggest moving this out to a follow-up PR, as the current changeset is big enough as it is.
> >> >>>>> Batch size configuration: Currently this is the only property which differs between the batch readers and the non-vectorized readers.
I see 3 possible solutions:
> >> >>>>>
> >> >>>>> Create different builders for vectorized and non-vectorized reads - I don't think the current solution is confusing enough to be worth the extra class
> >> >>>>> We could put this into the reader configuration property set - This could work, but would "hide" a configuration option which is valid for both Parquet and ORC readers
> >> >>>>> We could keep things as they are now - I would choose this one, but I don't have a strong opinion here
> >> >>>>>
> >> >>>>> Spark configuration: TBH, I'm open to both solutions and happy to move in the direction the community decides on
> >> >>>>>
> >> >>>>> Thanks,
> >> >>>>> Peter
> >> >>>>>
> >> >>>>> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Fri, Mar 14, 2025, 16:31):
> >> >>>>>>
> >> >>>>>> Hi Peter
> >> >>>>>>
> >> >>>>>> Thanks for the update. I will do a new pass on the PR.
> >> >>>>>>
> >> >>>>>> Regards
> >> >>>>>> JB
> >> >>>>>>
> >> >>>>>> On Thu, Mar 13, 2025 at 1:16 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
> >> >>>>>> >
> >> >>>>>> > Hi Team,
> >> >>>>>> > I have rebased the File Format API proposal (https://github.com/apache/iceberg/pull/12298) to include the new changes needed for the Variant types. I would love to hear your feedback, especially Dan and Ryan, as you were the most active during our discussions. If I can help in any way to make the review easier, please let me know.
> >> >>>>>> > Thanks,
> >> >>>>>> > Peter
> >> >>>>>> >
> >> >>>>>> > Péter Váry <peter.vary.apa...@gmail.com> wrote (on Fri, Feb 28, 2025, 17:50):
> >> >>>>>> >>
> >> >>>>>> >> Hi everyone,
> >> >>>>>> >> Thanks for all of the actionable, relevant feedback on the PR (https://github.com/apache/iceberg/pull/12298).
> >> >>>>>> >> Updated the code to address most of it. Please check if you agree with the general approach.
> >> >>>>>> >> If there is a consensus about the general approach, I could
separate out the PR into smaller pieces so we can have an easier time reviewing and merging those step-by-step.
> >> >>>>>> >> Thanks,
> >> >>>>>> >> Peter
> >> >>>>>> >>
> >> >>>>>> >> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Thu, Feb 20, 2025, 14:14):
> >> >>>>>> >>>
> >> >>>>>> >>> Hi Peter
> >> >>>>>> >>>
> >> >>>>>> >>> Sorry for the late reply on this.
> >> >>>>>> >>>
> >> >>>>>> >>> I did a pass on the proposal; it's very interesting and well written.
> >> >>>>>> >>> I like the DataFile API, and it is definitely worth discussing all together.
> >> >>>>>> >>>
> >> >>>>>> >>> Maybe we can schedule a specific meeting to discuss the DataFile API ?
> >> >>>>>> >>>
> >> >>>>>> >>> Thoughts ?
> >> >>>>>> >>>
> >> >>>>>> >>> Regards
> >> >>>>>> >>> JB
> >> >>>>>> >>>
> >> >>>>>> >>> On Tue, Feb 11, 2025 at 5:46 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
> >> >>>>>> >>> >
> >> >>>>>> >>> > Hi Team,
> >> >>>>>> >>> >
> >> >>>>>> >>> > As mentioned earlier on our Community Sync, I am exploring the possibility of defining a FileFormat API for accessing different file formats. I have put together a proposal based on my findings.
> >> >>>>>> >>> >
> >> >>>>>> >>> > -------------------
> >> >>>>>> >>> > Iceberg currently supports 3 different file formats: Avro, Parquet, ORC. With the introduction of the Iceberg V3 specification, many new features are being added to Iceberg. Some of these features, like new column types and default values, require changes at the file format level. The changes are added by individual developers with different focus on the different file formats. As a result, not all of the features are available for every supported file format.
> >> >>>>>> >>> > Also, there are emerging file formats like Vortex [1] or Lance [2] which, either by specialization or by applying newer research results, could provide better alternatives for certain use-cases like random access for data, or storing ML models.
> >> >>>>>> >>> > -------------------
> >> >>>>>> >>> >
> >> >>>>>> >>> > Please check the detailed proposal [3] and the Google document [4], and comment there or reply on the dev list if you have any suggestions.
> >> >>>>>> >>> >
> >> >>>>>> >>> > Thanks,
> >> >>>>>> >>> > Peter
> >> >>>>>> >>> >
> >> >>>>>> >>> > [1] - https://github.com/spiraldb/vortex
> >> >>>>>> >>> > [2] - https://lancedb.github.io/lance/
> >> >>>>>> >>> > [3] - https://github.com/apache/iceberg/issues/12225
> >> >>>>>> >>> > [4] - https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds
> >> >>>>>> >>> >
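[Editor's illustration] The thread describes the proposed pieces (ReadBuilder, ObjectModel, ObjectModelRegistry, SupportsDeleteFilter) only in prose. Below is a minimal, hypothetical Java sketch of how those pieces could fit together. The names mirror the PR description, but every signature here is an illustrative assumption, not the actual API from https://github.com/apache/iceberg/pull/12298.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only -- signatures are assumptions, not the real Iceberg API.
public class ObjectModelSketch {

  // Builder for reading data from data files (filters are best-effort, per the thread).
  interface ReadBuilder<D> {
    ReadBuilder<D> filter(String expression);
    Iterable<D> build();
  }

  // Optional capability modeled on the SupportsDeleteFilter idea from the thread:
  // lets engines push the delete filter down to a reader that supports it.
  interface SupportsDeleteFilter<D> {
    ReadBuilder<D> deleteFilter(long[] deletedPositions);
  }

  // Pairs a file format with an engine object model and hands out builders.
  interface ObjectModel<D> {
    String format();     // e.g. "parquet"
    String modelName();  // e.g. "generic", "spark", "flink"
    ReadBuilder<D> readBuilder();
  }

  // Registry keyed by (format, object model name), as the recap describes:
  // engines and users request readers/writers from here.
  static class ObjectModelRegistry {
    private final Map<String, ObjectModel<?>> models = new HashMap<>();

    void register(ObjectModel<?> model) {
      models.put(model.format() + "/" + model.modelName(), model);
    }

    ObjectModel<?> lookup(String format, String modelName) {
      return models.get(format + "/" + modelName);
    }
  }

  // Demo: register a toy "generic" model for Parquet, then look it up.
  static String demo() {
    ObjectModelRegistry registry = new ObjectModelRegistry();
    registry.register(new ObjectModel<String>() {
      public String format() { return "parquet"; }
      public String modelName() { return "generic"; }
      public ReadBuilder<String> readBuilder() { return null; } // stubbed out
    });
    return registry.lookup("parquet", "generic").modelName();
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

The point of the sketch is the lookup shape: one registry entry per (file format, object model) pair, so e.g. Spark vectorized Parquet reads and Flink RowData reads are just different registered models.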