Hi JB,

Separated out the proposed interfaces into a new PR: https://github.com/apache/iceberg/pull/12774. Reviewers can check that out if they are only interested in what the new API would look like.

Thanks,
Peter
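For a quick impression of the proposed shape, here is a rough sketch of the three core interfaces, pieced together from the recap further down the thread. The interface names come from the thread itself; the signatures and generics are only guesses, not the exact PR #12774 API:

```java
// Illustrative sketch only: the interface names are from the thread, but
// the method signatures are assumptions, not the exact PR #12774 API.
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.Schema;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.io.FileAppender;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;

/** Builder for reading data from data files. */
interface ReadBuilder<B extends ReadBuilder<B>> {
  B project(Schema schema);          // projection pushdown to the format
  B filter(Expression expression);   // best-effort filter pushdown
  <D> CloseableIterable<D> build();  // rows in the object model's type
}

/** Builder for writing data to data files. */
interface AppenderBuilder<B extends AppenderBuilder<B>> {
  B schema(Schema schema);
  <D> FileAppender<D> build();
}

/** One object model per (file format, in-memory row type) pair. */
interface ObjectModel<B extends ReadBuilder<B>, A extends AppenderBuilder<A>> {
  FileFormat format();               // PARQUET, AVRO, ORC, ...
  String name();                     // e.g. "generic", "spark", "flink"
  B readBuilder(InputFile file);
  A appenderBuilder(OutputFile file);
}
```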
Jean-Baptiste Onofré <j...@nanthrax.net> wrote (Apr 10, 2025, Thu, 18:25):
> Hi Peter
>
> Thanks for the ping about the PR.
>
> Maybe, to facilitate the review and move forward faster, we should
> split the PR into smaller PRs:
> - one with the interfaces (ReadBuilder, AppenderBuilder, ObjectModel,
> AppenderBuilder, DataWriterBuilder, ...)
> - one for each file provider (Parquet, Avro, ORC)
>
> Thoughts? I can help with the split if needed.
>
> Regards
> JB
>
> On Thu, Apr 10, 2025 at 5:16 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
> >
> > Since the 1.9.0 release candidate has been created, I would like to resurrect this PR: https://github.com/apache/iceberg/pull/12298 to ensure that we have as long a testing period as possible for it.
> >
> > To recap, here is what the PR does after the review rounds:
> >
> > Created 3 interface classes which are implemented by the file formats:
> > - ReadBuilder - builder for reading data from data files
> > - AppenderBuilder - builder for writing data to data files
> > - ObjectModel - provides ReadBuilders and AppenderBuilders for a specific data file format and object model pair
> >
> > Updated the Parquet, Avro, ORC implementations for these interfaces, and deprecated the old reader/writer APIs.
> >
> > Created interface classes which will be used by the actual readers/writers of the data files:
> > - AppenderBuilder - builder for writing a file
> > - DataWriterBuilder - builder for generating a data file
> > - PositionDeleteWriterBuilder - builder for generating a position delete file
> > - EqualityDeleteWriterBuilder - builder for generating an equality delete file
> > - No ReadBuilder here - the file format reader builder is reused
> >
> > Created a WriterBuilder class which implements the interfaces above (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) based on a provided file format specific AppenderBuilder.
> >
> > Created an ObjectModelRegistry which stores the available ObjectModels, and from which engines and users can request the readers (ReadBuilder) and writers (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) - see the sketch below.
> >
> > Created the appropriate ObjectModels:
> > - GenericObjectModels - for reading and writing Iceberg Records
> > - SparkObjectModels - for reading (vectorized and non-vectorized) and writing Spark InternalRow/ColumnarBatch objects
> > - FlinkObjectModels - for reading and writing Flink RowData objects
> > - An Arrow object model is also registered for vectorized reads of Parquet files into Arrow ColumnarBatch objects
> >
> > Updated the production code where the reading and writing happens to use the ObjectModelRegistry and the new reader/writer interfaces to access data files.
> >
> > Kept the testing code intact to ensure that the new API/code does not break anything.
> >
> > The original change was not small, and grew substantially during the review rounds. So if you have questions, or I can do anything to make the review easier, don't hesitate to ask. I am happy to do anything to move this forward.
> >
> > Thanks,
> > Peter
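To make the registry part concrete, a minimal sketch of how such a registry could hand out builders, using the interface shapes sketched near the top of the thread. Only the ObjectModelRegistry name and its role come from the thread; the lookup methods and key scheme are assumptions:

```java
// Hypothetical registry sketch; method names and the key scheme are
// illustrative, not the PR's verbatim API.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;

final class ObjectModelRegistrySketch {
  // Registered object models, keyed by (file format, object model name).
  private static final Map<String, ObjectModel<?, ?>> MODELS =
      new ConcurrentHashMap<>();

  private ObjectModelRegistrySketch() {}

  /** Called once per object model, e.g. when GenericObjectModels loads. */
  static void registerObjectModel(ObjectModel<?, ?> model) {
    MODELS.put(key(model.format(), model.name()), model);
  }

  /** Engines request a reader builder for a format + object model pair. */
  static ReadBuilder<?> readBuilder(
      FileFormat format, String modelName, InputFile file) {
    return MODELS.get(key(format, modelName)).readBuilder(file);
  }

  /** Same idea on the write side; the WriterBuilder described above would
      wrap this to expose the data/delete writer builders. */
  static AppenderBuilder<?> appenderBuilder(
      FileFormat format, String modelName, OutputFile file) {
    return MODELS.get(key(format, modelName)).appenderBuilder(file);
  }

  private static String key(FileFormat format, String modelName) {
    return format.name() + "/" + modelName;
  }
}
```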
> > Péter Váry <peter.vary.apa...@gmail.com> wrote (Mar 26, 2025, Wed, 14:54):
> >>
> >> Hi everyone,
> >>
> >> I have updated the File Format API PR (https://github.com/apache/iceberg/pull/12298) based on the answers and review comments.
> >>
> >> I would like to merge this only after the 1.9.0 release so we have more time to find and fix any issues before this goes into a release for the users.
> >>
> >> For this I have updated the deprecation comments accordingly.
> >> I would like to ask you to review the PR, so we can iron out any requested changes and be ready for the merge as soon as possible after the 1.9.0 release.
> >>
> >> Thanks,
> >> Peter
> >>
> >> Péter Váry <peter.vary.apa...@gmail.com> wrote (Mar 21, 2025, Fri, 14:32):
> >>>
> >>> Hi Renjie,
> >>>
> >>> > 1. File format filters
> >>> >
> >>> > Do the filters include filter expressions from both the user query and the delete filter?
> >>>
> >>> The current discussion is about the filters from the user query.
> >>>
> >>> About the delete filter:
> >>> Based on the suggestions on the PR, I have moved the delete filter out from the main API. Created a `SupportsDeleteFilter` interface for it, which would allow pushing the delete filter down to the Parquet vectorized readers in Spark, as this is the only place where we have currently implemented this feature.
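Only the `SupportsDeleteFilter` name appears in the thread, so the following is just a guess at the shape such a mix-in could take; the method name and parameter type are assumptions:

```java
// Hypothetical sketch of the SupportsDeleteFilter mix-in mentioned above;
// everything except the interface name is an assumption.
import org.apache.iceberg.data.DeleteFilter;

interface SupportsDeleteFilter<B> {
  // Hands the engine-side delete filter to a reader builder, so the
  // vectorized Parquet path in Spark can drop deleted rows during the
  // scan instead of filtering them afterwards.
  B deleteFilter(DeleteFilter<?> filter);
}
```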
> >>>
> >>> Renjie Liu <liurenjie2...@gmail.com> wrote (Mar 21, 2025, Fri, 14:11):
> >>>>
> >>>> Hi, Peter:
> >>>>
> >>>> Thanks for the effort on this.
> >>>>
> >>>> 1. File format filters
> >>>>
> >>>> Do the filters include filter expressions from both the user query and the delete filter?
> >>>>
> >>>> For filters from the user query, I agree with you that we should keep the current behavior.
> >>>>
> >>>> For delete filters associated with data files, at first I thought file format readers should not care about this. But now I realize that maybe we need to push them to the file reader as well; this is useful when the `IS_DELETED` metadata column is not necessary and we could use these filters (position deletes, etc.) to further prune data.
> >>>>
> >>>> But anyway, I agree that we could postpone it to a follow-up PR.
> >>>>
> >>>> 2. Batch size configuration
> >>>>
> >>>> I'm leaning toward option 2.
> >>>>
> >>>> 3. Spark configuration
> >>>>
> >>>> I'm leaning towards using different configuration objects.
> >>>>
> >>>> On Thu, Mar 20, 2025 at 10:23 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
> >>>>>
> >>>>> Hi Team,
> >>>>> Thanks everyone for the reviews on https://github.com/apache/iceberg/pull/12298!
> >>>>> I have addressed most of the comments, but a few questions still remain which might merit a wider audience:
> >>>>>
> >>>>> - We should decide on the expected filtering behavior when the filters are pushed down to the readers. Currently the filters are applied as best effort by the file format readers. Some readers (Avro) just skip them altogether. There was a suggestion on the PR that we might enforce stricter requirements: the readers either reject part of the filters, or they apply them fully.
> >>>>> - Batch sizes are currently parameters of the reader builders, so they could be set for non-vectorized readers too, which could be confusing.
> >>>>> - Currently the Spark batch reader uses different configuration objects, ParquetBatchReadConf and OrcBatchReadConf, as requested by the reviewers of the Comet PR. There was a suggestion on the current PR to use a common configuration instead.
> >>>>>
> >>>>> I would be interested in hearing your thoughts about these topics.
> >>>>>
> >>>>> My current take:
> >>>>>
> >>>>> File format filters: I am leaning towards keeping the current lenient behavior, especially since Bloom filters are not able to do full filtering and are often used as a way to filter out unwanted records. Another option would be to implement secondary filtering inside the file formats themselves, which I think would cause extra complexity and possible code duplication. Whatever the decision here, I would suggest moving this out to a follow-up PR, as the current changeset is big enough as it is.
> >>>>> Batch size configuration: Currently this is the only property which differs between the batch readers and the non-vectorized readers. I see 3 possible solutions (sketched after this message):
> >>>>> - Create different builders for vectorized and non-vectorized reads - I don't think the current solution is confusing enough to be worth the extra class.
> >>>>> - We could put this into the reader configuration property set - this could work, but it would "hide" a configuration option which is valid for both the Parquet and ORC readers.
> >>>>> - We could keep things as they are now - I would choose this one, but I don't have a strong opinion here.
> >>>>> Spark configuration: TBH, I'm open to both solutions and happy to move in whichever direction the community decides on.
> >>>>>
> >>>>> Thanks,
> >>>>> Peter
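A minimal sketch of what the first two batch-size options could look like, built on the ReadBuilder shape sketched at the top of the thread; the method names and the property key are assumptions:

```java
// Sketch of the batch-size options discussed above, against the earlier
// ReadBuilder sketch. Names are illustrative, not the PR's API.

// Option 1: a dedicated builder type, so only vectorized reads expose a
// batch size at all.
interface VectorizedReadBuilder<B extends VectorizedReadBuilder<B>>
    extends ReadBuilder<B> {
  B recordsPerBatch(int batchSize);
}

// Option 2: no dedicated method; the batch size travels in the generic
// property set, e.g. set("read.batch-size", "4096"), which works for both
// Parquet and ORC but hides the option from the builder's signature.
interface ConfigurableReadBuilder<B extends ConfigurableReadBuilder<B>>
    extends ReadBuilder<B> {
  B set(String key, String value);
}

// Option 3 (the status quo): recordsPerBatch(int) sits directly on the
// shared ReadBuilder, so non-vectorized readers see it too, where it is
// simply ignored.
```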
> >>>>>
> >>>>> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (Mar 14, 2025, Fri, 16:31):
> >>>>>>
> >>>>>> Hi Peter
> >>>>>>
> >>>>>> Thanks for the update. I will do a new pass on the PR.
> >>>>>>
> >>>>>> Regards
> >>>>>> JB
> >>>>>>
> >>>>>> On Thu, Mar 13, 2025 at 1:16 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
> >>>>>> >
> >>>>>> > Hi Team,
> >>>>>> > I have rebased the File Format API proposal (https://github.com/apache/iceberg/pull/12298) to include the new changes needed for the Variant types. I would love to hear your feedback, especially from Dan and Ryan, as you were the most active during our discussions. If I can help in any way to make the review easier, please let me know.
> >>>>>> > Thanks,
> >>>>>> > Peter
> >>>>>> >
> >>>>>> > Péter Váry <peter.vary.apa...@gmail.com> wrote (Feb 28, 2025, Fri, 17:50):
> >>>>>> >>
> >>>>>> >> Hi everyone,
> >>>>>> >> Thanks for all of the actionable, relevant feedback on the PR (https://github.com/apache/iceberg/pull/12298).
> >>>>>> >> Updated the code to address most of it. Please check if you agree with the general approach.
> >>>>>> >> If there is a consensus about the general approach, I could separate the PR into smaller pieces so we can review and merge those step-by-step more easily.
> >>>>>> >> Thanks,
> >>>>>> >> Peter
> >>>>>> >>
> >>>>>> >> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (Feb 20, 2025, Thu, 14:14):
> >>>>>> >>>
> >>>>>> >>> Hi Peter
> >>>>>> >>>
> >>>>>> >>> Sorry for the late reply on this.
> >>>>>> >>>
> >>>>>> >>> I did a pass on the proposal; it's very interesting and well written.
> >>>>>> >>> I like the DataFile API and it is definitely worth discussing all together.
> >>>>>> >>>
> >>>>>> >>> Maybe we can schedule a specific meeting to discuss the DataFile API?
> >>>>>> >>>
> >>>>>> >>> Thoughts?
> >>>>>> >>>
> >>>>>> >>> Regards
> >>>>>> >>> JB
> >>>>>> >>>
> >>>>>> >>> On Tue, Feb 11, 2025 at 5:46 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
> >>>>>> >>> >
> >>>>>> >>> > Hi Team,
> >>>>>> >>> >
> >>>>>> >>> > As mentioned earlier on our Community Sync, I am exploring the possibility of defining a FileFormat API for accessing different file formats. I have put together a proposal based on my findings.
> >>>>>> >>> >
> >>>>>> >>> > -------------------
> >>>>>> >>> > Iceberg currently supports 3 different file formats: Avro, Parquet, ORC. With the introduction of the Iceberg V3 specification, many new features are being added to Iceberg. Some of these features, like new column types and default values, require changes at the file format level. The changes are added by individual developers with different focus on the different file formats. As a result, not all of the features are available for every supported file format.
> >>>>>> >>> > Also, there are emerging file formats like Vortex [1] or Lance [2] which, either by specialization or by applying newer research results, could provide better alternatives for certain use cases, like random access to data or storing ML models.
> >>>>>> >>> > -------------------
> >>>>>> >>> >
> >>>>>> >>> > Please check the detailed proposal [3] and the Google document [4], and comment there or reply on the dev list if you have any suggestions.
> >>>>>> >>> >
> >>>>>> >>> > Thanks,
> >>>>>> >>> > Peter
> >>>>>> >>> >
> >>>>>> >>> > [1] - https://github.com/spiraldb/vortex
> >>>>>> >>> > [2] - https://lancedb.github.io/lance/
> >>>>>> >>> > [3] - https://github.com/apache/iceberg/issues/12225
> >>>>>> >>> > [4] - https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds