Hi Peter,

Thanks for the ping about the PR.
Maybe, to facilitate the review and move forward faster, we should split the PR into smaller PRs:
- one with the interfaces (ReadBuilder, AppenderBuilder, ObjectModel, DataWriterBuilder, ...)
- one for each file provider (Parquet, Avro, ORC)

Thoughts? I can help with the split if needed.

Regards
JB

On Thu, Apr 10, 2025 at 5:16 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>
> Since the 1.9.0 release candidate has been created, I would like to resurrect this PR: https://github.com/apache/iceberg/pull/12298 to ensure that we have as long a testing period as possible for it.
>
> To recap, here is what the PR does after the review rounds:
>
> Created 3 interface classes which are implemented by the file formats:
>
> ReadBuilder - Builder for reading data from data files
> AppenderBuilder - Builder for writing data to data files
> ObjectModel - Provides ReadBuilders and AppenderBuilders for a specific data file format and object model pair
>
> Updated the Parquet, Avro, and ORC implementations of these interfaces, and deprecated the old reader/writer APIs.
> Created interface classes which will be used by the actual readers/writers of the data files:
>
> AppenderBuilder - Builder for writing a file
> DataWriterBuilder - Builder for generating a data file
> PositionDeleteWriterBuilder - Builder for generating a position delete file
> EqualityDeleteWriterBuilder - Builder for generating an equality delete file
> No ReadBuilder here - the file format reader builder is reused
>
> Created a WriterBuilder class which implements the interfaces above (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) based on a provided file-format-specific AppenderBuilder.
> Created an ObjectModelRegistry which stores the available ObjectModels, and from which engines and users can request readers (ReadBuilder) and writers (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder).
> Created the appropriate ObjectModels:
>
> GenericObjectModels - for reading and writing Iceberg Records
> SparkObjectModels - for reading (vectorized and non-vectorized) and writing Spark InternalRow/ColumnarBatch objects
> FlinkObjectModels - for reading and writing Flink RowData objects
> An Arrow object model is also registered for vectorized reads of Parquet files into Arrow ColumnarBatch objects
>
> Updated the production code where the reading and writing happens to use the ObjectModelRegistry and the new reader/writer interfaces to access data files.
> Kept the testing code intact to ensure that the new API/code does not break anything.
>
> The original change was not small, and it grew substantially during the review rounds. So if you have questions, or if I can do anything to make the review easier, don't hesitate to ask. I am happy to do anything to move this forward.
>
> Thanks,
> Peter
>
> On Wed, Mar 26, 2025 at 14:54 Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>
>> Hi everyone,
>>
>> I have updated the File Format API PR (https://github.com/apache/iceberg/pull/12298) based on the answers and review comments.
>>
>> I would like to merge this only after the 1.9.0 release, so we have more time to find and solve any issues before this goes into a release for the users.
>>
>> For this I have updated the deprecation comments accordingly.
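>>
>> To make the intended usage a bit more concrete for reviewers, here is a rough sketch of how an engine could obtain a reader through the registry. The names follow the recap above, but the exact signatures in the PR may differ, so please treat this as an illustration only:
>>
>>   // Hypothetical usage sketch - names follow the recap above, the exact
>>   // signatures in the PR may differ.
>>   // Request the reader registered for the Parquet format and the "generic"
>>   // object model (Iceberg Records), configure it, then read the rows.
>>   try (CloseableIterable<Record> records =
>>       ObjectModelRegistry.readBuilder(FileFormat.PARQUET, "generic", inputFile)
>>           .project(projectionSchema)   // expected read schema
>>           .split(start, length)        // byte range of the file to read
>>           .build()) {
>>     for (Record record : records) {
>>       process(record);                 // placeholder for engine-side handling
>>     }
>>   }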
>>
>> I would like to ask you to review the PR, so we can iron out any requested changes and be ready to merge as soon as possible after the 1.9.0 release.
>>
>> Thanks,
>> Peter
>>
>> On Fri, Mar 21, 2025 at 14:32 Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>
>>> Hi Renjie,
>>>
>>> > 1. File format filters
>>> >
>>> > Do the filters include filter expressions from both the user query and the delete filter?
>>>
>>> The current discussion is about the filters from the user query.
>>>
>>> About the delete filter:
>>> Based on the suggestions on the PR, I have moved the delete filter out of the main API and created a `SupportsDeleteFilter` interface for it, which would allow pushing the filter down to the Parquet vectorized readers in Spark, as that is the only place where we currently implement this feature.
>>>
>>> On Fri, Mar 21, 2025 at 14:11 Renjie Liu <liurenjie2...@gmail.com> wrote:
>>>>
>>>> Hi, Peter:
>>>>
>>>> Thanks for the effort on this.
>>>>
>>>> 1. File format filters
>>>>
>>>> Do the filters include filter expressions from both the user query and the delete filter?
>>>>
>>>> For filters from the user query, I agree with you that we should keep the current behavior.
>>>>
>>>> For delete filters associated with data files, at first I thought file format readers should not care about this. But now I realize that maybe we need to push them down to the file reader as well: this is useful when the `IS_DELETED` metadata column is not necessary and we could use these filters (position deletes, etc.) to further prune data.
>>>>
>>>> But anyway, I agree that we could postpone this to a follow-up PR.
>>>>
>>>> 2. Batch size configuration
>>>>
>>>> I'm leaning toward option 2.
>>>>
>>>> 3. Spark configuration
>>>>
>>>> I'm leaning towards using different configuration objects.
>>>>
>>>> On Thu, Mar 20, 2025 at 10:23 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>
>>>>> Hi Team,
>>>>> Thanks everyone for the reviews on https://github.com/apache/iceberg/pull/12298!
>>>>> I have addressed most of the comments, but a few questions remain which might merit a wider audience:
>>>>>
>>>>> We should decide on the expected filtering behavior when filters are pushed down to the readers. Currently the filters are applied on a best-effort basis by the file format readers. Some readers (Avro) just skip them altogether. There was a suggestion on the PR that we might enforce stricter requirements: the readers would either reject part of the filters, or apply the accepted ones fully.
>>>>> Batch sizes are currently parameters of the reader builders, and they can be set for non-vectorized readers too, which could be confusing.
>>>>> Currently the Spark batch reader uses different configuration objects, ParquetBatchReadConf and OrcBatchReadConf, as requested by the reviewers of the Comet PR. There was a suggestion on the current PR to use a common configuration instead.
>>>>>
>>>>> I would be interested in hearing your thoughts about these topics.
>>>>>
>>>>> My current take:
>>>>>
>>>>> File format filters: I am leaning towards keeping the current lenient behavior, especially since Bloom filters are not able to do full filtering and are often used as a way to filter out unwanted records.
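>>>>> To illustrate what the lenient contract means for the caller, a rough sketch (readBuilder stands for the PR's ReadBuilder; Evaluator and CloseableIterable are the existing Iceberg classes): the reader may prune row groups using statistics or Bloom filters, but gives no row-level guarantee, so the caller still applies the full expression:
>>>>>
>>>>>   // Best-effort pushdown: the reader may apply some, all, or none of the filter.
>>>>>   CloseableIterable<Record> rows = readBuilder.filter(expr).build();
>>>>>   // Exact residual filtering stays on the engine side.
>>>>>   Evaluator evaluator = new Evaluator(schema.asStruct(), expr);
>>>>>   CloseableIterable<Record> filtered =
>>>>>       CloseableIterable.filter(rows, evaluator::eval);
>>>>>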
>>>>> Another option would be to implement the secondary, exact filtering inside the file formats themselves, which I think would cause extra complexity and possible code duplication. Whatever the decision here, I would suggest moving this out to a follow-up PR, as the current changeset is big enough as it is.
>>>>> Batch size configuration: Currently this is the only property which differs between the batch readers and the non-vectorized readers. I see 3 possible solutions:
>>>>>
>>>>> Create different builders for vectorized and non-vectorized reads - I don't think the current solution is confusing enough to be worth the extra class
>>>>> We could put this into the reader configuration property set - This could work, but it would "hide" a configuration option which is valid for both the Parquet and ORC readers
>>>>> We could keep things as they are now - I would choose this one, but I don't have a strong opinion here
>>>>>
>>>>> Spark configuration: TBH, I'm open to both solutions and happy to move in the direction the community decides on.
>>>>>
>>>>> Thanks,
>>>>> Peter
>>>>>
>>>>> On Fri, Mar 14, 2025 at 16:31 Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>
>>>>>> Hi Peter
>>>>>>
>>>>>> Thanks for the update. I will do a new pass on the PR.
>>>>>>
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>> On Thu, Mar 13, 2025 at 1:16 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>> >
>>>>>> > Hi Team,
>>>>>> > I have rebased the File Format API proposal (https://github.com/apache/iceberg/pull/12298) to include the new changes needed for the Variant types. I would love to hear your feedback, especially Dan and Ryan, as you were the most active during our discussions. If I can help in any way to make the review easier, please let me know.
>>>>>> > Thanks,
>>>>>> > Peter
>>>>>> >
>>>>>> > On Fri, Feb 28, 2025 at 17:50 Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>> >>
>>>>>> >> Hi everyone,
>>>>>> >> Thanks for all of the actionable, relevant feedback on the PR (https://github.com/apache/iceberg/pull/12298).
>>>>>> >> I have updated the code to address most of it. Please check if you agree with the general approach.
>>>>>> >> If there is a consensus about the general approach, I could separate the PR into smaller pieces so we can have an easier time reviewing and merging those step-by-step.
>>>>>> >> Thanks,
>>>>>> >> Peter
>>>>>> >>
>>>>>> >> On Thu, Feb 20, 2025 at 14:14 Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>> >>>
>>>>>> >>> Hi Peter
>>>>>> >>>
>>>>>> >>> sorry for the late reply on this.
>>>>>> >>>
>>>>>> >>> I did a pass on the proposal; it's very interesting and well written. I like the DataFile API, and it is definitely worth discussing all together.
>>>>>> >>>
>>>>>> >>> Maybe we can schedule a specific meeting to discuss the DataFile API?
>>>>>> >>>
>>>>>> >>> Thoughts?
>>>>>> >>>
>>>>>> >>> Regards
>>>>>> >>> JB
>>>>>> >>>
>>>>>> >>> On Tue, Feb 11, 2025 at 5:46 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>> >>> >
>>>>>> >>> > Hi Team,
>>>>>> >>> >
>>>>>> >>> > As mentioned earlier on our Community Sync, I am exploring the possibility of defining a FileFormat API for accessing different file formats. I have put together a proposal based on my findings.
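>>>>>> >>> >
>>>>>> >>> > As a rough illustration of the direction, a format-level reader builder could look something like this (a hypothetical shape for discussion only - the proposal documents the real details; Schema, Expression, and CloseableIterable are the existing Iceberg types):
>>>>>> >>> >
>>>>>> >>> >   // Hypothetical sketch, not the proposal's actual code.
>>>>>> >>> >   public interface ReadBuilder<D, B extends ReadBuilder<D, B>> {
>>>>>> >>> >     B project(Schema schema);          // columns to read
>>>>>> >>> >     B split(long start, long length);  // byte range within the file
>>>>>> >>> >     B filter(Expression filter);       // best-effort filter pushdown
>>>>>> >>> >     CloseableIterable<D> build();      // rows in the target object model
>>>>>> >>> >   }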
>>>>>> >>> > >>>>>> >>> > ------------------- >>>>>> >>> > Iceberg currently supports 3 different file formats: Avro, >>>>>> >>> > Parquet, ORC. With the introduction of Iceberg V3 specification >>>>>> >>> > many new features are added to Iceberg. Some of these features >>>>>> >>> > like new column types, default values require changes at the file >>>>>> >>> > format level. The changes are added by individual developers with >>>>>> >>> > different focus on the different file formats. As a result not all >>>>>> >>> > of the features are available for every supported file format. >>>>>> >>> > Also there are emerging file formats like Vortex [1] or Lance [2] >>>>>> >>> > which either by specialization, or by applying newer research >>>>>> >>> > results could provide better alternatives for certain use-cases >>>>>> >>> > like random access for data, or storing ML models. >>>>>> >>> > ------------------- >>>>>> >>> > >>>>>> >>> > Please check the detailed proposal [3] and the google document >>>>>> >>> > [4], and comment there or reply on the dev list if you have any >>>>>> >>> > suggestions. >>>>>> >>> > >>>>>> >>> > Thanks, >>>>>> >>> > Peter >>>>>> >>> > >>>>>> >>> > [1] - https://github.com/spiraldb/vortex >>>>>> >>> > [2] - https://lancedb.github.io/lance/ >>>>>> >>> > [3] - https://github.com/apache/iceberg/issues/12225 >>>>>> >>> > [4] - >>>>>> >>> > https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds >>>>>> >>> >