Since the 1.9.0 release candidate has been created, I would like to resurrect this PR: https://github.com/apache/iceberg/pull/12298 to ensure that we have as long a testing period as possible for it.
To recap, here is what the PR does after the review rounds:

- *Created 3 interface classes which are implemented by the file formats*:
   - ReadBuilder - builder for reading data from data files
   - AppenderBuilder - builder for writing data to data files
   - ObjectModel - provides ReadBuilders and AppenderBuilders for a specific data file format and object model pair
- *Updated the Parquet, Avro, and ORC implementations* for these interfaces, and deprecated the old reader/writer APIs
- *Created interface classes which will be used by the actual readers/writers of the data files*:
   - AppenderBuilder - builder for writing a file
   - DataWriterBuilder - builder for generating a data file
   - PositionDeleteWriterBuilder - builder for generating a position delete file
   - EqualityDeleteWriterBuilder - builder for generating an equality delete file
   - No ReadBuilder here - the file format reader builder is reused
- *Created a WriterBuilder class which implements the interfaces above* (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) based on a provided file format specific AppenderBuilder
- *Created an ObjectModelRegistry which stores the available ObjectModels*, and from which engines and users can request the readers (ReadBuilder) and writers (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder)
- *Created the appropriate ObjectModels*:
   - GenericObjectModels - for reading and writing Iceberg Records
   - SparkObjectModels - for reading (vectorized and non-vectorized) and writing Spark InternalRow/ColumnarBatch objects
   - FlinkObjectModels - for reading and writing Flink RowData objects
   - An Arrow object model is also registered for vectorized reads of Parquet files into Arrow ColumnarBatch objects
- *Updated the production code where the reading and writing happens* to use the ObjectModelRegistry and the new reader/writer interfaces to access data files
- *Kept the testing code intact* to ensure that the new API/code is not breaking anything
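To give a feel for how the pieces fit together, here is a minimal usage
sketch. The method names, parameters, and the "generic" object model name
below are illustrative assumptions based on the recap above, not the final
API - please refer to the PR for the actual signatures:

    import org.apache.iceberg.FileFormat;
    import org.apache.iceberg.data.Record;
    import org.apache.iceberg.io.CloseableIterable;

    // Hypothetical sketch: given an inputFile, a projection schema, and a
    // residual filter expression, ask the registry for a ReadBuilder for a
    // (file format, object model) pair and build a reader from it.
    CloseableIterable<Record> records =
        ObjectModelRegistry.readBuilder(FileFormat.PARQUET, "generic", inputFile)
            .project(projectionSchema)   // columns to read
            .split(start, length)        // read only a slice of the file
            .filter(filterExpression)    // best-effort filter pushdown
            .build();

    // Writers are requested the same way; for example, a data file writer:
    // ObjectModelRegistry.writerBuilder(FileFormat.PARQUET, "generic", outputFile)
    //     .schema(tableSchema)
    //     .build();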
The original change was not small, and grew substantially during the review rounds. So if you have questions, or I can do anything to make the review easier, don't hesitate to ask. I am happy to do anything to move this forward.

Thanks,
Peter

Péter Váry <peter.vary.apa...@gmail.com> wrote (on Wed, Mar 26, 2025, 14:54):

> Hi everyone,
>
> I have updated the File Format API PR (
> https://github.com/apache/iceberg/pull/12298) based on the answers and
> review comments.
>
> I would like to merge this only after the 1.9.0 release so we have more
> time to find and solve any issues before this goes out to users in a
> release.
>
> For this I have updated the deprecation comments accordingly.
> I would like to ask you to review the PR, so we can iron out any
> requested changes and be ready to merge as soon as possible after the
> 1.9.0 release.
>
> Thanks,
> Peter
>
> Péter Váry <peter.vary.apa...@gmail.com> wrote (on Fri, Mar 21, 2025, 14:32):
>
>> Hi Renjie,
>>
>> > *1. File format filters*
>> > Do the filters include filter expressions from both the user query
>> > and the delete filter?
>>
>> The current discussion is about the filters from the user query.
>>
>> About the delete filter:
>> Based on the suggestions on the PR, I have moved the delete filter out
>> from the main API. Created a `SupportsDeleteFilter` interface for it,
>> which would allow pushing the filter down to the Parquet vectorized
>> readers in Spark, as this is the only place where we have currently
>> implemented this feature.
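>>
>> As a rough sketch of the shape this takes (the method name and the
>> parameter type here are illustrative assumptions, not the final
>> signature):
>>
>>   // Hypothetical sketch: a file format ReadBuilder that can prune rows
>>   // using row-level deletes implements this alongside the main interface.
>>   public interface SupportsDeleteFilter<B> {
>>     // Attach the delete filter so the reader can drop deleted rows while
>>     // decoding, e.g. in Spark's vectorized Parquet read path.
>>     B deleteFilter(DeleteFilter<?> deleteFilter);
>>   }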
>>
>> Renjie Liu <liurenjie2...@gmail.com> wrote (on Fri, Mar 21, 2025, 14:11):
>>
>>> Hi, Peter:
>>>
>>> Thanks for the effort on this.
>>>
>>> *1. File format filters*
>>>
>>> Do the filters include filter expressions from both the user query and
>>> the delete filter?
>>>
>>> For filters from the user query, I agree with you that we should keep
>>> the current behavior.
>>>
>>> For delete filters associated with data files, at first I thought file
>>> format readers should not care about this. But now I realized that
>>> maybe we also need to push it to the file reader; this is useful when
>>> the `IS_DELETED` metadata column is not necessary and we could use
>>> these filters (position deletes, etc.) to further prune data.
>>>
>>> But anyway, I agree that we could postpone it to a follow-up PR.
>>>
>>> 2. *Batch size configuration*
>>>
>>> I'm leaning toward option 2.
>>>
>>> 3. *Spark configuration*
>>>
>>> I'm leaning towards using different configuration objects.
>>>
>>> On Thu, Mar 20, 2025 at 10:23 PM Péter Váry <peter.vary.apa...@gmail.com>
>>> wrote:
>>>
>>>> Hi Team,
>>>> Thanks everyone for the reviews on
>>>> https://github.com/apache/iceberg/pull/12298!
>>>> I have addressed most of the comments, but a few questions still
>>>> remain which might merit a wider audience:
>>>>
>>>>    1. We should decide on the expected filtering behavior when the
>>>>    filters are pushed down to the readers. Currently the filters are
>>>>    applied as best effort by the file format readers. Some readers
>>>>    (Avro) just skip them altogether. There was a suggestion on the PR
>>>>    that we might enforce stricter requirements: the readers either
>>>>    reject part of the filters, or they apply them fully.
>>>>    2. Batch sizes are currently parameters of the reader builders,
>>>>    which can be set for non-vectorized readers too, which could be
>>>>    confusing.
>>>>    3. Currently the Spark batch reader uses different configuration
>>>>    objects, ParquetBatchReadConf and OrcBatchReadConf, as requested by
>>>>    the reviewers of the Comet PR. There was a suggestion on the current
>>>>    PR to use a common configuration instead.
>>>>
>>>> I would be interested in hearing your thoughts about these topics.
>>>>
>>>> My current take:
>>>>
>>>>    1. *File format filters*: I am leaning towards keeping the current
>>>>    lenient behavior, especially since Bloom filters are not able to do
>>>>    full filtering, and are often used as a way to filter out unwanted
>>>>    records. Another option would be to implement a secondary filtering
>>>>    inside the file formats themselves, which I think would cause extra
>>>>    complexity and possible code duplication. Whatever the decision
>>>>    here, I would suggest moving this out to a next PR, as the current
>>>>    changeset is big enough as it is.
>>>>    2. *Batch size configuration*: Currently this is the only property
>>>>    which differs between the batch readers and the non-vectorized
>>>>    readers. I see 3 possible solutions (see the sketch after this
>>>>    list):
>>>>       1. Create different builders for vectorized and non-vectorized
>>>>       reads - I don't think the current solution is confusing enough to
>>>>       be worth the extra class
>>>>       2. We could put this into the reader configuration property set -
>>>>       This could work, but it would "hide" a configuration option which
>>>>       is valid for both the Parquet and ORC readers
>>>>       3. We could keep things as they are now - I would choose this
>>>>       one, but I don't have a strong opinion here
>>>>    3. *Spark configuration*: TBH, I'm open to both solutions and happy
>>>>    to move in the direction the community decides on
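>>>>
>>>> To make the batch size options concrete, here is a minimal sketch of
>>>> option 3 (current) versus option 2. The method and property names are
>>>> illustrative assumptions, not the actual API:
>>>>
>>>>   // Option 3 (current): batch size is a first-class builder method.
>>>>   // It is visible on every reader builder, even where it is a no-op.
>>>>   ObjectModelRegistry.readBuilder(FileFormat.PARQUET, "spark-vectorized", file)
>>>>       .recordsPerBatch(4096)   // ignored by non-vectorized readers
>>>>       .build();
>>>>
>>>>   // Option 2: batch size moves into the generic reader configuration
>>>>   // property set, keeping the builder surface uniform but "hiding" an
>>>>   // option that is valid for both the Parquet and ORC batch readers.
>>>>   ObjectModelRegistry.readBuilder(FileFormat.PARQUET, "spark-vectorized", file)
>>>>       .set("read.batch-size", "4096")
>>>>       .build();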
>>>>
>>>> Thanks,
>>>> Peter
>>>>
>>>> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Fri, Mar 14, 2025, 16:31):
>>>>
>>>>> Hi Peter
>>>>>
>>>>> Thanks for the update. I will do a new pass on the PR.
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On Thu, Mar 13, 2025 at 1:16 PM Péter Váry <
>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>> >
>>>>> > Hi Team,
>>>>> > I have rebased the File Format API proposal (
>>>>> https://github.com/apache/iceberg/pull/12298) to include the new
>>>>> changes needed for the Variant types. I would love to hear your
>>>>> feedback, especially Dan and Ryan, as you were the most active during
>>>>> our discussions. If I can help in any way to make the review easier,
>>>>> please let me know.
>>>>> > Thanks,
>>>>> > Peter
>>>>> >
>>>>> > Péter Váry <peter.vary.apa...@gmail.com> wrote (on Fri, Feb 28, 2025, 17:50):
>>>>> >>
>>>>> >> Hi everyone,
>>>>> >> Thanks for all of the actionable, relevant feedback on the PR (
>>>>> https://github.com/apache/iceberg/pull/12298).
>>>>> >> Updated the code to address most of them. Please check if you
>>>>> agree with the general approach.
>>>>> >> If there is a consensus about the general approach, I could
>>>>> separate the PR out into smaller pieces so we can have an easier time
>>>>> reviewing and merging those step-by-step.
>>>>> >> Thanks,
>>>>> >> Peter
>>>>> >>
>>>>> >> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Thu, Feb 20, 2025, 14:14):
>>>>> >>>
>>>>> >>> Hi Peter
>>>>> >>>
>>>>> >>> sorry for the late reply on this.
>>>>> >>>
>>>>> >>> I did a pass on the proposal, it's very interesting and well
>>>>> written. I like the DataFile API and it is definitely worth discussing
>>>>> all together.
>>>>> >>>
>>>>> >>> Maybe we can schedule a specific meeting to discuss the DataFile
>>>>> API?
>>>>> >>>
>>>>> >>> Thoughts ?
>>>>> >>>
>>>>> >>> Regards
>>>>> >>> JB
>>>>> >>>
>>>>> >>> On Tue, Feb 11, 2025 at 5:46 PM Péter Váry <
>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>> >>> >
>>>>> >>> > Hi Team,
>>>>> >>> >
>>>>> >>> > As mentioned earlier on our Community Sync, I am exploring the
>>>>> possibility of defining a FileFormat API for accessing different file
>>>>> formats. I have put together a proposal based on my findings.
>>>>> >>> >
>>>>> >>> > -------------------
>>>>> >>> > Iceberg currently supports 3 different file formats: Avro,
>>>>> Parquet, ORC. With the introduction of the Iceberg V3 specification,
>>>>> many new features are added to Iceberg. Some of these features, like
>>>>> new column types and default values, require changes at the file
>>>>> format level. The changes are added by individual developers with
>>>>> different focus on the different file formats. As a result, not all of
>>>>> the features are available for every supported file format.
>>>>> >>> > Also, there are emerging file formats like Vortex [1] or Lance
>>>>> [2] which, either by specialization or by applying newer research
>>>>> results, could provide better alternatives for certain use cases, like
>>>>> random access for data or storing ML models.
>>>>> >>> > -------------------
>>>>> >>> >
>>>>> >>> > Please check the detailed proposal [3] and the google document
>>>>> [4], and comment there or reply on the dev list if you have any
>>>>> suggestions.
>>>>> >>> >
>>>>> >>> > Thanks,
>>>>> >>> > Peter
>>>>> >>> >
>>>>> >>> > [1] - https://github.com/spiraldb/vortex
>>>>> >>> > [2] - https://lancedb.github.io/lance/
>>>>> >>> > [3] - https://github.com/apache/iceberg/issues/12225
>>>>> >>> > [4] - https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds