Hi Renjie, The first one for the proposed new API is here: https://github.com/apache/iceberg/pull/12774 Thanks, Peter
On Wed, Apr 16, 2025, 05:40 Renjie Liu <liurenjie2...@gmail.com> wrote: > Hi, Peter: > > Thanks for the effort. I totally agree with splitting them into smaller > PRs to move forward. > > I'm quite interested in this topic, so please ping me on those split > PRs and I'll help review. > > On Mon, Apr 14, 2025 at 11:22 PM Jean-Baptiste Onofré <j...@nanthrax.net> > wrote: > >> Hi Peter >> >> Awesome! Thank you so much! >> I will do a new pass. >> >> Regards >> JB >> >> On Fri, Apr 11, 2025 at 3:48 PM Péter Váry <peter.vary.apa...@gmail.com> >> wrote: >> > >> > Hi JB, >> > >> > I separated out the proposed interfaces into a new PR: >> https://github.com/apache/iceberg/pull/12774. >> > Reviewers can check that out if they are only interested in how the new >> API would look. >> > >> > Thanks, >> > Peter >> > >> > Jean-Baptiste Onofré <j...@nanthrax.net> wrote (Apr 10, 2025, Thu 18:25): >> >> >> >> Hi Peter >> >> >> >> Thanks for the ping about the PR. >> >> >> >> Maybe, to facilitate the review and move forward faster, we should >> >> split the PR into smaller PRs: >> >> - one with the interfaces (ReadBuilder, AppenderBuilder, ObjectModel, >> >> AppenderBuilder, DataWriterBuilder, ...) >> >> - one for each file format provider (Parquet, Avro, ORC) >> >> >> >> Thoughts? I can help with the split if needed. >> >> >> >> Regards >> >> JB >> >> >> >> On Thu, Apr 10, 2025 at 5:16 AM Péter Váry < >> peter.vary.apa...@gmail.com> wrote: >> >> > >> >> > Since the 1.9.0 release candidate has been created, I would like to >> resurrect this PR: https://github.com/apache/iceberg/pull/12298 to >> ensure that we have as long a testing period as possible for it.
>> >> > >> >> > To recap, here is what the PR does after the review rounds: >> >> > >> >> > Created 3 interface classes which are implemented by the file >> formats: >> >> > >> >> > ReadBuilder - Builder for reading data from data files >> >> > AppenderBuilder - Builder for writing data to data files >> >> > ObjectModel - Provides ReadBuilders and AppenderBuilders for a >> specific data file format and object model pair >> >> > >> >> > Updated the Parquet, Avro and ORC implementations for these interfaces, >> and deprecated the old reader/writer APIs >> >> > Created interface classes which will be used by the actual >> readers/writers of the data files: >> >> > >> >> > AppenderBuilder - Builder for writing a file >> >> > DataWriterBuilder - Builder for generating a data file >> >> > PositionDeleteWriterBuilder - Builder for generating a position >> delete file >> >> > EqualityDeleteWriterBuilder - Builder for generating an equality >> delete file >> >> > No ReadBuilder here - the file format reader builder is reused >> >> > >> >> > Created a WriterBuilder class which implements the interfaces above >> (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) >> based on a provided file format specific AppenderBuilder >> >> > Created an ObjectModelRegistry which stores the available >> ObjectModels, and from which engines and users can request the readers (ReadBuilder) >> and writers >> (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder).
>> >> > Created the appropriate ObjectModels: >> >> > >> >> > GenericObjectModels - for reading and writing Iceberg Records >> >> > SparkObjectModels - for reading (vectorized and non-vectorized) and >> writing Spark InternalRow/ColumnarBatch objects >> >> > FlinkObjectModels - for reading and writing Flink RowData objects >> >> > An Arrow object model is also registered for vectorized reads of >> Parquet files into Arrow ColumnarBatch objects >> >> > >> >> > Updated the production code where the reading and writing happens to >> use the ObjectModelRegistry and the new reader/writer interfaces to access >> data files >> >> > Kept the testing code intact to ensure that the new API/code does not >> break anything >> >> > >> >> > The original change was not small, and it grew substantially during the >> review rounds. So if you have questions, or if I can do anything to make the >> review easier, don't hesitate to ask. I am happy to do anything to move >> this forward. >> >> > >> >> > Thanks, >> >> > Peter >> >> > >> >> > Péter Váry <peter.vary.apa...@gmail.com> wrote (Mar 26, 2025, Wed 14:54): >> >> >> >> >> >> Hi everyone, >> >> >> >> >> >> I have updated the File Format API PR ( >> https://github.com/apache/iceberg/pull/12298) based on the answers and >> review comments. >> >> >> >> >> >> I would like to merge this only after the 1.9.0 release so we have >> more time to find and solve any issues before this reaches users in a release. >> >> >> >> >> >> For this I have updated the deprecation comments accordingly. >> >> >> I would like to ask you to review the PR, so we can iron out any >> requested changes and be ready for the merge as soon as possible >> after the 1.9.0 release. >> >> >> >> >> >> Thanks, >> >> >> Peter >> >> >> >> >> >> Péter Váry <peter.vary.apa...@gmail.com> wrote (Mar 21, 2025, Fri 14:32): >> >> >>> >> >> >>> Hi Renjie, >> >> >>> >> >> >>> > 1.
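The builder/ObjectModel split recapped above can be sketched roughly as follows. This is a deliberately simplified, hypothetical shape for the discussion, not the actual signatures from PR #12298: records are plain Strings and "files" are in-memory lists here, while the real interfaces work with Iceberg schemas, expressions, and InputFile/OutputFile abstractions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of the format-level builder split described above.
public class FileFormatApiSketch {

  // Format-level builder creating readers for data files.
  interface ReadBuilder<D> {
    ReadBuilder<D> filter(Predicate<D> expr); // best-effort pushdown
    List<D> build();                          // stand-in for the reader
  }

  // Format-level builder creating appenders for data files.
  interface AppenderBuilder<D> {
    AppenderBuilder<D> add(D record);         // stand-in for writing a record
    List<D> build();                          // the written "file"
  }

  // Pairs a file format with an engine object model and hands out builders.
  interface ObjectModel<D> {
    String format();                          // e.g. "parquet"
    String name();                            // e.g. "generic", "spark", "flink"
    ReadBuilder<D> readBuilder(List<D> inputFile);
    AppenderBuilder<D> appenderBuilder(List<D> outputFile);
  }

  // Trivial in-memory model: reads and writes Strings, applying the filter eagerly.
  static class InMemoryModel implements ObjectModel<String> {
    @Override public String format() { return "in-memory"; }
    @Override public String name() { return "generic"; }

    @Override public ReadBuilder<String> readBuilder(List<String> inputFile) {
      return new ReadBuilder<String>() {
        private Predicate<String> filter = record -> true;
        @Override public ReadBuilder<String> filter(Predicate<String> expr) {
          this.filter = expr;
          return this;
        }
        @Override public List<String> build() {
          List<String> matches = new ArrayList<>();
          for (String record : inputFile) {
            if (filter.test(record)) {
              matches.add(record);
            }
          }
          return matches;
        }
      };
    }

    @Override public AppenderBuilder<String> appenderBuilder(List<String> outputFile) {
      return new AppenderBuilder<String>() {
        @Override public AppenderBuilder<String> add(String record) {
          outputFile.add(record);
          return this;
        }
        @Override public List<String> build() { return outputFile; }
      };
    }
  }

  public static void main(String[] args) {
    ObjectModel<String> model = new InMemoryModel();

    List<String> file = new ArrayList<>();
    model.appenderBuilder(file).add("a1").add("b2").add("a3").build();

    List<String> matches =
        model.readBuilder(file).filter(s -> s.startsWith("a")).build();
    System.out.println(matches); // rows surviving the pushed-down filter
  }
}
```

The point of the split is that a file format only has to provide the two builders per object model; all writer-level variants (data, position delete, equality delete) are layered on top of the single format-level AppenderBuilder.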
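The ObjectModelRegistry lookup pattern could look roughly like the sketch below. The class name comes from the thread, but the key shape, method names, and the Supplier-based factories are simplified stand-ins, not the PR's API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Simplified sketch of a registry keyed by (file format, object model name),
// from which engines request reader/writer builders. Builders are modeled as
// plain Supplier<String> factories to keep the example self-contained.
public class ObjectModelRegistrySketch {

  private static final Map<String, Supplier<String>> MODELS = new HashMap<>();

  private static String key(String format, String modelName) {
    return format + "/" + modelName;
  }

  // Called once per (format, object model) pair, e.g. at class initialization.
  public static void register(String format, String modelName, Supplier<String> factory) {
    MODELS.put(key(format, modelName), factory);
  }

  // Engines look up the builder factory matching their object model.
  public static Supplier<String> lookup(String format, String modelName) {
    Supplier<String> factory = MODELS.get(key(format, modelName));
    if (factory == null) {
      throw new IllegalArgumentException(
          "No object model " + modelName + " registered for format " + format);
    }
    return factory;
  }

  public static void main(String[] args) {
    // Registrations mirroring the ones listed above (descriptions are illustrative).
    register("parquet", "generic", () -> "parquet reader for Iceberg Records");
    register("parquet", "spark", () -> "parquet reader for Spark InternalRow");
    register("avro", "flink", () -> "avro reader for Flink RowData");

    System.out.println(lookup("parquet", "spark").get());
  }
}
```

A registry like this is what lets production code ask for "Parquet for Spark" or "Avro for Flink" without hard-wiring format-specific reader classes at the call sites.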
File format filters >> >> >>> > >> >> >>> > Do the filters include filter expressions from both the user >> query and the delete filter? >> >> >>> >> >> >>> The current discussion is about the filters from the user query. >> >> >>> >> >> >>> About the delete filter: >> >> >>> Based on the suggestions on the PR, I have moved the delete filter >> out of the main API. I created a `SupportsDeleteFilter` interface for it, >> which would allow pushing down the filter to Parquet vectorized readers >> in Spark, as this is the only place where we have currently implemented this >> feature. >> >> >>> >> >> >>> >> >> >>> Renjie Liu <liurenjie2...@gmail.com> wrote (Mar 21, 2025, Fri 14:11): >> >> >>>> >> >> >>>> Hi, Peter: >> >> >>>> >> >> >>>> Thanks for the effort on this. >> >> >>>> >> >> >>>> 1. File format filters >> >> >>>> >> >> >>>> Do the filters include filter expressions from both the user >> query and the delete filter? >> >> >>>> >> >> >>>> For filters from the user query, I agree with you that we should keep >> the current behavior. >> >> >>>> >> >> >>>> For delete filters associated with data files, at first I thought >> file format readers should not care about them. But now I realize that >> maybe we need to push them to the file reader as well; this is useful when >> the `IS_DELETED` metadata column is not necessary and we could use these >> filters (position deletes, etc.) to further prune data. >> >> >>>> >> >> >>>> But anyway, I agree that we could postpone it to a follow-up PR. >> >> >>>> >> >> >>>> 2. Batch size configuration >> >> >>>> >> >> >>>> I'm leaning toward option 2. >> >> >>>> >> >> >>>> 3. Spark configuration >> >> >>>> >> >> >>>> I'm leaning towards using different configuration objects. >> >> >>>> >> >> >>>> >> >> >>>> >> >> >>>> On Thu, Mar 20, 2025 at 10:23 PM Péter Váry < >> peter.vary.apa...@gmail.com> wrote: >> >> >>>>> >> >> >>>>> Hi Team, >> >> >>>>> Thanks everyone for the reviews on >> https://github.com/apache/iceberg/pull/12298!
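The `SupportsDeleteFilter` mix-in mentioned above could take a shape like the following. Only the interface name comes from the thread; the `deleteFilter` signature, the `IntPredicate` payload, and the instanceof-probing are illustrative assumptions.

```java
import java.util.function.IntPredicate;

// Sketch of an optional capability interface for reader builders. A builder
// that can apply delete filters internally implements it; callers probe with
// instanceof, so file formats without the capability need no changes at all.
public class SupportsDeleteFilterSketch {

  interface ReadBuilder {
    String build();
  }

  // Optional mix-in: the delete filter is pushed down only if supported.
  interface SupportsDeleteFilter {
    void deleteFilter(IntPredicate isLiveRow); // illustrative signature
  }

  // Stands in for a Spark vectorized Parquet reader builder.
  static class VectorizedParquetLikeBuilder implements ReadBuilder, SupportsDeleteFilter {
    private IntPredicate isLiveRow = pos -> true;

    @Override public void deleteFilter(IntPredicate isLiveRow) {
      this.isLiveRow = isLiveRow;
    }

    @Override public String build() {
      // Report whether row position 42 survives the pushed-down delete filter.
      return "vectorized reader, delete-filter aware: " + isLiveRow.test(42);
    }
  }

  static String plan(ReadBuilder builder, IntPredicate isLiveRow) {
    if (builder instanceof SupportsDeleteFilter) {
      ((SupportsDeleteFilter) builder).deleteFilter(isLiveRow); // pushdown
    }
    return builder.build();
  }

  public static void main(String[] args) {
    // Position 42 is deleted in this toy example.
    System.out.println(plan(new VectorizedParquetLikeBuilder(), pos -> pos != 42));
  }
}
```

Keeping the capability out of the core ReadBuilder matches the thread's point that Spark's vectorized Parquet path is currently the only implementation of this feature.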
>> >> >>>>> I have addressed most of the comments, but a few questions still >> remain which might merit a bit wider audience: >> >> >>>>> >> >> >>>>> We should decide on the expected filtering behavior when the >> filters are pushed down to the readers. Currently the filters are applied >> on a best-effort basis by the file format readers. Some readers (Avro) just skip >> them altogether. There was a suggestion on the PR that we might enforce >> stricter requirements, where the readers either reject part of the filters >> or apply them fully. >> >> >>>>> Batch sizes are currently parameters for the reader builders, >> which can be set for non-vectorized readers too, which could be confusing. >> >> >>>>> Currently the Spark batch reader uses different configuration >> objects, ParquetBatchReadConf and OrcBatchReadConf, as requested by the >> reviewers of the Comet PR. There was a suggestion on the current PR to use >> a common configuration instead. >> >> >>>>> >> >> >>>>> I would be interested in hearing your thoughts about these >> topics. >> >> >>>>> >> >> >>>>> My current take: >> >> >>>>> >> >> >>>>> File format filters: I am leaning towards keeping the current >> lenient behavior, especially since Bloom filters are not able to do full >> filtering and are often used as a way to filter out unwanted records. >> Another option would be to implement a secondary filtering step inside the file >> formats themselves, which I think would cause extra complexity and possible >> code duplication. Whatever the decision here, I would suggest moving this >> out to a next PR, as the current changeset is big enough as it is. >> >> >>>>> Batch size configuration: Currently this is the only property >> which differs between the batch readers and the non-vectorized readers.
I >> see 3 possible solutions: >> >> >>>>> >> >> >>>>> Create different builders for vectorized and non-vectorized >> reads - I don't think the current solution is confusing enough to be worth the >> extra class >> >> >>>>> We could put this into the reader configuration property set - >> This could work, but it would "hide" a configuration option which is valid >> for both Parquet and ORC readers >> >> >>>>> We could keep things as they are now - I would choose this one, but >> I don't have a strong opinion here >> >> >>>>> >> >> >>>>> Spark configuration: TBH, I'm open to both solutions and happy to >> move in the direction the community decides on >> >> >>>>> >> >> >>>>> Thanks, >> >> >>>>> Peter >> >> >>>>> >> >> >>>>> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (Mar 14, 2025, Fri 16:31): >> >> >>>>>> >> >> >>>>>> Hi Peter >> >> >>>>>> >> >> >>>>>> Thanks for the update. I will do a new pass on the PR. >> >> >>>>>> >> >> >>>>>> Regards >> >> >>>>>> JB >> >> >>>>>> >> >> >>>>>> On Thu, Mar 13, 2025 at 1:16 PM Péter Váry < >> peter.vary.apa...@gmail.com> wrote: >> >> >>>>>> > >> >> >>>>>> > Hi Team, >> >> >>>>>> > I have rebased the File Format API proposal ( >> https://github.com/apache/iceberg/pull/12298) to include the new changes >> needed for the Variant types. I would love to hear your feedback, >> especially Dan and Ryan, as you were the most active during our >> discussions. If I can help in any way to make the review easier, please let >> me know. >> >> >>>>>> > Thanks, >> >> >>>>>> > Peter >> >> >>>>>> > >> >> >>>>>> > Péter Váry <peter.vary.apa...@gmail.com> wrote (Feb 28, 2025, Fri 17:50): >> >> >>>>>> >> >> >> >>>>>> >> Hi everyone, >> >> >>>>>> >> Thanks for all of the actionable, relevant feedback on the >> PR (https://github.com/apache/iceberg/pull/12298). >> >> >>>>>> >> I have updated the code to address most of it. Please check if >> you agree with the general approach.
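The "lenient" filtering behavior discussed earlier in the thread rests on the engine re-applying the exact filter on top of whatever the reader returns, the same way Bloom filters and min/max statistics are used: they can prove a value is absent, never that it is present. A toy illustration of that contract (assumed semantics for the discussion, not PR code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Toy illustration of best-effort pushdown: the "reader" prunes row groups
// using min/max stats (it may keep false positives), and the engine applies
// the exact predicate afterwards. Correctness never depends on the reader
// honoring the filter, which is why a reader like Avro can skip it entirely.
public class LenientFilterSketch {

  record RowGroup(int min, int max, List<Integer> rows) {}

  // Best-effort pruning: skip a row group only if stats prove no row can match.
  static List<Integer> readWithPruning(List<RowGroup> groups, int wanted) {
    List<Integer> out = new ArrayList<>();
    for (RowGroup g : groups) {
      if (wanted < g.min() || wanted > g.max()) {
        continue; // provably no match: safe to skip
      }
      out.addAll(g.rows()); // may still contain non-matching rows
    }
    return out;
  }

  // The engine's exact filter, applied on top of whatever the reader returns.
  static List<Integer> exactFilter(List<Integer> rows, IntPredicate predicate) {
    List<Integer> out = new ArrayList<>();
    for (int row : rows) {
      if (predicate.test(row)) {
        out.add(row);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<RowGroup> groups = List.of(
        new RowGroup(1, 5, List.of(1, 3, 5)),
        new RowGroup(10, 20, List.of(10, 15, 20)));

    List<Integer> pruned = readWithPruning(groups, 15); // keeps second group whole
    List<Integer> result = exactFilter(pruned, r -> r == 15);
    System.out.println(pruned + " -> " + result);
  }
}
```

Under a stricter contract, the reader would instead have to report which filters it applied fully so the engine could skip the second pass; the lenient contract avoids that bookkeeping at the cost of always re-filtering.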
>> >> >>>>>> >> If there is a consensus about the general approach, I could >> separate the PR into smaller pieces so we can have an easier time >> reviewing and merging those step-by-step. >> >> >>>>>> >> Thanks, >> >> >>>>>> >> Peter >> >> >>>>>> >> >> >> >>>>>> >> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (Feb 20, 2025, Thu 14:14): >> >> >>>>>> >>> >> >> >>>>>> >>> Hi Peter >> >> >>>>>> >>> >> >> >>>>>> >>> Sorry for the late reply on this. >> >> >>>>>> >>> >> >> >>>>>> >>> I did a pass on the proposal; it's very interesting and >> well written. >> >> >>>>>> >>> I like the DataFile API, and it is definitely worth discussing all >> together. >> >> >>>>>> >>> >> >> >>>>>> >>> Maybe we can schedule a specific meeting to discuss the >> DataFile API? >> >> >>>>>> >>> >> >> >>>>>> >>> Thoughts? >> >> >>>>>> >>> >> >> >>>>>> >>> Regards >> >> >>>>>> >>> JB >> >> >>>>>> >>> >> >> >>>>>> >>> On Tue, Feb 11, 2025 at 5:46 PM Péter Váry < >> peter.vary.apa...@gmail.com> wrote: >> >> >>>>>> >>> > >> >> >>>>>> >>> > Hi Team, >> >> >>>>>> >>> > >> >> >>>>>> >>> > As mentioned earlier on our Community Sync, I am exploring >> the possibility of defining a FileFormat API for accessing different file >> formats. I have put together a proposal based on my findings. >> >> >>>>>> >>> > >> >> >>>>>> >>> > ------------------- >> >> >>>>>> >>> > Iceberg currently supports 3 different file formats: >> Avro, Parquet, ORC. With the introduction of the Iceberg V3 specification, many >> new features are being added to Iceberg. Some of these features, like new column >> types and default values, require changes at the file format level. The changes >> are added by individual developers with different focus on the different >> file formats. As a result, not all of the features are available for every >> supported file format.
>> >> >>>>>> >>> > Also, there are emerging file formats like Vortex [1] and >> Lance [2] which, either by specialization or by applying newer research >> results, could provide better alternatives for certain use cases like random >> access to data, or storing ML models. >> >> >>>>>> >>> > ------------------- >> >> >>>>>> >>> > >> >> >>>>>> >>> > Please check the detailed proposal [3] and the Google >> document [4], and comment there or reply on the dev list if you have any >> suggestions. >> >> >>>>>> >>> > >> >> >>>>>> >>> > Thanks, >> >> >>>>>> >>> > Peter >> >> >>>>>> >>> > >> >> >>>>>> >>> > [1] - https://github.com/spiraldb/vortex >> >> >>>>>> >>> > [2] - https://lancedb.github.io/lance/ >> >> >>>>>> >>> > [3] - https://github.com/apache/iceberg/issues/12225 >> >> >>>>>> >>> > [4] - >> https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds >> >> >>>>>> >>> > >> >