Hi Peter

Awesome! Thank you so much! I will do a new pass.
Regards
JB

On Fri, Apr 11, 2025 at 3:48 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>
> Hi JB,
>
> Separated out the proposed interfaces to a new PR:
> https://github.com/apache/iceberg/pull/12774.
> Reviewers can check that out if they are only interested in what the
> new API would look like.
>
> Thanks,
> Peter
>
> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Thu, Apr 10, 2025, 18:25):
>>
>> Hi Peter
>>
>> Thanks for the ping about the PR.
>>
>> Maybe, to facilitate the review and move forward faster, we should
>> split the PR into smaller PRs:
>> - one with the interfaces (ReadBuilder, AppenderBuilder, ObjectModel,
>>   DataWriterBuilder, ...)
>> - one for each file format provider (Parquet, Avro, ORC)
>>
>> Thoughts? I can help with the split if needed.
>>
>> Regards
>> JB
>>
>> On Thu, Apr 10, 2025 at 5:16 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>> >
>> > Since the 1.9.0 release candidate has been created, I would like to
>> > resurrect this PR: https://github.com/apache/iceberg/pull/12298 to
>> > ensure that we have as long a testing period as possible for it.
>> >
>> > To recap, here is what the PR does after the review rounds:
>> >
>> > Created 3 interface classes which are implemented by the file formats:
>> > - ReadBuilder - builder for reading data from data files
>> > - AppenderBuilder - builder for writing data to data files
>> > - ObjectModel - provides ReadBuilders and AppenderBuilders for a
>> >   specific data file format and object model pair
>> >
>> > Updated the Parquet, Avro and ORC implementations for these
>> > interfaces, and deprecated the old reader/writer APIs.
>> >
>> > Created interface classes which will be used by the actual
>> > readers/writers of the data files:
>> > - AppenderBuilder - builder for writing a file
>> > - DataWriterBuilder - builder for generating a data file
>> > - PositionDeleteWriterBuilder - builder for generating a position
>> >   delete file
>> > - EqualityDeleteWriterBuilder - builder for generating an equality
>> >   delete file
>> > There is no ReadBuilder here - the file format reader builder is
>> > reused.
>> >
>> > Created a WriterBuilder class which implements the interfaces above
>> > (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder)
>> > based on a provided file format specific AppenderBuilder.
>> >
>> > Created an ObjectModelRegistry which stores the available
>> > ObjectModels, and from which engines and users can request the
>> > readers (ReadBuilder) and writers
>> > (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder).
>> >
>> > Created the appropriate ObjectModels:
>> > - GenericObjectModels - for reading and writing Iceberg Records
>> > - SparkObjectModels - for reading (vectorized and non-vectorized) and
>> >   writing Spark InternalRow/ColumnarBatch objects
>> > - FlinkObjectModels - for reading and writing Flink RowData objects
>> > - An Arrow object model is also registered for vectorized reads of
>> >   Parquet files into Arrow ColumnarBatch objects
>> >
>> > Updated the production code where the reading and writing happens to
>> > use the ObjectModelRegistry and the new reader/writer interfaces to
>> > access data files.
>> > Kept the testing code intact to ensure that the new API/code is not
>> > breaking anything.
>> >
>> > The original change was not small, and it grew substantially during
>> > the review rounds. So if you have questions, or if I can do anything
>> > to make the review easier, don't hesitate to ask. I am happy to do
>> > anything to move this forward.
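>> >
>> > To give a feel for the intended usage, here is a rough sketch of how
>> > an engine could obtain a reader and a writer through the registry.
>> > The method and parameter names below are illustrative assumptions
>> > based on the description above - please treat the PR itself as the
>> > source of truth:
>> >
>> >     // Hypothetical usage - names are assumptions, not the merged API.
>> >     // Ask the registry for the Parquet ReadBuilder of the "generic"
>> >     // (Iceberg Record) object model.
>> >     ReadBuilder readBuilder =
>> >         ObjectModelRegistry.readBuilder(FileFormat.PARQUET, "generic", inputFile);
>> >     CloseableIterable<Record> reader = readBuilder
>> >         .project(tableSchema)                  // expected read schema
>> >         .filter(Expressions.equal("id", 42L))  // best-effort pushdown
>> >         .build();
>> >
>> >     // Writer side, analogously: request a DataWriterBuilder for the
>> >     // same file format / object model pair to produce a data file.
>> >     DataWriterBuilder writerBuilder =
>> >         ObjectModelRegistry.writerBuilder(FileFormat.PARQUET, "generic", outputFile);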
>> >
>> > Thanks,
>> > Peter
>> >
>> > Péter Váry <peter.vary.apa...@gmail.com> wrote (on Wed, Mar 26, 2025, 14:54):
>> >>
>> >> Hi everyone,
>> >>
>> >> I have updated the File Format API PR
>> >> (https://github.com/apache/iceberg/pull/12298) based on the answers
>> >> and review comments.
>> >>
>> >> I would like to merge this only after the 1.9.0 release, so we have
>> >> more time to find and solve any issues before this goes into a
>> >> release for the users.
>> >>
>> >> For this I have updated the deprecation comments accordingly.
>> >> I would like to ask you to review the PR, so we can iron out any
>> >> requested changes and be ready to merge as soon as possible after
>> >> the 1.9.0 release.
>> >>
>> >> Thanks,
>> >> Peter
>> >>
>> >> Péter Váry <peter.vary.apa...@gmail.com> wrote (on Fri, Mar 21, 2025, 14:32):
>> >>>
>> >>> Hi Renjie,
>> >>>
>> >>> > 1. File format filters
>> >>> >
>> >>> > Do the filters include filter expressions from both the user
>> >>> > query and the delete filter?
>> >>>
>> >>> The current discussion is about the filters from the user query.
>> >>>
>> >>> About the delete filter:
>> >>> Based on the suggestions on the PR, I have moved the delete filter
>> >>> out of the main API. I created a `SupportsDeleteFilter` interface
>> >>> for it, which allows pushing the filter down to the Parquet
>> >>> vectorized readers in Spark, as this is the only place where we
>> >>> currently implement this feature.
>> >>>
>> >>> Renjie Liu <liurenjie2...@gmail.com> wrote (on Fri, Mar 21, 2025, 14:11):
>> >>>>
>> >>>> Hi, Peter:
>> >>>>
>> >>>> Thanks for the effort on this.
>> >>>>
>> >>>> 1. File format filters
>> >>>>
>> >>>> Do the filters include filter expressions from both the user query
>> >>>> and the delete filter?
>> >>>>
>> >>>> For filters from the user query, I agree with you that we should
>> >>>> keep the current behavior.
>> >>>>
>> >>>> For delete filters associated with data files, at first I thought
>> >>>> file format readers should not care about them. But now I realize
>> >>>> that maybe we need to push them down to the file reader as well;
>> >>>> this is useful when the `IS_DELETED` metadata column is not needed
>> >>>> and we could use these filters (position deletes, etc.) to further
>> >>>> prune data.
>> >>>>
>> >>>> But anyway, I agree that we could postpone it to a follow-up PR.
>> >>>>
>> >>>> 2. Batch size configuration
>> >>>>
>> >>>> I'm leaning toward option 2.
>> >>>>
>> >>>> 3. Spark configuration
>> >>>>
>> >>>> I'm leaning towards using different configuration objects.
>> >>>>
>> >>>> On Thu, Mar 20, 2025 at 10:23 PM Péter Váry
>> >>>> <peter.vary.apa...@gmail.com> wrote:
>> >>>>>
>> >>>>> Hi Team,
>> >>>>>
>> >>>>> Thanks everyone for the reviews on
>> >>>>> https://github.com/apache/iceberg/pull/12298!
>> >>>>> I have addressed most of the comments, but a few questions still
>> >>>>> remain which might merit a wider audience:
>> >>>>>
>> >>>>> We should decide on the expected filtering behavior when the
>> >>>>> filters are pushed down to the readers. Currently the filters are
>> >>>>> applied as best effort by the file format readers; some readers
>> >>>>> (Avro) just skip them altogether. There was a suggestion on the PR
>> >>>>> that we might enforce stricter requirements: the readers would
>> >>>>> either reject part of the filters, or apply them fully.
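>> >>>>>
>> >>>>> To illustrate the current best-effort contract (a sketch only,
>> >>>>> with a hypothetical readBuilder): even when a filter is pushed
>> >>>>> down, the caller keeps its own residual filtering, so a reader
>> >>>>> that ignores the expression - like Avro today - still produces
>> >>>>> correct results:
>> >>>>>
>> >>>>>     // readBuilder is hypothetical; Evaluator and CloseableIterable
>> >>>>>     // are the existing Iceberg expression/IO utilities.
>> >>>>>     Expression filter = Expressions.greaterThan("id", 100L);
>> >>>>>     CloseableIterable<Record> rows =
>> >>>>>         readBuilder.filter(filter).build();  // may prune, may be a no-op
>> >>>>>     Evaluator residual = new Evaluator(schema.asStruct(), filter);
>> >>>>>     CloseableIterable<Record> result =
>> >>>>>         CloseableIterable.filter(rows, residual::eval);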
>> >>>>>
>> >>>>> Batch sizes are currently parameters of the reader builders, and
>> >>>>> they can be set for non-vectorized readers too, which could be
>> >>>>> confusing.
>> >>>>>
>> >>>>> Currently the Spark batch reader uses different configuration
>> >>>>> objects (ParquetBatchReadConf and OrcBatchReadConf), as requested
>> >>>>> by the reviewers of the Comet PR. There was a suggestion on the
>> >>>>> current PR to use a common configuration instead.
>> >>>>>
>> >>>>> I would be interested in hearing your thoughts about these topics.
>> >>>>>
>> >>>>> My current take:
>> >>>>>
>> >>>>> File format filters: I am leaning towards keeping the current
>> >>>>> lenient behavior, especially since Bloom filters are not able to
>> >>>>> do full filtering, yet they are often used as a way to filter out
>> >>>>> unwanted records. Another option would be to implement a secondary
>> >>>>> filtering step inside the file formats themselves, which I think
>> >>>>> would cause extra complexity and possible code duplication.
>> >>>>> Whatever the decision here, I would suggest moving this out to a
>> >>>>> follow-up PR, as the current changeset is big enough as it is.
>> >>>>>
>> >>>>> Batch size configuration: Currently this is the only property
>> >>>>> which differs between the batch readers and the non-vectorized
>> >>>>> readers. I see 3 possible solutions (option 2 is sketched below):
>> >>>>> 1. Create different builders for vectorized and non-vectorized
>> >>>>>    reads - I don't think the current solution is confusing enough
>> >>>>>    to be worth the extra class
>> >>>>> 2. Put this into the reader configuration property set - this
>> >>>>>    could work, but it would "hide" a configuration option which is
>> >>>>>    valid for both the Parquet and ORC readers
>> >>>>> 3. Keep things as they are now - I would choose this one, but I
>> >>>>>    don't have a strong opinion here
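>> >>>>>
>> >>>>> A sketch of option 2 versus the current state (the method and
>> >>>>> property names here are invented for illustration only):
>> >>>>>
>> >>>>>     // Current: batch size is a dedicated builder method, visible
>> >>>>>     // even for non-vectorized readers where it has no effect.
>> >>>>>     readBuilder.recordsPerBatch(4096);           // hypothetical name
>> >>>>>
>> >>>>>     // Option 2: batch size travels in the generic reader property
>> >>>>>     // set instead, shared by the Parquet and ORC batch readers.
>> >>>>>     readBuilder.set("read.batch-size", "4096");  // hypothetical key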
>> >>>>>
>> >>>>> Spark configuration: TBH, I'm open to both solutions and happy to
>> >>>>> move in whichever direction the community decides on.
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Peter
>> >>>>>
>> >>>>> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Fri, Mar 14, 2025, 16:31):
>> >>>>>>
>> >>>>>> Hi Peter
>> >>>>>>
>> >>>>>> Thanks for the update. I will do a new pass on the PR.
>> >>>>>>
>> >>>>>> Regards
>> >>>>>> JB
>> >>>>>>
>> >>>>>> On Thu, Mar 13, 2025 at 1:16 PM Péter Váry
>> >>>>>> <peter.vary.apa...@gmail.com> wrote:
>> >>>>>> >
>> >>>>>> > Hi Team,
>> >>>>>> >
>> >>>>>> > I have rebased the File Format API proposal
>> >>>>>> > (https://github.com/apache/iceberg/pull/12298) to include the
>> >>>>>> > new changes needed for the Variant types. I would love to hear
>> >>>>>> > your feedback, especially Dan and Ryan, as you were the most
>> >>>>>> > active during our discussions. If I can help in any way to make
>> >>>>>> > the review easier, please let me know.
>> >>>>>> >
>> >>>>>> > Thanks,
>> >>>>>> > Peter
>> >>>>>> >
>> >>>>>> > Péter Váry <peter.vary.apa...@gmail.com> wrote (on Fri, Feb 28, 2025, 17:50):
>> >>>>>> >>
>> >>>>>> >> Hi everyone,
>> >>>>>> >>
>> >>>>>> >> Thanks for all of the actionable, relevant feedback on the PR
>> >>>>>> >> (https://github.com/apache/iceberg/pull/12298).
>> >>>>>> >> I have updated the code to address most of it. Please check if
>> >>>>>> >> you agree with the general approach.
>> >>>>>> >> If there is a consensus about the general approach, I could
>> >>>>>> >> separate the PR into smaller pieces so we can have an easier
>> >>>>>> >> time reviewing and merging those step-by-step.
>> >>>>>> >>
>> >>>>>> >> Thanks,
>> >>>>>> >> Peter
>> >>>>>> >>
>> >>>>>> >> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Thu, Feb 20, 2025, 14:14):
>> >>>>>> >>>
>> >>>>>> >>> Hi Peter
>> >>>>>> >>>
>> >>>>>> >>> Sorry for the late reply on this.
>> >>>>>> >>>
>> >>>>>> >>> I did a pass on the proposal; it's very interesting and well
>> >>>>>> >>> written. I like the DataFile API, and it is definitely worth
>> >>>>>> >>> discussing all together.
>> >>>>>> >>>
>> >>>>>> >>> Maybe we can schedule a specific meeting to discuss the
>> >>>>>> >>> DataFile API?
>> >>>>>> >>>
>> >>>>>> >>> Thoughts?
>> >>>>>> >>>
>> >>>>>> >>> Regards
>> >>>>>> >>> JB
>> >>>>>> >>>
>> >>>>>> >>> On Tue, Feb 11, 2025 at 5:46 PM Péter Váry
>> >>>>>> >>> <peter.vary.apa...@gmail.com> wrote:
>> >>>>>> >>> >
>> >>>>>> >>> > Hi Team,
>> >>>>>> >>> >
>> >>>>>> >>> > As mentioned earlier on our Community Sync, I am exploring
>> >>>>>> >>> > the possibility of defining a FileFormat API for accessing
>> >>>>>> >>> > different file formats. I have put together a proposal
>> >>>>>> >>> > based on my findings.
>> >>>>>> >>> >
>> >>>>>> >>> > -------------------
>> >>>>>> >>> > Iceberg currently supports 3 different file formats: Avro,
>> >>>>>> >>> > Parquet, and ORC. With the introduction of the Iceberg V3
>> >>>>>> >>> > specification, many new features are being added to
>> >>>>>> >>> > Iceberg. Some of these features, like new column types and
>> >>>>>> >>> > default values, require changes at the file format level.
>> >>>>>> >>> > The changes are added by individual developers with
>> >>>>>> >>> > different focus on the different file formats. As a result,
>> >>>>>> >>> > not all of the features are available for every supported
>> >>>>>> >>> > file format.
>> >>>>>> >>> > Also, there are emerging file formats like Vortex [1] or
>> >>>>>> >>> > Lance [2] which, either by specialization or by applying
>> >>>>>> >>> > newer research results, could provide better alternatives
>> >>>>>> >>> > for certain use-cases, like random access to data or
>> >>>>>> >>> > storing ML models.
>> >>>>>> >>> > -------------------
>> >>>>>> >>> >
>> >>>>>> >>> > Please check the detailed proposal [3] and the Google
>> >>>>>> >>> > document [4], and comment there, or reply on the dev list
>> >>>>>> >>> > if you have any suggestions.
>> >>>>>> >>> >
>> >>>>>> >>> > Thanks,
>> >>>>>> >>> > Peter
>> >>>>>> >>> >
>> >>>>>> >>> > [1] - https://github.com/spiraldb/vortex
>> >>>>>> >>> > [2] - https://lancedb.github.io/lance/
>> >>>>>> >>> > [3] - https://github.com/apache/iceberg/issues/12225
>> >>>>>> >>> > [4] - https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds