Hi JB,

Separated out the proposed interfaces into a new PR: https://github.com/apache/iceberg/pull/12774. Reviewers can check that out if they are only interested in what the new API would look like.

Thanks,
Peter
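For a quick impression of the proposed shape, here is a rough sketch of the three core interfaces, pieced together from the recap further down the thread. The interface names come from the thread itself; the signatures and generics are only guesses, not the exact PR #12774 API:

```java
// Illustrative sketch only: the interface names are from the thread, but
// the method signatures are assumptions, not the exact PR #12774 API.
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.Schema;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.io.FileAppender;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;

/** Builder for reading data from data files. */
interface ReadBuilder<B extends ReadBuilder<B>> {
  B project(Schema schema);          // projection pushdown to the format
  B filter(Expression expression);   // best-effort filter pushdown
  <D> CloseableIterable<D> build();  // rows in the object model's type
}

/** Builder for writing data to data files. */
interface AppenderBuilder<B extends AppenderBuilder<B>> {
  B schema(Schema schema);
  <D> FileAppender<D> build();
}

/** One object model per (file format, in-memory row type) pair. */
interface ObjectModel<B extends ReadBuilder<B>, A extends AppenderBuilder<A>> {
  FileFormat format();               // PARQUET, AVRO, ORC, ...
  String name();                     // e.g. "generic", "spark", "flink"
  B readBuilder(InputFile file);
  A appenderBuilder(OutputFile file);
}
```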
Jean-Baptiste Onofré <j...@nanthrax.net> wrote (Apr 10, 2025, Thu, 18:25):
> Hi Peter
>
> Thanks for the ping about the PR.
>
> Maybe, to facilitate the review and move forward faster, we should
> split the PR into smaller PRs:
> - one with the interfaces (ReadBuilder, AppenderBuilder, ObjectModel,
> AppenderBuilder, DataWriterBuilder, ...)
> - one for each file provider (Parquet, Avro, ORC)
>
> Thoughts? I can help with the split if needed.
>
> Regards
> JB
>
> On Thu, Apr 10, 2025 at 5:16 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
> >
> > Since the 1.9.0 release candidate has been created, I would like to resurrect this PR: https://github.com/apache/iceberg/pull/12298 to ensure that we have as long a testing period as possible for it.
> >
> > To recap, here is what the PR does after the review rounds:
> >
> > Created 3 interface classes which are implemented by the file formats:
> > - ReadBuilder - builder for reading data from data files
> > - AppenderBuilder - builder for writing data to data files
> > - ObjectModel - provides ReadBuilders and AppenderBuilders for a specific data file format and object model pair
> >
> > Updated the Parquet, Avro, ORC implementations for these interfaces, and deprecated the old reader/writer APIs.
> >
> > Created interface classes which will be used by the actual readers/writers of the data files:
> > - AppenderBuilder - builder for writing a file
> > - DataWriterBuilder - builder for generating a data file
> > - PositionDeleteWriterBuilder - builder for generating a position delete file
> > - EqualityDeleteWriterBuilder - builder for generating an equality delete file
> > - No ReadBuilder here - the file format reader builder is reused
> >
> > Created a WriterBuilder class which implements the interfaces above (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) based on a provided file format specific AppenderBuilder.
> >
> > Created an ObjectModelRegistry which stores the available ObjectModels, and from which engines and users can request the readers (ReadBuilder) and writers (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) - see the sketch below.
> >
> > Created the appropriate ObjectModels:
> > - GenericObjectModels - for reading and writing Iceberg Records
> > - SparkObjectModels - for reading (vectorized and non-vectorized) and writing Spark InternalRow/ColumnarBatch objects
> > - FlinkObjectModels - for reading and writing Flink RowData objects
> > - An Arrow object model is also registered for vectorized reads of Parquet files into Arrow ColumnarBatch objects
> >
> > Updated the production code where the reading and writing happens to use the ObjectModelRegistry and the new reader/writer interfaces to access data files.
> >
> > Kept the testing code intact to ensure that the new API/code does not break anything.
> >
> > The original change was not small, and grew substantially during the review rounds. So if you have questions, or I can do anything to make the review easier, don't hesitate to ask. I am happy to do anything to move this forward.
> >
> > Thanks,
> > Peter
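To make the registry part concrete, a minimal sketch of how such a registry could hand out builders, using the interface shapes sketched near the top of the thread. Only the ObjectModelRegistry name and its role come from the thread; the lookup methods and key scheme are assumptions:

```java
// Hypothetical registry sketch; method names and the key scheme are
// illustrative, not the PR's verbatim API.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;

final class ObjectModelRegistrySketch {
  // Registered object models, keyed by (file format, object model name).
  private static final Map<String, ObjectModel<?, ?>> MODELS =
      new ConcurrentHashMap<>();

  private ObjectModelRegistrySketch() {}

  /** Called once per object model, e.g. when GenericObjectModels loads. */
  static void registerObjectModel(ObjectModel<?, ?> model) {
    MODELS.put(key(model.format(), model.name()), model);
  }

  /** Engines request a reader builder for a format + object model pair. */
  static ReadBuilder<?> readBuilder(
      FileFormat format, String modelName, InputFile file) {
    return MODELS.get(key(format, modelName)).readBuilder(file);
  }

  /** Same idea on the write side; the WriterBuilder described above would
      wrap this to expose the data/delete writer builders. */
  static AppenderBuilder<?> appenderBuilder(
      FileFormat format, String modelName, OutputFile file) {
    return MODELS.get(key(format, modelName)).appenderBuilder(file);
  }

  private static String key(FileFormat format, String modelName) {
    return format.name() + "/" + modelName;
  }
}
```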
> > Péter Váry <peter.vary.apa...@gmail.com> wrote (Mar 26, 2025, Wed, 14:54):
> >>
> >> Hi everyone,
> >>
> >> I have updated the File Format API PR (https://github.com/apache/iceberg/pull/12298) based on the answers and review comments.
> >>
> >> I would like to merge this only after the 1.9.0 release so we have more time to find and fix any issues before this goes into a release for the users.
> >>
> >> For this I have updated the deprecation comments accordingly.
> >> I would like to ask you to review the PR, so we can iron out any requested changes and be ready for the merge as soon as possible after the 1.9.0 release.
> >>
> >> Thanks,
> >> Peter
> >>
> >> Péter Váry <peter.vary.apa...@gmail.com> wrote (Mar 21, 2025, Fri, 14:32):
> >>>
> >>> Hi Renjie,
> >>>
> >>> > 1. File format filters
> >>> >
> >>> > Do the filters include filter expressions from both the user query and the delete filter?
> >>>
> >>> The current discussion is about the filters from the user query.
> >>>
> >>> About the delete filter:
> >>> Based on the suggestions on the PR, I have moved the delete filter out from the main API. Created a `SupportsDeleteFilter` interface for it, which would allow pushing the delete filter down to the Parquet vectorized readers in Spark, as this is the only place where we have currently implemented this feature.
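Only the `SupportsDeleteFilter` name appears in the thread, so the following is just a guess at the shape such a mix-in could take; the method name and parameter type are assumptions:

```java
// Hypothetical sketch of the SupportsDeleteFilter mix-in mentioned above;
// everything except the interface name is an assumption.
import org.apache.iceberg.data.DeleteFilter;

interface SupportsDeleteFilter<B> {
  // Hands the engine-side delete filter to a reader builder, so the
  // vectorized Parquet path in Spark can drop deleted rows during the
  // scan instead of filtering them afterwards.
  B deleteFilter(DeleteFilter<?> filter);
}
```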
> >>>
> >>> Renjie Liu <liurenjie2...@gmail.com> wrote (Mar 21, 2025, Fri, 14:11):
> >>>>
> >>>> Hi, Peter:
> >>>>
> >>>> Thanks for the effort on this.
> >>>>
> >>>> 1. File format filters
> >>>>
> >>>> Do the filters include filter expressions from both the user query and the delete filter?
> >>>>
> >>>> For filters from the user query, I agree with you that we should keep the current behavior.
> >>>>
> >>>> For delete filters associated with data files, at first I thought file format readers should not care about this. But now I realize that maybe we need to push them to the file reader as well; this is useful when the `IS_DELETED` metadata column is not necessary and we could use these filters (position deletes, etc.) to further prune data.
> >>>>
> >>>> But anyway, I agree that we could postpone it to a follow-up PR.
> >>>>
> >>>> 2. Batch size configuration
> >>>>
> >>>> I'm leaning toward option 2.
> >>>>
> >>>> 3. Spark configuration
> >>>>
> >>>> I'm leaning towards using different configuration objects.
> >>>>
> >>>> On Thu, Mar 20, 2025 at 10:23 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
> >>>>>
> >>>>> Hi Team,
> >>>>> Thanks everyone for the reviews on https://github.com/apache/iceberg/pull/12298!
> >>>>> I have addressed most of the comments, but a few questions still remain which might merit a wider audience:
> >>>>>
> >>>>> - We should decide on the expected filtering behavior when the filters are pushed down to the readers. Currently the filters are applied as best effort by the file format readers. Some readers (Avro) just skip them altogether. There was a suggestion on the PR that we might enforce stricter requirements: the readers either reject part of the filters, or they apply them fully.
> >>>>> - Batch sizes are currently parameters of the reader builders, so they could be set for non-vectorized readers too, which could be confusing.
> >>>>> - Currently the Spark batch reader uses different configuration objects, ParquetBatchReadConf and OrcBatchReadConf, as requested by the reviewers of the Comet PR. There was a suggestion on the current PR to use a common configuration instead.
> >>>>>
> >>>>> I would be interested in hearing your thoughts about these topics.
> >>>>>
> >>>>> My current take:
> >>>>>
> >>>>> File format filters: I am leaning towards keeping the current lenient behavior, especially since Bloom filters are not able to do full filtering and are often used as a way to filter out unwanted records. Another option would be to implement secondary filtering inside the file formats themselves, which I think would cause extra complexity and possible code duplication. Whatever the decision here, I would suggest moving this out to a follow-up PR, as the current changeset is big enough as it is.
> >>>>> Batch size configuration: Currently this is the only property which differs between the batch readers and the non-vectorized readers. I see 3 possible solutions (sketched after this message):
> >>>>> - Create different builders for vectorized and non-vectorized reads - I don't think the current solution is confusing enough to be worth the extra class.
> >>>>> - We could put this into the reader configuration property set - this could work, but it would "hide" a configuration option which is valid for both the Parquet and ORC readers.
> >>>>> - We could keep things as they are now - I would choose this one, but I don't have a strong opinion here.
> >>>>> Spark configuration: TBH, I'm open to both solutions and happy to move in whichever direction the community decides on.
> >>>>>
> >>>>> Thanks,
> >>>>> Peter
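A minimal sketch of what the first two batch-size options could look like, built on the ReadBuilder shape sketched at the top of the thread; the method names and the property key are assumptions:

```java
// Sketch of the batch-size options discussed above, against the earlier
// ReadBuilder sketch. Names are illustrative, not the PR's API.

// Option 1: a dedicated builder type, so only vectorized reads expose a
// batch size at all.
interface VectorizedReadBuilder<B extends VectorizedReadBuilder<B>>
    extends ReadBuilder<B> {
  B recordsPerBatch(int batchSize);
}

// Option 2: no dedicated method; the batch size travels in the generic
// property set, e.g. set("read.batch-size", "4096"), which works for both
// Parquet and ORC but hides the option from the builder's signature.
interface ConfigurableReadBuilder<B extends ConfigurableReadBuilder<B>>
    extends ReadBuilder<B> {
  B set(String key, String value);
}

// Option 3 (the status quo): recordsPerBatch(int) sits directly on the
// shared ReadBuilder, so non-vectorized readers see it too, where it is
// simply ignored.
```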
> >>>>>
> >>>>> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (Mar 14, 2025, Fri, 16:31):
> >>>>>>
> >>>>>> Hi Peter
> >>>>>>
> >>>>>> Thanks for the update. I will do a new pass on the PR.
> >>>>>>
> >>>>>> Regards
> >>>>>> JB
> >>>>>>
> >>>>>> On Thu, Mar 13, 2025 at 1:16 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
> >>>>>> >
> >>>>>> > Hi Team,
> >>>>>> > I have rebased the File Format API proposal (https://github.com/apache/iceberg/pull/12298) to include the new changes needed for the Variant types. I would love to hear your feedback, especially from Dan and Ryan, as you were the most active during our discussions. If I can help in any way to make the review easier, please let me know.
> >>>>>> > Thanks,
> >>>>>> > Peter
> >>>>>> >
> >>>>>> > Péter Váry <peter.vary.apa...@gmail.com> wrote (Feb 28, 2025, Fri, 17:50):
> >>>>>> >>
> >>>>>> >> Hi everyone,
> >>>>>> >> Thanks for all of the actionable, relevant feedback on the PR (https://github.com/apache/iceberg/pull/12298).
> >>>>>> >> Updated the code to address most of it. Please check if you agree with the general approach.
> >>>>>> >> If there is a consensus about the general approach, I could separate the PR into smaller pieces so we can review and merge those step-by-step more easily.
> >>>>>> >> Thanks,
> >>>>>> >> Peter
> >>>>>> >>
> >>>>>> >> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (Feb 20, 2025, Thu, 14:14):
> >>>>>> >>>
> >>>>>> >>> Hi Peter
> >>>>>> >>>
> >>>>>> >>> Sorry for the late reply on this.
> >>>>>> >>>
> >>>>>> >>> I did a pass on the proposal; it's very interesting and well written.
> >>>>>> >>> I like the DataFile API and it is definitely worth discussing all together.
> >>>>>> >>>
> >>>>>> >>> Maybe we can schedule a specific meeting to discuss the DataFile API?
> >>>>>> >>>
> >>>>>> >>> Thoughts?
> >>>>>> >>>
> >>>>>> >>> Regards
> >>>>>> >>> JB
> >>>>>> >>>
> >>>>>> >>> On Tue, Feb 11, 2025 at 5:46 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
> >>>>>> >>> >
> >>>>>> >>> > Hi Team,
> >>>>>> >>> >
> >>>>>> >>> > As mentioned earlier on our Community Sync, I am exploring the possibility of defining a FileFormat API for accessing different file formats. I have put together a proposal based on my findings.
> >>>>>> >>> >
> >>>>>> >>> > -------------------
> >>>>>> >>> > Iceberg currently supports 3 different file formats: Avro, Parquet, ORC. With the introduction of the Iceberg V3 specification, many new features are being added to Iceberg. Some of these features, like new column types and default values, require changes at the file format level. The changes are added by individual developers with different focus on the different file formats. As a result, not all of the features are available for every supported file format.
> >>>>>> >>> > Also, there are emerging file formats like Vortex [1] or Lance [2] which, either by specialization or by applying newer research results, could provide better alternatives for certain use cases, like random access to data or storing ML models.
> >>>>>> >>> > -------------------
> >>>>>> >>> >
> >>>>>> >>> > Please check the detailed proposal [3] and the Google document [4], and comment there or reply on the dev list if you have any suggestions.
> >>>>>> >>> >
> >>>>>> >>> > Thanks,
> >>>>>> >>> > Peter
> >>>>>> >>> >
> >>>>>> >>> > [1] - https://github.com/spiraldb/vortex
> >>>>>> >>> > [2] - https://lancedb.github.io/lance/
> >>>>>> >>> > [3] - https://github.com/apache/iceberg/issues/12225
> >>>>>> >>> > [4] - https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds