Hi Peter

Awesome! Thank you so much! I will do a new pass.
Regards
JB

On Fri, Apr 11, 2025 at 3:48 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>
> Hi JB,
>
> Separated out the proposed interfaces to a new PR:
> https://github.com/apache/iceberg/pull/12774.
> Reviewers can check that out if they are only interested in what the
> new API would look like.
>
> Thanks,
> Peter
>
> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Thu, Apr 10, 2025, 18:25):
>>
>> Hi Peter
>>
>> Thanks for the ping about the PR.
>>
>> Maybe, to facilitate the review and move forward faster, we should
>> split the PR into smaller PRs:
>> - one with the interfaces (ReadBuilder, AppenderBuilder, ObjectModel,
>>   DataWriterBuilder, ...)
>> - one for each file format provider (Parquet, Avro, ORC)
>>
>> Thoughts? I can help with the split if needed.
>>
>> Regards
>> JB
>>
>> On Thu, Apr 10, 2025 at 5:16 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>> >
>> > Since the 1.9.0 release candidate has been created, I would like to
>> > resurrect this PR: https://github.com/apache/iceberg/pull/12298 to
>> > ensure that we have as long a testing period as possible for it.
>> >
>> > To recap, here is what the PR does after the review rounds:
>> >
>> > Created 3 interface classes which are implemented by the file formats:
>> > - ReadBuilder - builder for reading data from data files
>> > - AppenderBuilder - builder for writing data to data files
>> > - ObjectModel - provides ReadBuilders and AppenderBuilders for a
>> >   specific data file format and object model pair
>> >
>> > Updated the Parquet, Avro and ORC implementations for these
>> > interfaces, and deprecated the old reader/writer APIs.
>> >
>> > Created interface classes which will be used by the actual
>> > readers/writers of the data files:
>> > - AppenderBuilder - builder for writing a file
>> > - DataWriterBuilder - builder for generating a data file
>> > - PositionDeleteWriterBuilder - builder for generating a position
>> >   delete file
>> > - EqualityDeleteWriterBuilder - builder for generating an equality
>> >   delete file
>> > There is no ReadBuilder here - the file format reader builder is
>> > reused.
>> >
>> > Created a WriterBuilder class which implements the interfaces above
>> > (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder)
>> > based on a provided file format specific AppenderBuilder.
>> >
>> > Created an ObjectModelRegistry which stores the available
>> > ObjectModels, and from which engines and users can request the
>> > readers (ReadBuilder) and writers
>> > (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder).
>> >
>> > Created the appropriate ObjectModels:
>> > - GenericObjectModels - for reading and writing Iceberg Records
>> > - SparkObjectModels - for reading (vectorized and non-vectorized) and
>> >   writing Spark InternalRow/ColumnarBatch objects
>> > - FlinkObjectModels - for reading and writing Flink RowData objects
>> > - An Arrow object model is also registered for vectorized reads of
>> >   Parquet files into Arrow ColumnarBatch objects
>> >
>> > Updated the production code where the reading and writing happens to
>> > use the ObjectModelRegistry and the new reader/writer interfaces to
>> > access data files.
>> > Kept the testing code intact to ensure that the new API/code is not
>> > breaking anything.
>> >
>> > The original change was not small, and it grew substantially during
>> > the review rounds. So if you have questions, or if I can do anything
>> > to make the review easier, don't hesitate to ask. I am happy to do
>> > anything to move this forward.
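>> >
>> > To give a feel for the intended usage, here is a rough sketch of how
>> > an engine could obtain a reader and a writer through the registry.
>> > The method and parameter names below are illustrative assumptions
>> > based on the description above - please treat the PR itself as the
>> > source of truth:
>> >
>> >     // Hypothetical usage - names are assumptions, not the merged API.
>> >     // Ask the registry for the Parquet ReadBuilder of the "generic"
>> >     // (Iceberg Record) object model.
>> >     ReadBuilder readBuilder =
>> >         ObjectModelRegistry.readBuilder(FileFormat.PARQUET, "generic", inputFile);
>> >     CloseableIterable<Record> reader = readBuilder
>> >         .project(tableSchema)                  // expected read schema
>> >         .filter(Expressions.equal("id", 42L))  // best-effort pushdown
>> >         .build();
>> >
>> >     // Writer side, analogously: request a DataWriterBuilder for the
>> >     // same file format / object model pair to produce a data file.
>> >     DataWriterBuilder writerBuilder =
>> >         ObjectModelRegistry.writerBuilder(FileFormat.PARQUET, "generic", outputFile);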
>> >
>> > Thanks,
>> > Peter
>> >
>> > Péter Váry <peter.vary.apa...@gmail.com> wrote (on Wed, Mar 26, 2025, 14:54):
>> >>
>> >> Hi everyone,
>> >>
>> >> I have updated the File Format API PR
>> >> (https://github.com/apache/iceberg/pull/12298) based on the answers
>> >> and review comments.
>> >>
>> >> I would like to merge this only after the 1.9.0 release, so we have
>> >> more time to find and solve any issues before this goes into a
>> >> release for the users.
>> >>
>> >> For this I have updated the deprecation comments accordingly.
>> >> I would like to ask you to review the PR, so we can iron out any
>> >> requested changes and be ready to merge as soon as possible after
>> >> the 1.9.0 release.
>> >>
>> >> Thanks,
>> >> Peter
>> >>
>> >> Péter Váry <peter.vary.apa...@gmail.com> wrote (on Fri, Mar 21, 2025, 14:32):
>> >>>
>> >>> Hi Renjie,
>> >>>
>> >>> > 1. File format filters
>> >>> >
>> >>> > Do the filters include filter expressions from both the user
>> >>> > query and the delete filter?
>> >>>
>> >>> The current discussion is about the filters from the user query.
>> >>>
>> >>> About the delete filter:
>> >>> Based on the suggestions on the PR, I have moved the delete filter
>> >>> out of the main API. I created a `SupportsDeleteFilter` interface
>> >>> for it, which allows pushing the filter down to the Parquet
>> >>> vectorized readers in Spark, as this is the only place where we
>> >>> currently implement this feature.
>> >>>
>> >>> Renjie Liu <liurenjie2...@gmail.com> wrote (on Fri, Mar 21, 2025, 14:11):
>> >>>>
>> >>>> Hi, Peter:
>> >>>>
>> >>>> Thanks for the effort on this.
>> >>>>
>> >>>> 1. File format filters
>> >>>>
>> >>>> Do the filters include filter expressions from both the user query
>> >>>> and the delete filter?
>> >>>>
>> >>>> For filters from the user query, I agree with you that we should
>> >>>> keep the current behavior.
>> >>>>
>> >>>> For delete filters associated with data files, at first I thought
>> >>>> file format readers should not care about them. But now I realize
>> >>>> that maybe we need to push them down to the file reader as well;
>> >>>> this is useful when the `IS_DELETED` metadata column is not needed
>> >>>> and we could use these filters (position deletes, etc.) to further
>> >>>> prune data.
>> >>>>
>> >>>> But anyway, I agree that we could postpone it to a follow-up PR.
>> >>>>
>> >>>> 2. Batch size configuration
>> >>>>
>> >>>> I'm leaning toward option 2.
>> >>>>
>> >>>> 3. Spark configuration
>> >>>>
>> >>>> I'm leaning towards using different configuration objects.
>> >>>>
>> >>>> On Thu, Mar 20, 2025 at 10:23 PM Péter Váry
>> >>>> <peter.vary.apa...@gmail.com> wrote:
>> >>>>>
>> >>>>> Hi Team,
>> >>>>>
>> >>>>> Thanks everyone for the reviews on
>> >>>>> https://github.com/apache/iceberg/pull/12298!
>> >>>>> I have addressed most of the comments, but a few questions still
>> >>>>> remain which might merit a wider audience:
>> >>>>>
>> >>>>> We should decide on the expected filtering behavior when the
>> >>>>> filters are pushed down to the readers. Currently the filters are
>> >>>>> applied as best effort by the file format readers; some readers
>> >>>>> (Avro) just skip them altogether. There was a suggestion on the PR
>> >>>>> that we might enforce stricter requirements: the readers would
>> >>>>> either reject part of the filters, or apply them fully.
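>> >>>>>
>> >>>>> To illustrate the current best-effort contract (a sketch only,
>> >>>>> with a hypothetical readBuilder): even when a filter is pushed
>> >>>>> down, the caller keeps its own residual filtering, so a reader
>> >>>>> that ignores the expression - like Avro today - still produces
>> >>>>> correct results:
>> >>>>>
>> >>>>>     // readBuilder is hypothetical; Evaluator and CloseableIterable
>> >>>>>     // are the existing Iceberg expression/IO utilities.
>> >>>>>     Expression filter = Expressions.greaterThan("id", 100L);
>> >>>>>     CloseableIterable<Record> rows =
>> >>>>>         readBuilder.filter(filter).build();  // may prune, may be a no-op
>> >>>>>     Evaluator residual = new Evaluator(schema.asStruct(), filter);
>> >>>>>     CloseableIterable<Record> result =
>> >>>>>         CloseableIterable.filter(rows, residual::eval);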
>> >>>>>
>> >>>>> Batch sizes are currently parameters of the reader builders, and
>> >>>>> they can be set for non-vectorized readers too, which could be
>> >>>>> confusing.
>> >>>>>
>> >>>>> Currently the Spark batch reader uses different configuration
>> >>>>> objects (ParquetBatchReadConf and OrcBatchReadConf), as requested
>> >>>>> by the reviewers of the Comet PR. There was a suggestion on the
>> >>>>> current PR to use a common configuration instead.
>> >>>>>
>> >>>>> I would be interested in hearing your thoughts about these topics.
>> >>>>>
>> >>>>> My current take:
>> >>>>>
>> >>>>> File format filters: I am leaning towards keeping the current
>> >>>>> lenient behavior, especially since Bloom filters are not able to
>> >>>>> do full filtering, yet they are often used as a way to filter out
>> >>>>> unwanted records. Another option would be to implement a secondary
>> >>>>> filtering step inside the file formats themselves, which I think
>> >>>>> would cause extra complexity and possible code duplication.
>> >>>>> Whatever the decision here, I would suggest moving this out to a
>> >>>>> follow-up PR, as the current changeset is big enough as it is.
>> >>>>>
>> >>>>> Batch size configuration: Currently this is the only property
>> >>>>> which differs between the batch readers and the non-vectorized
>> >>>>> readers. I see 3 possible solutions (option 2 is sketched below):
>> >>>>> 1. Create different builders for vectorized and non-vectorized
>> >>>>>    reads - I don't think the current solution is confusing enough
>> >>>>>    to be worth the extra class
>> >>>>> 2. Put this into the reader configuration property set - this
>> >>>>>    could work, but it would "hide" a configuration option which is
>> >>>>>    valid for both the Parquet and ORC readers
>> >>>>> 3. Keep things as they are now - I would choose this one, but I
>> >>>>>    don't have a strong opinion here
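>> >>>>>
>> >>>>> A sketch of option 2 versus the current state (the method and
>> >>>>> property names here are invented for illustration only):
>> >>>>>
>> >>>>>     // Current: batch size is a dedicated builder method, visible
>> >>>>>     // even for non-vectorized readers where it has no effect.
>> >>>>>     readBuilder.recordsPerBatch(4096);           // hypothetical name
>> >>>>>
>> >>>>>     // Option 2: batch size travels in the generic reader property
>> >>>>>     // set instead, shared by the Parquet and ORC batch readers.
>> >>>>>     readBuilder.set("read.batch-size", "4096");  // hypothetical key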
>> >>>>>
>> >>>>> Spark configuration: TBH, I'm open to both solutions and happy to
>> >>>>> move in whichever direction the community decides on.
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Peter
>> >>>>>
>> >>>>> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Fri, Mar 14, 2025, 16:31):
>> >>>>>>
>> >>>>>> Hi Peter
>> >>>>>>
>> >>>>>> Thanks for the update. I will do a new pass on the PR.
>> >>>>>>
>> >>>>>> Regards
>> >>>>>> JB
>> >>>>>>
>> >>>>>> On Thu, Mar 13, 2025 at 1:16 PM Péter Váry
>> >>>>>> <peter.vary.apa...@gmail.com> wrote:
>> >>>>>> >
>> >>>>>> > Hi Team,
>> >>>>>> >
>> >>>>>> > I have rebased the File Format API proposal
>> >>>>>> > (https://github.com/apache/iceberg/pull/12298) to include the
>> >>>>>> > new changes needed for the Variant types. I would love to hear
>> >>>>>> > your feedback, especially Dan and Ryan, as you were the most
>> >>>>>> > active during our discussions. If I can help in any way to make
>> >>>>>> > the review easier, please let me know.
>> >>>>>> >
>> >>>>>> > Thanks,
>> >>>>>> > Peter
>> >>>>>> >
>> >>>>>> > Péter Váry <peter.vary.apa...@gmail.com> wrote (on Fri, Feb 28, 2025, 17:50):
>> >>>>>> >>
>> >>>>>> >> Hi everyone,
>> >>>>>> >>
>> >>>>>> >> Thanks for all of the actionable, relevant feedback on the PR
>> >>>>>> >> (https://github.com/apache/iceberg/pull/12298).
>> >>>>>> >> I have updated the code to address most of it. Please check if
>> >>>>>> >> you agree with the general approach.
>> >>>>>> >> If there is a consensus about the general approach, I could
>> >>>>>> >> separate the PR into smaller pieces so we can have an easier
>> >>>>>> >> time reviewing and merging those step-by-step.
>> >>>>>> >>
>> >>>>>> >> Thanks,
>> >>>>>> >> Peter
>> >>>>>> >>
>> >>>>>> >> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Thu, Feb 20, 2025, 14:14):
>> >>>>>> >>>
>> >>>>>> >>> Hi Peter
>> >>>>>> >>>
>> >>>>>> >>> Sorry for the late reply on this.
>> >>>>>> >>>
>> >>>>>> >>> I did a pass on the proposal; it's very interesting and well
>> >>>>>> >>> written. I like the DataFile API, and it is definitely worth
>> >>>>>> >>> discussing all together.
>> >>>>>> >>>
>> >>>>>> >>> Maybe we can schedule a specific meeting to discuss the
>> >>>>>> >>> DataFile API?
>> >>>>>> >>>
>> >>>>>> >>> Thoughts?
>> >>>>>> >>>
>> >>>>>> >>> Regards
>> >>>>>> >>> JB
>> >>>>>> >>>
>> >>>>>> >>> On Tue, Feb 11, 2025 at 5:46 PM Péter Váry
>> >>>>>> >>> <peter.vary.apa...@gmail.com> wrote:
>> >>>>>> >>> >
>> >>>>>> >>> > Hi Team,
>> >>>>>> >>> >
>> >>>>>> >>> > As mentioned earlier on our Community Sync, I am exploring
>> >>>>>> >>> > the possibility of defining a FileFormat API for accessing
>> >>>>>> >>> > different file formats. I have put together a proposal
>> >>>>>> >>> > based on my findings.
>> >>>>>> >>> >
>> >>>>>> >>> > -------------------
>> >>>>>> >>> > Iceberg currently supports 3 different file formats: Avro,
>> >>>>>> >>> > Parquet, and ORC. With the introduction of the Iceberg V3
>> >>>>>> >>> > specification, many new features are being added to
>> >>>>>> >>> > Iceberg. Some of these features, like new column types and
>> >>>>>> >>> > default values, require changes at the file format level.
>> >>>>>> >>> > The changes are added by individual developers with
>> >>>>>> >>> > different focus on the different file formats. As a result,
>> >>>>>> >>> > not all of the features are available for every supported
>> >>>>>> >>> > file format.
>> >>>>>> >>> > Also, there are emerging file formats like Vortex [1] or
>> >>>>>> >>> > Lance [2] which, either by specialization or by applying
>> >>>>>> >>> > newer research results, could provide better alternatives
>> >>>>>> >>> > for certain use-cases, like random access to data or
>> >>>>>> >>> > storing ML models.
>> >>>>>> >>> > -------------------
>> >>>>>> >>> >
>> >>>>>> >>> > Please check the detailed proposal [3] and the Google
>> >>>>>> >>> > document [4], and comment there, or reply on the dev list
>> >>>>>> >>> > if you have any suggestions.
>> >>>>>> >>> >
>> >>>>>> >>> > Thanks,
>> >>>>>> >>> > Peter
>> >>>>>> >>> >
>> >>>>>> >>> > [1] - https://github.com/spiraldb/vortex
>> >>>>>> >>> > [2] - https://lancedb.github.io/lance/
>> >>>>>> >>> > [3] - https://github.com/apache/iceberg/issues/12225
>> >>>>>> >>> > [4] - https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds