Hi Peter,

Thanks for the ping about the PR.
Maybe, to facilitate the review and move forward faster, we should split the PR into smaller PRs:
- one with the interfaces (ReadBuilder, AppenderBuilder, ObjectModel, DataWriterBuilder, ...)
- one for each file provider (Parquet, Avro, ORC)

Thoughts? I can help with the split if needed.

Regards
JB

On Thu, Apr 10, 2025 at 5:16 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>
> Since the 1.9.0 release candidate has been created, I would like to resurrect this PR: https://github.com/apache/iceberg/pull/12298 to ensure that we have as long a testing period as possible for it.
>
> To recap, here is what the PR does after the review rounds:
>
> Created 3 interface classes which are implemented by the file formats:
>
> ReadBuilder - Builder for reading data from data files
> AppenderBuilder - Builder for writing data to data files
> ObjectModel - Provides ReadBuilders and AppenderBuilders for a specific data file format and object model pair
>
> Updated the Parquet, Avro, and ORC implementations of these interfaces, and deprecated the old reader/writer APIs.
> Created interface classes which will be used by the actual readers/writers of the data files:
>
> AppenderBuilder - Builder for writing a file
> DataWriterBuilder - Builder for generating a data file
> PositionDeleteWriterBuilder - Builder for generating a position delete file
> EqualityDeleteWriterBuilder - Builder for generating an equality delete file
> No ReadBuilder here - the file format reader builder is reused
>
> Created a WriterBuilder class which implements the interfaces above (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) based on a provided file-format-specific AppenderBuilder.
> Created an ObjectModelRegistry which stores the available ObjectModels, and from which engines and users can request readers (ReadBuilder) and writers (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder).
> Created the appropriate ObjectModels:
>
> GenericObjectModels - for reading and writing Iceberg Records
> SparkObjectModels - for reading (vectorized and non-vectorized) and writing Spark InternalRow/ColumnarBatch objects
> FlinkObjectModels - for reading and writing Flink RowData objects
> An Arrow object model is also registered for vectorized reads of Parquet files into Arrow ColumnarBatch objects
>
> Updated the production code where the reading and writing happens to use the ObjectModelRegistry and the new reader/writer interfaces to access data files.
> Kept the testing code intact to ensure that the new API/code does not break anything.
>
> The original change was not small, and it grew substantially during the review rounds. So if you have questions, or if I can do anything to make the review easier, don't hesitate to ask. I am happy to do anything to move this forward.
>
> Thanks,
> Peter
>
> On Wed, Mar 26, 2025 at 14:54 Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>
>> Hi everyone,
>>
>> I have updated the File Format API PR (https://github.com/apache/iceberg/pull/12298) based on the answers and review comments.
>>
>> I would like to merge this only after the 1.9.0 release, so we have more time to find and solve any issues before this goes into a release for the users.
>>
>> For this I have updated the deprecation comments accordingly.
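>>
>> To make the intended usage a bit more concrete for reviewers, here is a rough sketch of how an engine could obtain a reader through the registry. The names follow the recap above, but the exact signatures in the PR may differ, so please treat this as an illustration only:
>>
>>   // Hypothetical usage sketch - names follow the recap above, the exact
>>   // signatures in the PR may differ.
>>   // Request the reader registered for the Parquet format and the "generic"
>>   // object model (Iceberg Records), configure it, then read the rows.
>>   try (CloseableIterable<Record> records =
>>       ObjectModelRegistry.readBuilder(FileFormat.PARQUET, "generic", inputFile)
>>           .project(projectionSchema)   // expected read schema
>>           .split(start, length)        // byte range of the file to read
>>           .build()) {
>>     for (Record record : records) {
>>       process(record);                 // placeholder for engine-side handling
>>     }
>>   }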
>>
>> I would like to ask you to review the PR, so we can iron out any requested changes and be ready to merge as soon as possible after the 1.9.0 release.
>>
>> Thanks,
>> Peter
>>
>> On Fri, Mar 21, 2025 at 14:32 Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>
>>> Hi Renjie,
>>>
>>> > 1. File format filters
>>> >
>>> > Do the filters include filter expressions from both the user query and the delete filter?
>>>
>>> The current discussion is about the filters from the user query.
>>>
>>> About the delete filter:
>>> Based on the suggestions on the PR, I have moved the delete filter out of the main API and created a `SupportsDeleteFilter` interface for it, which would allow pushing the filter down to the Parquet vectorized readers in Spark, as that is the only place where we currently implement this feature.
>>>
>>> On Fri, Mar 21, 2025 at 14:11 Renjie Liu <liurenjie2...@gmail.com> wrote:
>>>>
>>>> Hi, Peter:
>>>>
>>>> Thanks for the effort on this.
>>>>
>>>> 1. File format filters
>>>>
>>>> Do the filters include filter expressions from both the user query and the delete filter?
>>>>
>>>> For filters from the user query, I agree with you that we should keep the current behavior.
>>>>
>>>> For delete filters associated with data files, at first I thought file format readers should not care about this. But now I realize that maybe we need to push them down to the file reader as well: this is useful when the `IS_DELETED` metadata column is not necessary and we could use these filters (position deletes, etc.) to further prune data.
>>>>
>>>> But anyway, I agree that we could postpone this to a follow-up PR.
>>>>
>>>> 2. Batch size configuration
>>>>
>>>> I'm leaning toward option 2.
>>>>
>>>> 3. Spark configuration
>>>>
>>>> I'm leaning towards using different configuration objects.
>>>>
>>>> On Thu, Mar 20, 2025 at 10:23 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>
>>>>> Hi Team,
>>>>> Thanks everyone for the reviews on https://github.com/apache/iceberg/pull/12298!
>>>>> I have addressed most of the comments, but a few questions remain which might merit a wider audience:
>>>>>
>>>>> We should decide on the expected filtering behavior when filters are pushed down to the readers. Currently the filters are applied on a best-effort basis by the file format readers. Some readers (Avro) just skip them altogether. There was a suggestion on the PR that we might enforce stricter requirements: the readers would either reject part of the filters, or apply the accepted ones fully.
>>>>> Batch sizes are currently parameters of the reader builders, and they can be set for non-vectorized readers too, which could be confusing.
>>>>> Currently the Spark batch reader uses different configuration objects, ParquetBatchReadConf and OrcBatchReadConf, as requested by the reviewers of the Comet PR. There was a suggestion on the current PR to use a common configuration instead.
>>>>>
>>>>> I would be interested in hearing your thoughts about these topics.
>>>>>
>>>>> My current take:
>>>>>
>>>>> File format filters: I am leaning towards keeping the current lenient behavior, especially since Bloom filters are not able to do full filtering and are often used as a way to filter out unwanted records.
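>>>>> To illustrate what the lenient contract means for the caller, a rough sketch (readBuilder stands for the PR's ReadBuilder; Evaluator and CloseableIterable are the existing Iceberg classes): the reader may prune row groups using statistics or Bloom filters, but gives no row-level guarantee, so the caller still applies the full expression:
>>>>>
>>>>>   // Best-effort pushdown: the reader may apply some, all, or none of the filter.
>>>>>   CloseableIterable<Record> rows = readBuilder.filter(expr).build();
>>>>>   // Exact residual filtering stays on the engine side.
>>>>>   Evaluator evaluator = new Evaluator(schema.asStruct(), expr);
>>>>>   CloseableIterable<Record> filtered =
>>>>>       CloseableIterable.filter(rows, evaluator::eval);
>>>>>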
>>>>> Another option would be to implement the secondary, exact filtering inside the file formats themselves, which I think would cause extra complexity and possible code duplication. Whatever the decision here, I would suggest moving this out to a follow-up PR, as the current changeset is big enough as it is.
>>>>> Batch size configuration: Currently this is the only property which differs between the batch readers and the non-vectorized readers. I see 3 possible solutions:
>>>>>
>>>>> Create different builders for vectorized and non-vectorized reads - I don't think the current solution is confusing enough to be worth the extra class
>>>>> We could put this into the reader configuration property set - This could work, but it would "hide" a configuration option which is valid for both the Parquet and ORC readers
>>>>> We could keep things as they are now - I would choose this one, but I don't have a strong opinion here
>>>>>
>>>>> Spark configuration: TBH, I'm open to both solutions and happy to move in the direction the community decides on.
>>>>>
>>>>> Thanks,
>>>>> Peter
>>>>>
>>>>> On Fri, Mar 14, 2025 at 16:31 Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>
>>>>>> Hi Peter
>>>>>>
>>>>>> Thanks for the update. I will do a new pass on the PR.
>>>>>>
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>> On Thu, Mar 13, 2025 at 1:16 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>> >
>>>>>> > Hi Team,
>>>>>> > I have rebased the File Format API proposal (https://github.com/apache/iceberg/pull/12298) to include the new changes needed for the Variant types. I would love to hear your feedback, especially Dan and Ryan, as you were the most active during our discussions. If I can help in any way to make the review easier, please let me know.
>>>>>> > Thanks,
>>>>>> > Peter
>>>>>> >
>>>>>> > On Fri, Feb 28, 2025 at 17:50 Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>> >>
>>>>>> >> Hi everyone,
>>>>>> >> Thanks for all of the actionable, relevant feedback on the PR (https://github.com/apache/iceberg/pull/12298).
>>>>>> >> I have updated the code to address most of it. Please check if you agree with the general approach.
>>>>>> >> If there is a consensus about the general approach, I could separate the PR into smaller pieces so we can have an easier time reviewing and merging those step-by-step.
>>>>>> >> Thanks,
>>>>>> >> Peter
>>>>>> >>
>>>>>> >> On Thu, Feb 20, 2025 at 14:14 Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>> >>>
>>>>>> >>> Hi Peter
>>>>>> >>>
>>>>>> >>> sorry for the late reply on this.
>>>>>> >>>
>>>>>> >>> I did a pass on the proposal; it's very interesting and well written. I like the DataFile API, and it is definitely worth discussing all together.
>>>>>> >>>
>>>>>> >>> Maybe we can schedule a specific meeting to discuss the DataFile API?
>>>>>> >>>
>>>>>> >>> Thoughts?
>>>>>> >>>
>>>>>> >>> Regards
>>>>>> >>> JB
>>>>>> >>>
>>>>>> >>> On Tue, Feb 11, 2025 at 5:46 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>> >>> >
>>>>>> >>> > Hi Team,
>>>>>> >>> >
>>>>>> >>> > As mentioned earlier on our Community Sync, I am exploring the possibility of defining a FileFormat API for accessing different file formats. I have put together a proposal based on my findings.
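>>>>>> >>> >
>>>>>> >>> > As a rough illustration of the direction, a format-level reader builder could look something like this (a hypothetical shape for discussion only - the proposal documents the real details; Schema, Expression, and CloseableIterable are the existing Iceberg types):
>>>>>> >>> >
>>>>>> >>> >   // Hypothetical sketch, not the proposal's actual code.
>>>>>> >>> >   public interface ReadBuilder<D, B extends ReadBuilder<D, B>> {
>>>>>> >>> >     B project(Schema schema);          // columns to read
>>>>>> >>> >     B split(long start, long length);  // byte range within the file
>>>>>> >>> >     B filter(Expression filter);       // best-effort filter pushdown
>>>>>> >>> >     CloseableIterable<D> build();      // rows in the target object model
>>>>>> >>> >   }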
>>>>>> >>> > >>>>>> >>> > ------------------- >>>>>> >>> > Iceberg currently supports 3 different file formats: Avro, >>>>>> >>> > Parquet, ORC. With the introduction of Iceberg V3 specification >>>>>> >>> > many new features are added to Iceberg. Some of these features >>>>>> >>> > like new column types, default values require changes at the file >>>>>> >>> > format level. The changes are added by individual developers with >>>>>> >>> > different focus on the different file formats. As a result not all >>>>>> >>> > of the features are available for every supported file format. >>>>>> >>> > Also there are emerging file formats like Vortex [1] or Lance [2] >>>>>> >>> > which either by specialization, or by applying newer research >>>>>> >>> > results could provide better alternatives for certain use-cases >>>>>> >>> > like random access for data, or storing ML models. >>>>>> >>> > ------------------- >>>>>> >>> > >>>>>> >>> > Please check the detailed proposal [3] and the google document >>>>>> >>> > [4], and comment there or reply on the dev list if you have any >>>>>> >>> > suggestions. >>>>>> >>> > >>>>>> >>> > Thanks, >>>>>> >>> > Peter >>>>>> >>> > >>>>>> >>> > [1] - https://github.com/spiraldb/vortex >>>>>> >>> > [2] - https://lancedb.github.io/lance/ >>>>>> >>> > [3] - https://github.com/apache/iceberg/issues/12225 >>>>>> >>> > [4] - >>>>>> >>> > https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds >>>>>> >>> >