Since the 1.9.0 release candidate has been created, I would like to resurrect this PR: https://github.com/apache/iceberg/pull/12298 to ensure that we have as long a testing period as possible for it.
To recap, here is what the PR does after the review rounds:

- *Created 3 interface classes which are implemented by the file formats*:
   - ReadBuilder - builder for reading data from data files
   - AppenderBuilder - builder for writing data to data files
   - ObjectModel - provides ReadBuilders and AppenderBuilders for a specific data file format and object model pair
- *Updated the Parquet, Avro, and ORC implementations* for these interfaces, and deprecated the old reader/writer APIs
- *Created interface classes which will be used by the actual readers/writers of the data files*:
   - AppenderBuilder - builder for writing a file
   - DataWriterBuilder - builder for generating a data file
   - PositionDeleteWriterBuilder - builder for generating a position delete file
   - EqualityDeleteWriterBuilder - builder for generating an equality delete file
   - No ReadBuilder here - the file format reader builder is reused
- *Created a WriterBuilder class which implements the interfaces above* (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) based on a provided file format specific AppenderBuilder
- *Created an ObjectModelRegistry which stores the available ObjectModels*, and from which engines and users can request the readers (ReadBuilder) and writers (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder)
- *Created the appropriate ObjectModels*:
   - GenericObjectModels - for reading and writing Iceberg Records
   - SparkObjectModels - for reading (vectorized and non-vectorized) and writing Spark InternalRow/ColumnarBatch objects
   - FlinkObjectModels - for reading and writing Flink RowData objects
   - An Arrow object model is also registered for vectorized reads of Parquet files into Arrow ColumnarBatch objects
- *Updated the production code where the reading and writing happens* to use the ObjectModelRegistry and the new reader/writer interfaces to access data files
- *Kept the testing code intact* to ensure that the new API/code is not breaking anything
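To give a feel for how the pieces fit together, here is a minimal usage
sketch. The method names, parameters, and the "generic" object model name
below are illustrative assumptions based on the recap above, not the final
API - please refer to the PR for the actual signatures:

    import org.apache.iceberg.FileFormat;
    import org.apache.iceberg.data.Record;
    import org.apache.iceberg.io.CloseableIterable;

    // Hypothetical sketch: given an inputFile, a projection schema, and a
    // residual filter expression, ask the registry for a ReadBuilder for a
    // (file format, object model) pair and build a reader from it.
    CloseableIterable<Record> records =
        ObjectModelRegistry.readBuilder(FileFormat.PARQUET, "generic", inputFile)
            .project(projectionSchema)   // columns to read
            .split(start, length)        // read only a slice of the file
            .filter(filterExpression)    // best-effort filter pushdown
            .build();

    // Writers are requested the same way; for example, a data file writer:
    // ObjectModelRegistry.writerBuilder(FileFormat.PARQUET, "generic", outputFile)
    //     .schema(tableSchema)
    //     .build();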
The original change was not small, and grew substantially during the review rounds. So if you have questions, or I can do anything to make the review easier, don't hesitate to ask. I am happy to do anything to move this forward.

Thanks,
Peter

Péter Váry <peter.vary.apa...@gmail.com> wrote (on Wed, Mar 26, 2025, 14:54):

> Hi everyone,
>
> I have updated the File Format API PR (
> https://github.com/apache/iceberg/pull/12298) based on the answers and
> review comments.
>
> I would like to merge this only after the 1.9.0 release so we have more
> time to find and solve any issues before this goes out to users in a
> release.
>
> For this I have updated the deprecation comments accordingly.
> I would like to ask you to review the PR, so we can iron out any
> requested changes and be ready to merge as soon as possible after the
> 1.9.0 release.
>
> Thanks,
> Peter
>
> Péter Váry <peter.vary.apa...@gmail.com> wrote (on Fri, Mar 21, 2025, 14:32):
>
>> Hi Renjie,
>>
>> > *1. File format filters*
>> > Do the filters include filter expressions from both the user query
>> > and the delete filter?
>>
>> The current discussion is about the filters from the user query.
>>
>> About the delete filter:
>> Based on the suggestions on the PR, I have moved the delete filter out
>> from the main API. Created a `SupportsDeleteFilter` interface for it,
>> which would allow pushing the filter down to the Parquet vectorized
>> readers in Spark, as this is the only place where we have currently
>> implemented this feature.
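>>
>> As a rough sketch of the shape this takes (the method name and the
>> parameter type here are illustrative assumptions, not the final
>> signature):
>>
>>   // Hypothetical sketch: a file format ReadBuilder that can prune rows
>>   // using row-level deletes implements this alongside the main interface.
>>   public interface SupportsDeleteFilter<B> {
>>     // Attach the delete filter so the reader can drop deleted rows while
>>     // decoding, e.g. in Spark's vectorized Parquet read path.
>>     B deleteFilter(DeleteFilter<?> deleteFilter);
>>   }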
>>
>> Renjie Liu <liurenjie2...@gmail.com> wrote (on Fri, Mar 21, 2025, 14:11):
>>
>>> Hi, Peter:
>>>
>>> Thanks for the effort on this.
>>>
>>> *1. File format filters*
>>>
>>> Do the filters include filter expressions from both the user query and
>>> the delete filter?
>>>
>>> For filters from the user query, I agree with you that we should keep
>>> the current behavior.
>>>
>>> For delete filters associated with data files, at first I thought file
>>> format readers should not care about this. But now I realized that
>>> maybe we also need to push it to the file reader; this is useful when
>>> the `IS_DELETED` metadata column is not necessary and we could use
>>> these filters (position deletes, etc.) to further prune data.
>>>
>>> But anyway, I agree that we could postpone it to a follow-up PR.
>>>
>>> 2. *Batch size configuration*
>>>
>>> I'm leaning toward option 2.
>>>
>>> 3. *Spark configuration*
>>>
>>> I'm leaning towards using different configuration objects.
>>>
>>> On Thu, Mar 20, 2025 at 10:23 PM Péter Váry <peter.vary.apa...@gmail.com>
>>> wrote:
>>>
>>>> Hi Team,
>>>> Thanks everyone for the reviews on
>>>> https://github.com/apache/iceberg/pull/12298!
>>>> I have addressed most of the comments, but a few questions still
>>>> remain which might merit a wider audience:
>>>>
>>>>    1. We should decide on the expected filtering behavior when the
>>>>    filters are pushed down to the readers. Currently the filters are
>>>>    applied as best effort by the file format readers. Some readers
>>>>    (Avro) just skip them altogether. There was a suggestion on the PR
>>>>    that we might enforce stricter requirements: the readers either
>>>>    reject part of the filters, or they apply them fully.
>>>>    2. Batch sizes are currently parameters of the reader builders,
>>>>    which can be set for non-vectorized readers too, which could be
>>>>    confusing.
>>>>    3. Currently the Spark batch reader uses different configuration
>>>>    objects, ParquetBatchReadConf and OrcBatchReadConf, as requested by
>>>>    the reviewers of the Comet PR. There was a suggestion on the current
>>>>    PR to use a common configuration instead.
>>>>
>>>> I would be interested in hearing your thoughts about these topics.
>>>>
>>>> My current take:
>>>>
>>>>    1. *File format filters*: I am leaning towards keeping the current
>>>>    lenient behavior, especially since Bloom filters are not able to do
>>>>    full filtering, and are often used as a way to filter out unwanted
>>>>    records. Another option would be to implement a secondary filtering
>>>>    inside the file formats themselves, which I think would cause extra
>>>>    complexity and possible code duplication. Whatever the decision
>>>>    here, I would suggest moving this out to a next PR, as the current
>>>>    changeset is big enough as it is.
>>>>    2. *Batch size configuration*: Currently this is the only property
>>>>    which differs between the batch readers and the non-vectorized
>>>>    readers. I see 3 possible solutions (see the sketch after this
>>>>    list):
>>>>       1. Create different builders for vectorized and non-vectorized
>>>>       reads - I don't think the current solution is confusing enough to
>>>>       be worth the extra class
>>>>       2. We could put this into the reader configuration property set -
>>>>       This could work, but it would "hide" a configuration option which
>>>>       is valid for both the Parquet and ORC readers
>>>>       3. We could keep things as they are now - I would choose this
>>>>       one, but I don't have a strong opinion here
>>>>    3. *Spark configuration*: TBH, I'm open to both solutions and happy
>>>>    to move in the direction the community decides on
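>>>>
>>>> To make the batch size options concrete, here is a minimal sketch of
>>>> option 3 (current) versus option 2. The method and property names are
>>>> illustrative assumptions, not the actual API:
>>>>
>>>>   // Option 3 (current): batch size is a first-class builder method.
>>>>   // It is visible on every reader builder, even where it is a no-op.
>>>>   ObjectModelRegistry.readBuilder(FileFormat.PARQUET, "spark-vectorized", file)
>>>>       .recordsPerBatch(4096)   // ignored by non-vectorized readers
>>>>       .build();
>>>>
>>>>   // Option 2: batch size moves into the generic reader configuration
>>>>   // property set, keeping the builder surface uniform but "hiding" an
>>>>   // option that is valid for both the Parquet and ORC batch readers.
>>>>   ObjectModelRegistry.readBuilder(FileFormat.PARQUET, "spark-vectorized", file)
>>>>       .set("read.batch-size", "4096")
>>>>       .build();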
>>>>
>>>> Thanks,
>>>> Peter
>>>>
>>>> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Fri, Mar 14, 2025, 16:31):
>>>>
>>>>> Hi Peter
>>>>>
>>>>> Thanks for the update. I will do a new pass on the PR.
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On Thu, Mar 13, 2025 at 1:16 PM Péter Váry <
>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>> >
>>>>> > Hi Team,
>>>>> > I have rebased the File Format API proposal (
>>>>> https://github.com/apache/iceberg/pull/12298) to include the new
>>>>> changes needed for the Variant types. I would love to hear your
>>>>> feedback, especially Dan and Ryan, as you were the most active during
>>>>> our discussions. If I can help in any way to make the review easier,
>>>>> please let me know.
>>>>> > Thanks,
>>>>> > Peter
>>>>> >
>>>>> > Péter Váry <peter.vary.apa...@gmail.com> wrote (on Fri, Feb 28, 2025, 17:50):
>>>>> >>
>>>>> >> Hi everyone,
>>>>> >> Thanks for all of the actionable, relevant feedback on the PR (
>>>>> https://github.com/apache/iceberg/pull/12298).
>>>>> >> Updated the code to address most of them. Please check if you
>>>>> agree with the general approach.
>>>>> >> If there is a consensus about the general approach, I could
>>>>> separate the PR out into smaller pieces so we can have an easier time
>>>>> reviewing and merging those step-by-step.
>>>>> >> Thanks,
>>>>> >> Peter
>>>>> >>
>>>>> >> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Thu, Feb 20, 2025, 14:14):
>>>>> >>>
>>>>> >>> Hi Peter
>>>>> >>>
>>>>> >>> sorry for the late reply on this.
>>>>> >>>
>>>>> >>> I did a pass on the proposal, it's very interesting and well
>>>>> written. I like the DataFile API and it is definitely worth discussing
>>>>> all together.
>>>>> >>>
>>>>> >>> Maybe we can schedule a specific meeting to discuss the DataFile
>>>>> API?
>>>>> >>>
>>>>> >>> Thoughts ?
>>>>> >>>
>>>>> >>> Regards
>>>>> >>> JB
>>>>> >>>
>>>>> >>> On Tue, Feb 11, 2025 at 5:46 PM Péter Váry <
>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>> >>> >
>>>>> >>> > Hi Team,
>>>>> >>> >
>>>>> >>> > As mentioned earlier on our Community Sync, I am exploring the
>>>>> possibility of defining a FileFormat API for accessing different file
>>>>> formats. I have put together a proposal based on my findings.
>>>>> >>> >
>>>>> >>> > -------------------
>>>>> >>> > Iceberg currently supports 3 different file formats: Avro,
>>>>> Parquet, ORC. With the introduction of the Iceberg V3 specification,
>>>>> many new features are added to Iceberg. Some of these features, like
>>>>> new column types and default values, require changes at the file
>>>>> format level. The changes are added by individual developers with
>>>>> different focus on the different file formats. As a result, not all of
>>>>> the features are available for every supported file format.
>>>>> >>> > Also, there are emerging file formats like Vortex [1] or Lance
>>>>> [2] which, either by specialization or by applying newer research
>>>>> results, could provide better alternatives for certain use cases, like
>>>>> random access for data or storing ML models.
>>>>> >>> > -------------------
>>>>> >>> >
>>>>> >>> > Please check the detailed proposal [3] and the google document
>>>>> [4], and comment there or reply on the dev list if you have any
>>>>> suggestions.
>>>>> >>> >
>>>>> >>> > Thanks,
>>>>> >>> > Peter
>>>>> >>> >
>>>>> >>> > [1] - https://github.com/spiraldb/vortex
>>>>> >>> > [2] - https://lancedb.github.io/lance/
>>>>> >>> > [3] - https://github.com/apache/iceberg/issues/12225
>>>>> >>> > [4] - https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds