Re: [DISCUSS] FileFormat API proposal

Péter Váry Mon, 15 Sep 2025 08:36:33 -0700

Thanks for the feedback @Russell and @Renjie!

Updated the PR accordingly.
Also removed the possibility to set the row schema for the position delete
writer. We will not need that after the PDWR deprecation.


You can see one possible implementation in
https://github.com/apache/iceberg/pull/12298 - we can discuss that
separately. I just made sure that the new API is able to serve all of the
current needs.

@Ryan: What are your thoughts?

Are we in a stage when we can vote on the current API?

Thanks,
Peter


Renjie Liu <[email protected]> ezt írta (időpont: 2025. szept. 15.,
H, 12:08):

> I would also vote for option 0. This api has clean separation and makes
> refactoring easier, e.g. when we completely deprecate v2 table, we could
> mark the *positionDeleteWriteBuilder *method as deprecated, and it would
> be easier to remove its usage.
>
> On Fri, Sep 12, 2025 at 11:24 PM Russell Spitzer <
> [email protected]> wrote:
>
>> Now that I fully understand the situation I think option 0 as you've
>> written is probably the best thing to do as long as PositionDelete is a
>> class. I
>> think with hindsight it probably shouldn't have been a class and always
>> been an interface so that our internal code could produce rows which
>> implement PositionDelete rather than PositionDeletes that wrap rows.
>>
>> On Fri, Sep 12, 2025 at 8:02 AM Péter Váry <[email protected]>
>> wrote:
>>
>>> Let me summarize the state a bit:
>>>
>>> The FileFormat interface needs to expose two distinct methods:
>>>
>>>    - WriteBuilder<InternalRow>
>>>    - WriteBuilder<PositionDelete<InternalRow>>
>>>       - After the PDWR deprecation this will be
>>>       WriteBuilder<PositionDelete>
>>>       - After V2 deprecation this will be not needed anymore
>>>
>>> Based on the file format methods, the Registry must support four builder
>>> types:
>>>
>>>    - WriteBuilder<InternalRow>
>>>    - DataWriteBuilder<InternalRow>
>>>    - EqualityDeleteWriteBuilder<InternalRow>
>>>    - PositionDeleteWriteBuilder<InternalRow>
>>>
>>>
>>> *API Design Considerations*
>>> There is an argument that the two WriteBuilder methods provided by
>>> FileFormat are essentially the same, differing only in the writerFunction.
>>> While this is technically correct for current implementations, I believe
>>> the API should clearly distinguish between the two writer types to
>>> highlight the differences.
>>>
>>> *Discussed Approaches*
>>>
>>> *0. Two Explicit Methods on FormatModel* (removed based on previous
>>> comments, but I personally still prefer this)
>>>
>>> *WriteBuilder<InternalRow> writeBuilder(OutputFile outputFile);*
>>> *WriteBuilder<PositionDelete<InternalRow>>
>>> positionDeleteWriteBuilder(OutputFile outputFile); *
>>>
>>>
>>> Pros: Clear separation of responsibilities
>>>
>>> *1. One Builder + One Converter*
>>>
>>> *WriteBuilder<InternalRow> writeBuilder(OutputFile outputFile);*
>>> *Function<PositionDelete<D>, D> positionDeleteConverter(Schema schema);*
>>>
>>>
>>> Pros: Keeps the interface compact
>>> Cons: Requires additional documentation and understanding why the
>>> conversion logic is needed
>>>
>>> *2. Single Method with Javadoc Clarification* (most similar to the
>>> current approach)
>>>
>>> *WriteBuilder writeBuilder(OutputFile outputFile); *
>>>
>>>
>>> Pros: Minimalistic
>>> Cons: Least explicit; relies entirely on documentation
>>>
>>> *2/b. Single Builder with Type Parameter *(based on Russell's
>>> suggestion)
>>>
>>> *WriteBuilder writeBuilder(OutputFile outputFile);*
>>> *// Usage: builder.build(Class<D> inputType)*
>>>
>>>
>>> Pros: Flexible
>>> Cons: Relies on documentation to clarify the available input types
>>>
>>> *Bonus*
>>> Options 0 and 1 make it easier to phase out PositionDelete filtering
>>> once V2 tables are deprecated.
>>>
>>> Thanks,
>>> Peter
>>>
>>> Péter Váry <[email protected]> ezt írta (időpont: 2025.
>>> szept. 11., Cs, 18:36):
>>>
>>>> > Wouldn't PositionDelete<InternalRow> also be an InternalRow in this
>>>> example? I think that's what i'm confused about.
>>>>
>>>> With the *second approach*, the WriteBuilder doesn’t need to handle
>>>> PositionDelete objects directly. The conversion layer takes care of
>>>> that, so the WriteBuilder only needs to work with InternalRow.
>>>>
>>>> With the *first approach*, we shift that responsibility to the
>>>> WriteBuilder, which then has to support both InternalRow and
>>>> PositionDelete<InternalRow>.
>>>>
>>>> In both cases, the FormatModelRegistry API will still expose the more
>>>> concrete types (PositionDelete / InternalRow). However, under the *first
>>>> approach*, the lower-level API only needs to handle InternalRow,
>>>> simplifying its interface.
>>>> Thanks,
>>>> Peter
>>>>
>>>> Russell Spitzer <[email protected]> ezt írta (időpont: 2025.
>>>> szept. 11., Cs, 17:12):
>>>>
>>>>> Wouldn't PositionDelete<InternalRow> also be an InternalRow in this
>>>>> example? I think that's what i'm confused about.
>>>>>
>>>>> On Thu, Sep 11, 2025 at 5:35 AM Péter Váry <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Thanks, Russell, for taking a look at this!
>>>>>>
>>>>>> We need to expose four methods on the user-facing API (
>>>>>> FormatModelRegistry):
>>>>>>
>>>>>>    1. *writeBuilder* – for writing arbitrary files without Iceberg
>>>>>>    metadata. In the Iceberg codebase, this is exposed via 
>>>>>> FlinkAppenderFactory and
>>>>>>    the GenericAppenderFactory for creating FileAppender<RowData> and
>>>>>>    FileAppender<Record> only.
>>>>>>    2. *dataWriteBuilder* – for creating and collecting metadata for
>>>>>>    Iceberg DataFiles.
>>>>>>    3. *equalityDeleteWriteBuilder* – for creating and collecting
>>>>>>    metadata for Iceberg EqualityDeleteFiles.
>>>>>>    4. *positionDeleteWriteBuilder* – for creating and collecting
>>>>>>    metadata for Iceberg PositionDeleteFiles.
>>>>>>
>>>>>> We’d like to implement all four using a single WriteBuilder created
>>>>>> by the FormatModels.
>>>>>>
>>>>>> Your suggestion is a good one—it helps formalize the requirements for
>>>>>> the build method and also surfaces an important design question:
>>>>>>
>>>>>> *Who should be responsible for handling the differences between
>>>>>> normal rows (InternalRow) and position deletes
>>>>>> (PositionDelete<InternalRow>)*?
>>>>>>
>>>>>>    - Should we have a more complex WriteBuilder class that can
>>>>>>    create both DataFileAppender and PositionDeleteAppender?
>>>>>>    - Or should we push this responsibility to the engine-specific
>>>>>>    code, where we already have some logic (e.g., pathTransformFunc)
>>>>>>    needed by each engine to create the PositionDeleteAppender?
>>>>>>
>>>>>> Thanks,
>>>>>> Peter
>>>>>>
>>>>>>
>>>>>> Russell Spitzer <[email protected]> ezt írta (időpont: 2025.
>>>>>> szept. 11., Cs, 0:11):
>>>>>>
>>>>>>> I'm a little confused here, I think Ryan mentioned this in the
>>>>>>> comment here
>>>>>>> https://github.com/apache/iceberg/pull/12774/files#r2254967177
>>>>>>>
>>>>>>> From my understanding there are two options?
>>>>>>>
>>>>>>> 1) We either are producing FormatModels that take a generic row type
>>>>>>> D and produce writers that all take D and write files.
>>>>>>>
>>>>>>> 2) we are creating IcebergModel specific writers that take DataFile,
>>>>>>> PositionDeleteFile, EqualityDeleteFile etc ... and write files
>>>>>>>
>>>>>>> The PositionDelete Converter issue seems to stem from attempting to
>>>>>>> do both model 1 (being very generic) and 2, wanting special code to deal
>>>>>>> with PositionDeleteFile<R> objects.
>>>>>>>
>>>>>>> It looks like the code in #12774 is mostly doing model 1, but we are
>>>>>>> trying to add in a specific converter for 2?
>>>>>>>
>>>>>>> Maybe i'm totally lost here but I was assuming we would do
>>>>>>> something a little scala-y like
>>>>>>>
>>>>>>> public <T> FileAppender<T> build(Class<T> type) {
>>>>>>> if (type == DataFile.class) return (FileAppender<T>) new
>>>>>>> DataFileAppender();
>>>>>>> if (type == DeleteFile.class) return (FileAppender<T>) new
>>>>>>> DeleteFileAppender();
>>>>>>> // ...
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> So that we only register a single signature and if writer specific
>>>>>>> implementation needs to do something special it can? I'm trying to catch
>>>>>>> back up to speed on this PR so it may help to do a quick summary of the
>>>>>>> current state and intent. (At least for me)
>>>>>>>
>>>>>>> On Tue, Sep 9, 2025 at 3:42 AM Péter Váry <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Renjie,
>>>>>>>> Thanks for taking a look!
>>>>>>>>
>>>>>>>> Let me clarify a few points:
>>>>>>>> - The converter API is only required for writing position delete
>>>>>>>> files for V2 tables
>>>>>>>> - Currently, there are no plans to support vectorized writing via
>>>>>>>> the java API
>>>>>>>> - Even if we decide to support vectorized writes, I don't think we
>>>>>>>> would like to implement it for Positional Deletes, which are 
>>>>>>>> deprecated in
>>>>>>>> the new spec.
>>>>>>>> - Also, once the positional deletes - which contain the deleted
>>>>>>>> rows - are deprecated (as planned), the conversion of the Position 
>>>>>>>> Deletes
>>>>>>>> with only file name and position would be trivial, even for the 
>>>>>>>> vectorized
>>>>>>>> writes.
>>>>>>>>
>>>>>>>> So from my perspective, the converter method exists purely for
>>>>>>>> backward compatibility, and we intend to remove it as soon as possible.
>>>>>>>> Sacrificing good practices for the sake of a deprecated feature doesn’t
>>>>>>>> seem worthwhile to me.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Peter
>>>>>>>>
>>>>>>>> Renjie Liu <[email protected]> ezt írta (időpont: 2025.
>>>>>>>> szept. 8., H, 12:34):
>>>>>>>>
>>>>>>>>> Hi, Peter:
>>>>>>>>>
>>>>>>>>> I would vote for the first approach. In spite of the compromises
>>>>>>>>> described, the api is still cleaner. Also I think there are some 
>>>>>>>>> problems
>>>>>>>>> with the converter api. For example, for vectorized implementations 
>>>>>>>>> such as
>>>>>>>>> comet which accepts columnar batch rather than rows, the converter 
>>>>>>>>> method
>>>>>>>>> would make things more complicated.
>>>>>>>>>
>>>>>>>>> On Sat, Aug 30, 2025 at 2:49 PM Péter Váry <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I’ve initiated a discussion thread regarding the deprecation of
>>>>>>>>>> Position Deletes containing row data. You can follow it here:
>>>>>>>>>> https://lists.apache.org/thread/8jw6pb2vq3ghmdqf1yvy8n5n6gg1fq5s
>>>>>>>>>>
>>>>>>>>>> We can proceed with the discussion about the native reader/writer
>>>>>>>>>> deprecation when we decided on the final API, as the chosen design 
>>>>>>>>>> may
>>>>>>>>>> influence our approach.
>>>>>>>>>>
>>>>>>>>>> Since then, one more question has come up - hopefully the last:
>>>>>>>>>> *How should we handle Position Delete Writers?*
>>>>>>>>>> The File Format API should return builders for either rows or
>>>>>>>>>> PositionDelete objects. Currently the method
>>>>>>>>>> `WriteBuilder.createWriterFunc(Function<MessageType,
>>>>>>>>>> ParquetValueWriter<?>>)` defines the accepted input parameters for 
>>>>>>>>>> the
>>>>>>>>>> writer. Users are responsible for ensuring that the writer function 
>>>>>>>>>> and the
>>>>>>>>>> return type of the `WriteBuilder.build()` are compatible. In the new 
>>>>>>>>>> API,
>>>>>>>>>> we no longer expose writer functions. We still expose FileContent, 
>>>>>>>>>> since
>>>>>>>>>> writer configurations vary by content type, but we don’t expose the 
>>>>>>>>>> types.
>>>>>>>>>>
>>>>>>>>>> There are two proposals for handling types for the WriteBuilders:
>>>>>>>>>>
>>>>>>>>>>    1. *Implicit Type Definition via FileContent* - the builder
>>>>>>>>>>    parameter for FileContent would implicitly define the input type 
>>>>>>>>>> for the
>>>>>>>>>>    writer returned by build(), or
>>>>>>>>>>    2. *Engine level conversion* - Engines would convert
>>>>>>>>>>    PositionDelete objects to their native types.
>>>>>>>>>>
>>>>>>>>>> In code:
>>>>>>>>>>
>>>>>>>>>>    - In the 1st proposal, the
>>>>>>>>>>    FormatModel.writeBuilder(OutputFile outputFile) can return 
>>>>>>>>>> anything:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *    WriteBuilder builder =
>>>>>>>>>>    FormatModelRegistry.writeBuilder(PARQUET, InternalRow.class, 
>>>>>>>>>> outputFile);
>>>>>>>>>>      FileAppender<InternalRow> appender =         
>>>>>>>>>> .schema(table.schema())
>>>>>>>>>>        .content(FileContent.DATA)         ....         .build(); *
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *   // Exposed, but
>>>>>>>>>>    FormatModelRegistry.positionDeleteWriteBuilder should be used 
>>>>>>>>>> instead
>>>>>>>>>>     WriteBuilder builder = FormatModelRegistry.writeBuilder(PARQUET,
>>>>>>>>>>    InternalRow.class, outputFile);
>>>>>>>>>>     FileAppender<PositionDelete<InternalRow>> appender =
>>>>>>>>>>    .schema(table.schema())         
>>>>>>>>>> .content(FileContent.POSITION_DELETES)
>>>>>>>>>>        ....         .build();*
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - In the 2nd proposal, the FormatModel needs another method:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *Function<PositionDelete<D>, D> positionDeleteConverter(Schema
>>>>>>>>>> schema);    *example implementation:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *    return delete -> {      deleteRecord.update(0,
>>>>>>>>>> UTF8String.fromString(delete.path().toString()));
>>>>>>>>>> deleteRecord.update(1, delete.pos());      deleteRecord.update(2,
>>>>>>>>>> delete.row());      return deleteRecord;    };*
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *    // Content is only used for writer property configuration
>>>>>>>>>>      WriteBuilder<InternalRow> builder =
>>>>>>>>>>    sparkFormatModel.writeBuilder(outputFile);
>>>>>>>>>>      FileAppender<InternalRow> appender =         
>>>>>>>>>> .schema(table.schema())
>>>>>>>>>>        .content(FileContent.DATA)         ....         .build();*
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Drawbacks
>>>>>>>>>>
>>>>>>>>>>    - Proposal 1:
>>>>>>>>>>       - Type checking for the FileAppenders occurs only at
>>>>>>>>>>       runtime, so user errors surface late.
>>>>>>>>>>       - File Format specification must clearly specify which
>>>>>>>>>>       builder type corresponds to which file content 
>>>>>>>>>> parameter—generics would
>>>>>>>>>>       offer better clarity.
>>>>>>>>>>       - Inconsistent patterns between WriteBuilder and
>>>>>>>>>>       ReadBuilder, as the latter can define output types via 
>>>>>>>>>> generics.
>>>>>>>>>>    - Proposal 2:
>>>>>>>>>>       - Requires FormatModels to implement a converter method to
>>>>>>>>>>       transform PositionDelete<InternalRow> into InternalRow.
>>>>>>>>>>
>>>>>>>>>> Since we deprecated writing position delete files in the V3 spec,
>>>>>>>>>> this extra method in the 2nd proposal will be deprecated too. As a 
>>>>>>>>>> result,
>>>>>>>>>> in the long run, we will have a nice, clean API.
>>>>>>>>>> OTOH, if we accept the compromise described in the 1st proposal,
>>>>>>>>>> the results of our decision will remain, even when the functions are
>>>>>>>>>> removed.
>>>>>>>>>>
>>>>>>>>>> Looking forward to your thoughts.
>>>>>>>>>> Thanks, Peter
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 14, 2025, 14:12 Péter Váry <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Team,
>>>>>>>>>>>
>>>>>>>>>>> During yesterday’s community sync, we discussed the current
>>>>>>>>>>> state of the File Format API proposal and identified two key 
>>>>>>>>>>> questions that
>>>>>>>>>>> require input from the broader community:
>>>>>>>>>>>
>>>>>>>>>>> *1. Dropping support for Position Delete files with Row Data*
>>>>>>>>>>>
>>>>>>>>>>> The current Iceberg V2 spec [1] defines two types of position
>>>>>>>>>>> delete files:
>>>>>>>>>>>
>>>>>>>>>>>    - Files that store only the file name and row position.
>>>>>>>>>>>    - Files that also store the deleted row data.
>>>>>>>>>>>
>>>>>>>>>>> Although this feature is defined in the spec and some tests
>>>>>>>>>>> exist in the Iceberg codebase, we’re not aware of any actual 
>>>>>>>>>>> implementation
>>>>>>>>>>> using the second type (with row data). Supporting V2 table writing 
>>>>>>>>>>> via the
>>>>>>>>>>> new File Format API would be simpler if we dropped support for this 
>>>>>>>>>>> feature.
>>>>>>>>>>> If you know of any use case or reason to retain support for
>>>>>>>>>>> position deletes with row data, please let us know.
>>>>>>>>>>>
>>>>>>>>>>> *2. Deprecating Native File Format Readers/Writers in the API*
>>>>>>>>>>>
>>>>>>>>>>> The current API contains format-specific readers/writers for
>>>>>>>>>>> Parquet, Avro, and ORC. With the introduction of the InternalData 
>>>>>>>>>>> and File
>>>>>>>>>>> Format APIs, Iceberg users can now write files using:
>>>>>>>>>>>
>>>>>>>>>>>    - InternalData API for metadata files (manifest, manifest
>>>>>>>>>>>    list, partition stats).
>>>>>>>>>>>    - File Format API for data and delete files.
>>>>>>>>>>>
>>>>>>>>>>> I propose we deprecate the original format-specific writers and
>>>>>>>>>>> guide users to use the new APIs based on the target file type. If 
>>>>>>>>>>> you’re
>>>>>>>>>>> aware of any use cases that still require the original 
>>>>>>>>>>> format-specific
>>>>>>>>>>> writers, please share them.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Peter
>>>>>>>>>>>
>>>>>>>>>>> [1] - Position Delete File Spec:
>>>>>>>>>>> https://iceberg.apache.org/spec/?h=delete#position-delete-files
>>>>>>>>>>>
>>>>>>>>>>> Péter Váry <[email protected]> ezt írta (időpont:
>>>>>>>>>>> 2025. júl. 22., K, 16:09):
>>>>>>>>>>>
>>>>>>>>>>>> Also put together a solution where the Engine specific format
>>>>>>>>>>>> transformation is separated from the writer, and the engines need 
>>>>>>>>>>>> to take
>>>>>>>>>>>> care of it separately.
>>>>>>>>>>>> This is somewhat complicated on the implementation side (see:
>>>>>>>>>>>> [RowDataTransformer](
>>>>>>>>>>>> https://github.com/apache/iceberg/pull/12298/files#diff-562fa4cc369c908a157f59a9235fd3f389096451e7901686fba37c87b53dee08),
>>>>>>>>>>>> and [InternalRowTransformer](
>>>>>>>>>>>> https://github.com/apache/iceberg/pull/12298/files#diff-546f9dc30e3207d1d2bc0a2722976b55f5a04dcf85a22855e4f400500c317140)),
>>>>>>>>>>>> but simplifies the API.
>>>>>>>>>>>>
>>>>>>>>>>>> @rdblue: Please check the proposed solution. I think this is
>>>>>>>>>>>> what you have suggested
>>>>>>>>>>>>
>>>>>>>>>>>> Péter Váry <[email protected]> ezt írta (időpont:
>>>>>>>>>>>> 2025. jún. 30., H, 18:42):
>>>>>>>>>>>>
>>>>>>>>>>>>> During the PR review [1], we began exploring what could we use
>>>>>>>>>>>>> as an intermediate layer to reduce the need for engines and file 
>>>>>>>>>>>>> formats to
>>>>>>>>>>>>> implement the full matrix of file format - object model 
>>>>>>>>>>>>> conversions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> To support this discussion, I’ve created and run a set of
>>>>>>>>>>>>> performance benchmarks and compiled a document outlining the 
>>>>>>>>>>>>> potential
>>>>>>>>>>>>> benefits and trade-offs [2].
>>>>>>>>>>>>>
>>>>>>>>>>>>> Feedback is welcome, feel free to comment on the document, the
>>>>>>>>>>>>> PR, or directly in this thread.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] - PR discussion -
>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/12774#discussion_r2093626096
>>>>>>>>>>>>> [2] - File Format and engine object model transformation
>>>>>>>>>>>>> performance -
>>>>>>>>>>>>> https://docs.google.com/document/d/1GdA8IowKMtS3QVdm8s-0X-ZRYetcHv2bhQ9mrSd3fd4
>>>>>>>>>>>>>
>>>>>>>>>>>>> Péter Váry <[email protected]> ezt írta (időpont:
>>>>>>>>>>>>> 2025. máj. 7., Sze, 13:15):
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>> The proposed API part is reviewed and ready to go. See:
>>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/12774
>>>>>>>>>>>>>> Thanks to everyone who reviewed it already!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Many of you wanted to review, but I know that the time
>>>>>>>>>>>>>> constraints are there for everyone. I still very much would like 
>>>>>>>>>>>>>> to hear
>>>>>>>>>>>>>> your voices, so I will not merge the PR this week. Please review 
>>>>>>>>>>>>>> it if you.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Péter Váry <[email protected]> ezt írta (időpont:
>>>>>>>>>>>>>> 2025. ápr. 16., Sze, 7:02):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Renjie,
>>>>>>>>>>>>>>> The first one for the proposed new API is here:
>>>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/12774
>>>>>>>>>>>>>>> Thanks, Peter
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Apr 16, 2025, 05:40 Renjie Liu <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi, Peter:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for the effort. I totally agree with splitting them
>>>>>>>>>>>>>>>> into smaller prs to move forward.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm quite interested in this topic, and please ping me in
>>>>>>>>>>>>>>>> those splitted prs and I'll help to review.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Apr 14, 2025 at 11:22 PM Jean-Baptiste Onofré <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Peter
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Awesome ! Thank you so much !
>>>>>>>>>>>>>>>>> I will do a new pass.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>> JB
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2025 at 3:48 PM Péter Váry <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > Hi JB,
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > Separated out the proposed interfaces to a new PR:
>>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/12774.
>>>>>>>>>>>>>>>>> > Reviewers can check that out if they are only interested
>>>>>>>>>>>>>>>>> in how the new API would look like.
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > Thanks,
>>>>>>>>>>>>>>>>> > Peter
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > Jean-Baptiste Onofré <[email protected]> ezt írta
>>>>>>>>>>>>>>>>> (időpont: 2025. ápr. 10., Cs, 18:25):
>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>> >> Hi Peter
>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>> >> Thanks for the ping about the PR.
>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>> >> Maybe, to facilitate the review and move forward
>>>>>>>>>>>>>>>>> faster, we should
>>>>>>>>>>>>>>>>> >> split the PR in smaller PRs:
>>>>>>>>>>>>>>>>> >> - one with the interfaces (ReadBuilder,
>>>>>>>>>>>>>>>>> AppenderBuilder, ObjectModel,
>>>>>>>>>>>>>>>>> >> AppenderBuilder, DataWriterBuilder, ...)
>>>>>>>>>>>>>>>>> >> - one for each file providers (Parquet, Avro, ORC)
>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>> >> Thoughts ? I can help on the split if needed.
>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>> >> Regards
>>>>>>>>>>>>>>>>> >> JB
>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>> >> On Thu, Apr 10, 2025 at 5:16 AM Péter Váry <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>> >> > Since the 1.9.0 release candidate has been created, I
>>>>>>>>>>>>>>>>> would like to resurrect this PR:
>>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/12298 to ensure
>>>>>>>>>>>>>>>>> that we have as long a testing period as possible for it.
>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>> >> > To recap, here is what the PR does after the review
>>>>>>>>>>>>>>>>> rounds:
>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>> >> > Created 3 interface classes which are implemented by
>>>>>>>>>>>>>>>>> the file formats:
>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>> >> > ReadBuilder - Builder for reading data from data files
>>>>>>>>>>>>>>>>> >> > AppenderBuilder - Builder for writing data to data
>>>>>>>>>>>>>>>>> files
>>>>>>>>>>>>>>>>> >> > ObjectModel - Providing ReadBuilders, and
>>>>>>>>>>>>>>>>> AppenderBuilders for the specific data file format and object 
>>>>>>>>>>>>>>>>> model pair
>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>> >> > Updated the Parquet, Avro, ORC implementation for
>>>>>>>>>>>>>>>>> this interfaces, and deprecated the old reader/writer APIs
>>>>>>>>>>>>>>>>> >> > Created interface classes which will be used by the
>>>>>>>>>>>>>>>>> actual readers/writers of the data files:
>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>> >> > AppenderBuilder - Builder for writing a file
>>>>>>>>>>>>>>>>> >> > DataWriterBuilder - Builder for generating a data file
>>>>>>>>>>>>>>>>> >> > PositionDeleteWriterBuilder - Builder for generating
>>>>>>>>>>>>>>>>> a position delete file
>>>>>>>>>>>>>>>>> >> > EqualityDeleteWriterBuilder - Builder for generating
>>>>>>>>>>>>>>>>> an equality delete file
>>>>>>>>>>>>>>>>> >> > No ReadBuilder here - the file format reader builder
>>>>>>>>>>>>>>>>> is reused
>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>> >> > Created a WriterBuilder class which implements the
>>>>>>>>>>>>>>>>> interfaces above
>>>>>>>>>>>>>>>>> (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder)
>>>>>>>>>>>>>>>>> based on a provided file format specific AppenderBuilder
>>>>>>>>>>>>>>>>> >> > Created an ObjectModelRegistry which stores the
>>>>>>>>>>>>>>>>> available ObjectModels, and engines and users could request 
>>>>>>>>>>>>>>>>> the readers
>>>>>>>>>>>>>>>>> (ReadBuilder) and writers
>>>>>>>>>>>>>>>>> (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder)
>>>>>>>>>>>>>>>>> from.
>>>>>>>>>>>>>>>>> >> > Created the appropriate ObjectModels:
>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>> >> > GenericObjectModels - for reading and writing Iceberg
>>>>>>>>>>>>>>>>> Records
>>>>>>>>>>>>>>>>> >> > SparkObjectModels - for reading (vectorized and
>>>>>>>>>>>>>>>>> non-vectorized) and writing Spark InternalRow/ColumnarBatch 
>>>>>>>>>>>>>>>>> objects
>>>>>>>>>>>>>>>>> >> > FlinkObjectModels - for reading and writing Flink
>>>>>>>>>>>>>>>>> RowData objects
>>>>>>>>>>>>>>>>> >> > An arrow object model is also registered for
>>>>>>>>>>>>>>>>> vectorized reads of Parquet files into Arrow ColumnarBatch 
>>>>>>>>>>>>>>>>> objects
>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>> >> > Updated the production code where the reading and
>>>>>>>>>>>>>>>>> writing happens to use the ObjectModelRegistry and the new 
>>>>>>>>>>>>>>>>> reader/writer
>>>>>>>>>>>>>>>>> interfaces to access data files
>>>>>>>>>>>>>>>>> >> > Kept the testing code intact to ensure that the new
>>>>>>>>>>>>>>>>> API/code is not breaking anything
>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>> >> > The original change was not small, and grew
>>>>>>>>>>>>>>>>> substantially during the review rounds. So if you have 
>>>>>>>>>>>>>>>>> questions, or I can
>>>>>>>>>>>>>>>>> do anything to make the review easier, don't hesitate to ask. 
>>>>>>>>>>>>>>>>> I am happy to
>>>>>>>>>>>>>>>>> do anything to move this forward.
>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>> >> > Thanks,
>>>>>>>>>>>>>>>>> >> > Peter
>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>> >> > Péter Váry <[email protected]> ezt írta
>>>>>>>>>>>>>>>>> (időpont: 2025. márc. 26., Sze, 14:54):
>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>> >> >> Hi everyone,
>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>> >> >> I have updated the File Format API PR (
>>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/12298) based on
>>>>>>>>>>>>>>>>> the answers and review comments.
>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>> >> >> I would like to merge this only after the 1.9.0
>>>>>>>>>>>>>>>>> release so we have more time finding any issues and solving 
>>>>>>>>>>>>>>>>> them before
>>>>>>>>>>>>>>>>> this goes to a release for the users.
>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>> >> >> For this I have updated the deprecation comments
>>>>>>>>>>>>>>>>> accordingly.
>>>>>>>>>>>>>>>>> >> >> I would like to ask you to review the PR, so we iron
>>>>>>>>>>>>>>>>> out any possible requested changes and be ready for the merge 
>>>>>>>>>>>>>>>>> as soon as
>>>>>>>>>>>>>>>>> possible after the 1.9.0 release.
>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>> >> >> Thanks,
>>>>>>>>>>>>>>>>> >> >> Peter
>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>> >> >> Péter Váry <[email protected]> ezt írta
>>>>>>>>>>>>>>>>> (időpont: 2025. márc. 21., P, 14:32):
>>>>>>>>>>>>>>>>> >> >>>
>>>>>>>>>>>>>>>>> >> >>> Hi Renije,
>>>>>>>>>>>>>>>>> >> >>>
>>>>>>>>>>>>>>>>> >> >>> > 1. File format filters
>>>>>>>>>>>>>>>>> >> >>> >
>>>>>>>>>>>>>>>>> >> >>> > Do the filters include both filter expressions
>>>>>>>>>>>>>>>>> from both user query and delete filter?
>>>>>>>>>>>>>>>>> >> >>>
>>>>>>>>>>>>>>>>> >> >>> The current discussion is about the filters from
>>>>>>>>>>>>>>>>> the user query.
>>>>>>>>>>>>>>>>> >> >>>
>>>>>>>>>>>>>>>>> >> >>> About the delete filter:
>>>>>>>>>>>>>>>>> >> >>> Based on the suggestions on the PR, I have moved
>>>>>>>>>>>>>>>>> the delete filter out from the main API. Created a 
>>>>>>>>>>>>>>>>> `SupportsDeleteFilter`
>>>>>>>>>>>>>>>>> interface for it which would allow pushing down to the filter 
>>>>>>>>>>>>>>>>> to Parquet
>>>>>>>>>>>>>>>>> vectorized readers in Spark, as this is the only place where 
>>>>>>>>>>>>>>>>> we currently
>>>>>>>>>>>>>>>>> implemented this feature.
>>>>>>>>>>>>>>>>> >> >>>
>>>>>>>>>>>>>>>>> >> >>>
>>>>>>>>>>>>>>>>> >> >>> Renjie Liu <[email protected]> ezt írta
>>>>>>>>>>>>>>>>> (időpont: 2025. márc. 21., P, 14:11):
>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>> >> >>>> Hi, Peter:
>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>> >> >>>> Thanks for the effort on this.
>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>> >> >>>> 1. File format filters
>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>> >> >>>> Do the filters include both filter expressions
>>>>>>>>>>>>>>>>> from both user query and delete filter?
>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>> >> >>>> For filters from user query, I agree with you that
>>>>>>>>>>>>>>>>> we should keep the current behavior.
>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>> >> >>>> For delete filters associated with data files, at
>>>>>>>>>>>>>>>>> first I thought file format readers should not care about 
>>>>>>>>>>>>>>>>> this. But now I
>>>>>>>>>>>>>>>>> realized that maybe we need to also push it to file reader, 
>>>>>>>>>>>>>>>>> this is useful
>>>>>>>>>>>>>>>>> when `IS_DELETED` metadata column is not necessary and we 
>>>>>>>>>>>>>>>>> could use these
>>>>>>>>>>>>>>>>> filters (position deletes, etc) to further prune data.
>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>> >> >>>> But anyway, I agree that we could postpone it in
>>>>>>>>>>>>>>>>> follow up pr.
>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>> >> >>>> 2. Batch size configuration
>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>> >> >>>> I'm leaning toward option 2.
>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>> >> >>>> 3. Spark configuration
>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>> >> >>>> I'm leaning towards using different configuration
>>>>>>>>>>>>>>>>> objects.
>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>> >> >>>> On Thu, Mar 20, 2025 at 10:23 PM Péter Váry <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>>> >> >>>>> Hi Team,
>>>>>>>>>>>>>>>>> >> >>>>> Thanks everyone for the reviews on
>>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/12298!
>>>>>>>>>>>>>>>>> >> >>>>> I have addressed most of comments, but a few
>>>>>>>>>>>>>>>>> questions still remain which might merit a bit wider audience:
>>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>>> >> >>>>> We should decide on the expected filtering
>>>>>>>>>>>>>>>>> behavior when the filters are pushed down to the readers. 
>>>>>>>>>>>>>>>>> Currently the
>>>>>>>>>>>>>>>>> filters are applied as best effort for the file format 
>>>>>>>>>>>>>>>>> readers. Some
>>>>>>>>>>>>>>>>> readers (Avro) just skip them altogether. There was a 
>>>>>>>>>>>>>>>>> suggestion on the PR
>>>>>>>>>>>>>>>>> that we might enforce more strict requirements and the 
>>>>>>>>>>>>>>>>> readers either
>>>>>>>>>>>>>>>>> reject part of the filters, or they could apply them fully.
>>>>>>>>>>>>>>>>> >> >>>>> Batch sizes are currently parameters for the
>>>>>>>>>>>>>>>>> reader builders which could be set for non-vectorized readers 
>>>>>>>>>>>>>>>>> too which
>>>>>>>>>>>>>>>>> could be confusing.
>>>>>>>>>>>>>>>>> >> >>>>> Currently the Spark batch reader uses different
>>>>>>>>>>>>>>>>> configuration objects for ParquetBatchReadConf and 
>>>>>>>>>>>>>>>>> OrcBatchReadConf as
>>>>>>>>>>>>>>>>> requested by the reviewers of the Comet PR. There was a 
>>>>>>>>>>>>>>>>> suggestion on the
>>>>>>>>>>>>>>>>> current PR to use a common configuration instead.
>>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>>> >> >>>>> I would be interested in hearing your thoughts
>>>>>>>>>>>>>>>>> about these topics.
>>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>>> >> >>>>> My current take:
>>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>>> >> >>>>> File format filters: I am leaning towards keeping
>>>>>>>>>>>>>>>>> the current laninet behavior. Especially since Bloom filters 
>>>>>>>>>>>>>>>>> are not able
>>>>>>>>>>>>>>>>> to do a full filtering, and are often used as a way to filter 
>>>>>>>>>>>>>>>>> out unwanted
>>>>>>>>>>>>>>>>> records. Another option would be to implement a secondary 
>>>>>>>>>>>>>>>>> filtering inside
>>>>>>>>>>>>>>>>> the file formats themselves which I think would cause extra 
>>>>>>>>>>>>>>>>> complexity, and
>>>>>>>>>>>>>>>>> possible code duplication. Whatever the decision here, I 
>>>>>>>>>>>>>>>>> would suggest
>>>>>>>>>>>>>>>>> moving this out to a next PR as the current changeset is big 
>>>>>>>>>>>>>>>>> enough as it
>>>>>>>>>>>>>>>>> is.
>>>>>>>>>>>>>>>>> >> >>>>> Batch size configuration: Currently this is the
>>>>>>>>>>>>>>>>> only property which is different in the batch readers and the
>>>>>>>>>>>>>>>>> non-vectorized readers. I see 3 possible solutions:
>>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>>> >> >>>>> Create different builders for vectorized and
>>>>>>>>>>>>>>>>> non-vectorized reads - I don't think the current solution is 
>>>>>>>>>>>>>>>>> confusing
>>>>>>>>>>>>>>>>> enough to worth the extra class
>>>>>>>>>>>>>>>>> >> >>>>> We could put this to the reader configuration
>>>>>>>>>>>>>>>>> property set - This could work, but "hide" the possible 
>>>>>>>>>>>>>>>>> configuration mode
>>>>>>>>>>>>>>>>> which is valid for both Parquet and ORC readers
>>>>>>>>>>>>>>>>> >> >>>>> We could keep things as it is now - I would chose
>>>>>>>>>>>>>>>>> this one, but I don't have a strong opinion here
>>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>>> >> >>>>> Spark configuration: TBH, I'm open to bot
>>>>>>>>>>>>>>>>> solution and happy to move to the direction the community 
>>>>>>>>>>>>>>>>> decides on
>>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>>> >> >>>>> Thanks,
>>>>>>>>>>>>>>>>> >> >>>>> Peter
>>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>>> >> >>>>> Jean-Baptiste Onofré <[email protected]> ezt írta
>>>>>>>>>>>>>>>>> (időpont: 2025. márc. 14., P, 16:31):
>>>>>>>>>>>>>>>>> >> >>>>>>
>>>>>>>>>>>>>>>>> >> >>>>>> Hi Peter
>>>>>>>>>>>>>>>>> >> >>>>>>
>>>>>>>>>>>>>>>>> >> >>>>>> Thanks for the update. I will do a new pass on
>>>>>>>>>>>>>>>>> the PR.
>>>>>>>>>>>>>>>>> >> >>>>>>
>>>>>>>>>>>>>>>>> >> >>>>>> Regards
>>>>>>>>>>>>>>>>> >> >>>>>> JB
>>>>>>>>>>>>>>>>> >> >>>>>>
>>>>>>>>>>>>>>>>> >> >>>>>> On Thu, Mar 13, 2025 at 1:16 PM Péter Váry <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>> >> >>>>>> >
>>>>>>>>>>>>>>>>> >> >>>>>> > Hi Team,
>>>>>>>>>>>>>>>>> >> >>>>>> > I have rebased the File Format API proposal (
>>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/12298) to include
>>>>>>>>>>>>>>>>> the new changes needed for the Variant types. I would love to 
>>>>>>>>>>>>>>>>> hear your
>>>>>>>>>>>>>>>>> feedback, especially Dan and Ryan, as you were the most 
>>>>>>>>>>>>>>>>> active during our
>>>>>>>>>>>>>>>>> discussions. If I can help in any way to make the review 
>>>>>>>>>>>>>>>>> easier, please let
>>>>>>>>>>>>>>>>> me know.
>>>>>>>>>>>>>>>>> >> >>>>>> > Thanks,
>>>>>>>>>>>>>>>>> >> >>>>>> > Peter
>>>>>>>>>>>>>>>>> >> >>>>>> >
>>>>>>>>>>>>>>>>> >> >>>>>> > Péter Váry <[email protected]> ezt
>>>>>>>>>>>>>>>>> írta (időpont: 2025. febr. 28., P, 17:50):
>>>>>>>>>>>>>>>>> >> >>>>>> >>
>>>>>>>>>>>>>>>>> >> >>>>>> >> Hi everyone,
>>>>>>>>>>>>>>>>> >> >>>>>> >> Thanks for all of the actionable, relevant
>>>>>>>>>>>>>>>>> feedback on the PR (
>>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/12298).
>>>>>>>>>>>>>>>>> >> >>>>>> >> Updated the code to address most of them.
>>>>>>>>>>>>>>>>> Please check if you agree with the general approach.
>>>>>>>>>>>>>>>>> >> >>>>>> >> If there is a consensus about the general
>>>>>>>>>>>>>>>>> approach, I could. separate out the PR to smaller pieces so 
>>>>>>>>>>>>>>>>> we can have an
>>>>>>>>>>>>>>>>> easier time to review and merge those step-by-step.
>>>>>>>>>>>>>>>>> >> >>>>>> >> Thanks,
>>>>>>>>>>>>>>>>> >> >>>>>> >> Peter
>>>>>>>>>>>>>>>>> >> >>>>>> >>
>>>>>>>>>>>>>>>>> >> >>>>>> >> Jean-Baptiste Onofré <[email protected]> ezt
>>>>>>>>>>>>>>>>> írta (időpont: 2025. febr. 20., Cs, 14:14):
>>>>>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>>>>>> >> >>>>>> >>> Hi Peter
>>>>>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>>>>>> >> >>>>>> >>> sorry for the late reply on this.
>>>>>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>>>>>> >> >>>>>> >>> I did a pass on the proposal, it's very
>>>>>>>>>>>>>>>>> interesting and well written.
>>>>>>>>>>>>>>>>> >> >>>>>> >>> I like the DataFile API and definitely worth
>>>>>>>>>>>>>>>>> to discuss all together.
>>>>>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>>>>>> >> >>>>>> >>> Maybe we can schedule a specific meeting to
>>>>>>>>>>>>>>>>> discuss about DataFile API ?
>>>>>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>>>>>> >> >>>>>> >>> Thoughts ?
>>>>>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>>>>>> >> >>>>>> >>> Regards
>>>>>>>>>>>>>>>>> >> >>>>>> >>> JB
>>>>>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>>>>>> >> >>>>>> >>> On Tue, Feb 11, 2025 at 5:46 PM Péter Váry <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Hi Team,
>>>>>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>>>>>> >> >>>>>> >>> > As mentioned earlier on our Community Sync
>>>>>>>>>>>>>>>>> I am exploring the possibility to define a FileFormat API for 
>>>>>>>>>>>>>>>>> accessing
>>>>>>>>>>>>>>>>> different file formats. I have put together a proposal based 
>>>>>>>>>>>>>>>>> on my findings.
>>>>>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>>>>>> >> >>>>>> >>> > -------------------
>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Iceberg currently supports 3 different
>>>>>>>>>>>>>>>>> file formats: Avro, Parquet, ORC. With the introduction of 
>>>>>>>>>>>>>>>>> Iceberg V3
>>>>>>>>>>>>>>>>> specification many new features are added to Iceberg. Some of 
>>>>>>>>>>>>>>>>> these
>>>>>>>>>>>>>>>>> features like new column types, default values require 
>>>>>>>>>>>>>>>>> changes at the file
>>>>>>>>>>>>>>>>> format level. The changes are added by individual developers 
>>>>>>>>>>>>>>>>> with different
>>>>>>>>>>>>>>>>> focus on the different file formats. As a result not all of 
>>>>>>>>>>>>>>>>> the features
>>>>>>>>>>>>>>>>> are available for every supported file format.
>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Also there are emerging file formats like
>>>>>>>>>>>>>>>>> Vortex [1] or Lance [2] which either by specialization, or by 
>>>>>>>>>>>>>>>>> applying
>>>>>>>>>>>>>>>>> newer research results could provide better alternatives for 
>>>>>>>>>>>>>>>>> certain
>>>>>>>>>>>>>>>>> use-cases like random access for data, or storing ML models.
>>>>>>>>>>>>>>>>> >> >>>>>> >>> > -------------------
>>>>>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Please check the detailed proposal [3] and
>>>>>>>>>>>>>>>>> the google document [4], and comment there or reply on the 
>>>>>>>>>>>>>>>>> dev list if you
>>>>>>>>>>>>>>>>> have any suggestions.
>>>>>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Thanks,
>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Peter
>>>>>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>>>>>> >> >>>>>> >>> > [1] - https://github.com/spiraldb/vortex
>>>>>>>>>>>>>>>>> >> >>>>>> >>> > [2] - https://lancedb.github.io/lance/
>>>>>>>>>>>>>>>>> >> >>>>>> >>> > [3] -
>>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/issues/12225
>>>>>>>>>>>>>>>>> >> >>>>>> >>> > [4] -
>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds
>>>>>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>

Re: [DISCUSS] FileFormat API proposal

Reply via email to