I’ve initiated a discussion thread regarding the deprecation of Position
Deletes containing row data. You can follow it here:
https://lists.apache.org/thread/8jw6pb2vq3ghmdqf1yvy8n5n6gg1fq5s

We can proceed with the discussion about the native reader/writer
deprecation once we have decided on the final API, as the chosen design may
influence our approach.

Since then, one more question has come up - hopefully the last:
*How should we handle Position Delete Writers?*
The File Format API should return builders for either rows or
PositionDelete objects. Currently the method
`WriteBuilder.createWriterFunc(Function<MessageType,
ParquetValueWriter<?>>)` defines the input type accepted by the writer, and
users are responsible for ensuring that the writer function and the return
type of `WriteBuilder.build()` are compatible. In the new API,
we no longer expose writer functions. We still expose FileContent, since
writer configurations vary by content type, but we don’t expose the types.
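For context, the current pattern looks roughly like the sketch below (simplified and from
memory; GenericParquetWriter::buildWriter only stands in for whatever writer function the
caller registers, and exception handling is omitted):

    // Nothing ties the writer function to the appender's type parameter, so the
    // caller must keep the two in sync manually.
    FileAppender<Record> appender = Parquet.write(outputFile)
        .schema(table.schema())
        .createWriterFunc(GenericParquetWriter::buildWriter) // produces writers for Iceberg Records
        .build();
    appender.addAll(records); // compiles even if the writer function targeted another object model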

There are two proposals for handling types for the WriteBuilders:

   1. *Implicit Type Definition via FileContent* - the builder parameter
   for FileContent would implicitly define the input type for the writer
   returned by build(), or
   2. *Engine level conversion* - Engines would convert PositionDelete
   objects to their native types.

In code:

   - In the 1st proposal, the builder returned by FormatModel.writeBuilder(OutputFile
   outputFile) is not parameterized, so its build() can produce an appender for
   any type:

    WriteBuilder builder =
        FormatModelRegistry.writeBuilder(PARQUET, InternalRow.class, outputFile);
    FileAppender<InternalRow> appender = builder
        .schema(table.schema())
        .content(FileContent.DATA)
        ....
        .build();

    // Exposed, but FormatModelRegistry.positionDeleteWriteBuilder should be
    // used instead
    WriteBuilder builder =
        FormatModelRegistry.writeBuilder(PARQUET, InternalRow.class, outputFile);
    FileAppender<PositionDelete<InternalRow>> appender = builder
        .schema(table.schema())
        .content(FileContent.POSITION_DELETES)
        ....
        .build();


   - In the 2nd proposal, the FormatModel needs another method:

    Function<PositionDelete<D>, D> positionDeleteConverter(Schema schema);

   An example implementation:

    return delete -> {
      deleteRecord.update(0, UTF8String.fromString(delete.path().toString()));
      deleteRecord.update(1, delete.pos());
      deleteRecord.update(2, delete.row());
      return deleteRecord;
    };
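For Spark (D = InternalRow), a complete converter could look roughly like the sketch
below; this assumes that deleteRecord is a reusable GenericInternalRow created by the
converter method, which the fragment above does not show:

    Function<PositionDelete<InternalRow>, InternalRow> positionDeleteConverter(Schema schema) {
      // Reusable three-field row: file_path, pos, row (the schema parameter is unused here)
      GenericInternalRow deleteRecord = new GenericInternalRow(3);
      return delete -> {
        deleteRecord.update(0, UTF8String.fromString(delete.path().toString()));
        deleteRecord.update(1, delete.pos());
        deleteRecord.update(2, delete.row());
        return deleteRecord;
      };
    }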

    // Content is only used for writer property configuration
    WriteBuilder<InternalRow> builder = sparkFormatModel.writeBuilder(outputFile);
    FileAppender<InternalRow> appender = builder
        .schema(table.schema())
        .content(FileContent.DATA)
        ....
        .build();
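
With such a converter in place, the engine-side write path for position deletes under
the 2nd proposal would look roughly like this (positionDeleteSchema and positionDeletes
are hypothetical placeholders for the delete schema and the records to be written):

    Function<PositionDelete<InternalRow>, InternalRow> converter =
        sparkFormatModel.positionDeleteConverter(positionDeleteSchema);

    FileAppender<InternalRow> appender = sparkFormatModel.writeBuilder(outputFile)
        .schema(positionDeleteSchema)
        .content(FileContent.POSITION_DELETES) // still only drives writer properties
        .build();

    for (PositionDelete<InternalRow> delete : positionDeletes) {
      appender.add(converter.apply(delete));
    }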


Drawbacks

   - Proposal 1:
      - Type checking for the FileAppenders occurs only at runtime, so user
      errors surface late (see the sketch after this list).
      - The File Format specification must state clearly which builder type
      corresponds to which file content parameter; generics would make this
      self-documenting.
      - Inconsistent patterns between WriteBuilder and ReadBuilder, as the
      latter can define output types via generics.
   - Proposal 2:
      - Requires FormatModels to implement a converter method to transform
      PositionDelete<InternalRow> into InternalRow.
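To make the first drawback of the 1st proposal concrete, the mismatch below compiles but
only fails once the appender is used (a sketch reusing the hypothetical calls from the
examples above; the exact exception and where it surfaces depend on the implementation):

    WriteBuilder builder =
        FormatModelRegistry.writeBuilder(PARQUET, InternalRow.class, outputFile);
    // Declared for InternalRow, but POSITION_DELETES makes the writer expect
    // PositionDelete<InternalRow>; the compiler cannot catch the mismatch.
    FileAppender<InternalRow> appender = builder
        .schema(table.schema())
        .content(FileContent.POSITION_DELETES)
        .build();
    appender.add(internalRow); // fails only here, at runtime, e.g. with a ClassCastException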

Since writing position delete files is deprecated in the V3 spec, the extra
converter method in the 2nd proposal will be deprecated along with it, so in
the long run we end up with a nice, clean API.
OTOH, if we accept the compromise described in the 1st proposal, the
consequences of that decision will stay in the API even after the position
delete functions are removed.

Looking forward to your thoughts.
Thanks, Peter

On Thu, Aug 14, 2025, 14:12 Péter Váry <peter.vary.apa...@gmail.com> wrote:

> Hi Team,
>
> During yesterday’s community sync, we discussed the current state of the
> File Format API proposal and identified two key questions that require
> input from the broader community:
>
> *1. Dropping support for Position Delete files with Row Data*
>
> The current Iceberg V2 spec [1] defines two types of position delete files:
>
>    - Files that store only the file name and row position.
>    - Files that also store the deleted row data.
>
> Although this feature is defined in the spec and some tests exist in the
> Iceberg codebase, we’re not aware of any actual implementation using the
> second type (with row data). Supporting V2 table writing via the new File
> Format API would be simpler if we dropped support for this feature.
> If you know of any use case or reason to retain support for position
> deletes with row data, please let us know.
>
> *2. Deprecating Native File Format Readers/Writers in the API*
>
> The current API contains format-specific readers/writers for Parquet,
> Avro, and ORC. With the introduction of the InternalData and File Format
> APIs, Iceberg users can now write files using:
>
>    - InternalData API for metadata files (manifest, manifest list,
>    partition stats).
>    - File Format API for data and delete files.
>
> I propose we deprecate the original format-specific writers and guide
> users to use the new APIs based on the target file type. If you’re aware of
> any use cases that still require the original format-specific writers,
> please share them.
>
> Thanks,
> Peter
>
> [1] - Position Delete File Spec:
> https://iceberg.apache.org/spec/?h=delete#position-delete-files
>
> Péter Váry <peter.vary.apa...@gmail.com> wrote (on 22 Jul 2025, Tue, 16:09):
>
>> I also put together a solution where the engine-specific format
>> transformation is separated from the writer, and the engines need to take
>> care of it separately.
>> This is somewhat complicated on the implementation side (see:
>> [RowDataTransformer](
>> https://github.com/apache/iceberg/pull/12298/files#diff-562fa4cc369c908a157f59a9235fd3f389096451e7901686fba37c87b53dee08),
>> and [InternalRowTransformer](
>> https://github.com/apache/iceberg/pull/12298/files#diff-546f9dc30e3207d1d2bc0a2722976b55f5a04dcf85a22855e4f400500c317140)),
>> but simplifies the API.
>>
>> @rdblue: Please check the proposed solution. I think this is what you
>> have suggested.
>>
>> Péter Váry <peter.vary.apa...@gmail.com> wrote (on 30 Jun 2025, Mon, 18:42):
>>
>>> During the PR review [1], we began exploring what we could use as an
>>> intermediate layer to reduce the need for engines and file formats to
>>> implement the full matrix of file format - object model conversions.
>>>
>>> To support this discussion, I’ve created and run a set of performance
>>> benchmarks and compiled a document outlining the potential benefits and
>>> trade-offs [2].
>>>
>>> Feedback is welcome, feel free to comment on the document, the PR, or
>>> directly in this thread.
>>>
>>> Thanks,
>>> Peter
>>>
>>> [1] - PR discussion -
>>> https://github.com/apache/iceberg/pull/12774#discussion_r2093626096
>>> [2] - File Format and engine object model transformation performance -
>>> https://docs.google.com/document/d/1GdA8IowKMtS3QVdm8s-0X-ZRYetcHv2bhQ9mrSd3fd4
>>>
>>> Péter Váry <peter.vary.apa...@gmail.com> wrote (on 7 May 2025, Wed, 13:15):
>>>
>>>> Hi everyone,
>>>> The proposed API part is reviewed and ready to go. See:
>>>> https://github.com/apache/iceberg/pull/12774
>>>> Thanks to everyone who reviewed it already!
>>>>
>>>> Many of you wanted to review, but I know that the time constraints are
>>>> there for everyone. I still very much would like to hear your voices, so I
>>>> will not merge the PR this week. Please review it if you can.
>>>>
>>>> Thanks,
>>>> Peter
>>>>
>>>> Péter Váry <peter.vary.apa...@gmail.com> wrote (on 16 Apr 2025, Wed, 7:02):
>>>>
>>>>> Hi Renjie,
>>>>> The first one for the proposed new API is here:
>>>>> https://github.com/apache/iceberg/pull/12774
>>>>> Thanks, Peter
>>>>>
>>>>> On Wed, Apr 16, 2025, 05:40 Renjie Liu <liurenjie2...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi, Peter:
>>>>>>
>>>>>> Thanks for the effort. I totally agree with splitting them into
>>>>>> smaller prs to move forward.
>>>>>>
>>>>>> I'm quite interested in this topic, so please ping me on those
>>>>>> split PRs and I'll help to review.
>>>>>>
>>>>>> On Mon, Apr 14, 2025 at 11:22 PM Jean-Baptiste Onofré <
>>>>>> j...@nanthrax.net> wrote:
>>>>>>
>>>>>>> Hi Peter
>>>>>>>
>>>>>>> Awesome ! Thank you so much !
>>>>>>> I will do a new pass.
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>> On Fri, Apr 11, 2025 at 3:48 PM Péter Váry <
>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>> >
>>>>>>> > Hi JB,
>>>>>>> >
>>>>>>> > Separated out the proposed interfaces to a new PR:
>>>>>>> https://github.com/apache/iceberg/pull/12774.
>>>>>>> > Reviewers can check that out if they are only interested in what
>>>>>>> the new API would look like.
>>>>>>> >
>>>>>>> > Thanks,
>>>>>>> > Peter
>>>>>>> >
>>>>>>> > Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on 10 Apr 2025, Thu, 18:25):
>>>>>>> >>
>>>>>>> >> Hi Peter
>>>>>>> >>
>>>>>>> >> Thanks for the ping about the PR.
>>>>>>> >>
>>>>>>> >> Maybe, to facilitate the review and move forward faster, we should
>>>>>>> >> split the PR into smaller PRs:
>>>>>>> >> - one with the interfaces (ReadBuilder, AppenderBuilder,
>>>>>>> ObjectModel,
>>>>>>> >> AppenderBuilder, DataWriterBuilder, ...)
>>>>>>> >> - one for each file provider (Parquet, Avro, ORC)
>>>>>>> >>
>>>>>>> >> Thoughts ? I can help on the split if needed.
>>>>>>> >>
>>>>>>> >> Regards
>>>>>>> >> JB
>>>>>>> >>
>>>>>>> >> On Thu, Apr 10, 2025 at 5:16 AM Péter Váry <
>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>> >> >
>>>>>>> >> > Since the 1.9.0 release candidate has been created, I would
>>>>>>> like to resurrect this PR:
>>>>>>> https://github.com/apache/iceberg/pull/12298 to ensure that we have
>>>>>>> as long a testing period as possible for it.
>>>>>>> >> >
>>>>>>> >> > To recap, here is what the PR does after the review rounds:
>>>>>>> >> >
>>>>>>> >> > Created 3 interface classes which are implemented by the file
>>>>>>> formats:
>>>>>>> >> >
>>>>>>> >> > ReadBuilder - Builder for reading data from data files
>>>>>>> >> > AppenderBuilder - Builder for writing data to data files
>>>>>>> >> > ObjectModel - Providing ReadBuilders, and AppenderBuilders for
>>>>>>> the specific data file format and object model pair
>>>>>>> >> >
>>>>>>> >> > Updated the Parquet, Avro, ORC implementations for these
>>>>>>> interfaces, and deprecated the old reader/writer APIs
>>>>>>> >> > Created interface classes which will be used by the actual
>>>>>>> readers/writers of the data files:
>>>>>>> >> >
>>>>>>> >> > AppenderBuilder - Builder for writing a file
>>>>>>> >> > DataWriterBuilder - Builder for generating a data file
>>>>>>> >> > PositionDeleteWriterBuilder - Builder for generating a position
>>>>>>> delete file
>>>>>>> >> > EqualityDeleteWriterBuilder - Builder for generating an
>>>>>>> equality delete file
>>>>>>> >> > No ReadBuilder here - the file format reader builder is reused
>>>>>>> >> >
>>>>>>> >> > Created a WriterBuilder class which implements the interfaces
>>>>>>> above
>>>>>>> (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder)
>>>>>>> based on a provided file format specific AppenderBuilder
>>>>>>> >> > Created an ObjectModelRegistry which stores the available
>>>>>>> ObjectModels, and from which engines and users can request the readers
>>>>>>> (ReadBuilder) and writers
>>>>>>> (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder).
>>>>>>> >> > Created the appropriate ObjectModels:
>>>>>>> >> >
>>>>>>> >> > GenericObjectModels - for reading and writing Iceberg Records
>>>>>>> >> > SparkObjectModels - for reading (vectorized and non-vectorized)
>>>>>>> and writing Spark InternalRow/ColumnarBatch objects
>>>>>>> >> > FlinkObjectModels - for reading and writing Flink RowData
>>>>>>> objects
>>>>>>> >> > An arrow object model is also registered for vectorized reads
>>>>>>> of Parquet files into Arrow ColumnarBatch objects
>>>>>>> >> >
>>>>>>> >> > Updated the production code where the reading and writing
>>>>>>> happens to use the ObjectModelRegistry and the new reader/writer 
>>>>>>> interfaces
>>>>>>> to access data files
>>>>>>> >> > Kept the testing code intact to ensure that the new API/code is
>>>>>>> not breaking anything
>>>>>>> >> >
>>>>>>> >> > The original change was not small, and grew substantially
>>>>>>> during the review rounds. So if you have questions, or I can do 
>>>>>>> anything to
>>>>>>> make the review easier, don't hesitate to ask. I am happy to do 
>>>>>>> anything to
>>>>>>> move this forward.
>>>>>>> >> >
>>>>>>> >> > Thanks,
>>>>>>> >> > Peter
>>>>>>> >> >
>>>>>>> >> > Péter Váry <peter.vary.apa...@gmail.com> wrote (on 26 Mar 2025, Wed, 14:54):
>>>>>>> >> >>
>>>>>>> >> >> Hi everyone,
>>>>>>> >> >>
>>>>>>> >> >> I have updated the File Format API PR (
>>>>>>> https://github.com/apache/iceberg/pull/12298) based on the answers
>>>>>>> and review comments.
>>>>>>> >> >>
>>>>>>> >> >> I would like to merge this only after the 1.9.0 release so we
>>>>>>> have more time finding any issues and solving them before this goes to a
>>>>>>> release for the users.
>>>>>>> >> >>
>>>>>>> >> >> For this I have updated the deprecation comments accordingly.
>>>>>>> >> >> I would like to ask you to review the PR, so we iron out any
>>>>>>> possible requested changes and be ready for the merge as soon as 
>>>>>>> possible
>>>>>>> after the 1.9.0 release.
>>>>>>> >> >>
>>>>>>> >> >> Thanks,
>>>>>>> >> >> Peter
>>>>>>> >> >>
>>>>>>> >> >> Péter Váry <peter.vary.apa...@gmail.com> wrote (on 21 Mar 2025, Fri, 14:32):
>>>>>>> >> >>>
>>>>>>> >> >>> Hi Renije,
>>>>>>> >> >>>
>>>>>>> >> >>> > 1. File format filters
>>>>>>> >> >>> >
>>>>>>> >> >>> > Do the filters include filter expressions from both the
>>>>>>> user query and the delete filter?
>>>>>>> >> >>>
>>>>>>> >> >>> The current discussion is about the filters from the user
>>>>>>> query.
>>>>>>> >> >>>
>>>>>>> >> >>> About the delete filter:
>>>>>>> >> >>> Based on the suggestions on the PR, I have moved the delete
>>>>>>> filter out from the main API. Created a `SupportsDeleteFilter` interface
>>>>>>> for it which would allow pushing down the filter to the Parquet
>>>>>>> vectorized
>>>>>>> readers in Spark, as this is the only place where we currently 
>>>>>>> implemented
>>>>>>> this feature.
>>>>>>> >> >>>
>>>>>>> >> >>>
>>>>>>> >> >>> Renjie Liu <liurenjie2...@gmail.com> wrote (on 21 Mar 2025, Fri, 14:11):
>>>>>>> >> >>>>
>>>>>>> >> >>>> Hi, Peter:
>>>>>>> >> >>>>
>>>>>>> >> >>>> Thanks for the effort on this.
>>>>>>> >> >>>>
>>>>>>> >> >>>> 1. File format filters
>>>>>>> >> >>>>
>>>>>>> >> >>>> Do the filters include filter expressions from both the
>>>>>>> user query and the delete filter?
>>>>>>> >> >>>>
>>>>>>> >> >>>> For filters from user query, I agree with you that we should
>>>>>>> keep the current behavior.
>>>>>>> >> >>>>
>>>>>>> >> >>>> For delete filters associated with data files, at first I
>>>>>>> thought file format readers should not care about this. But now I 
>>>>>>> realized
>>>>>>> that maybe we need to also push it to file reader, this is useful when
>>>>>>> `IS_DELETED` metadata column is not necessary and we could use these
>>>>>>> filters (position deletes, etc) to further prune data.
>>>>>>> >> >>>>
>>>>>>> >> >>>> But anyway, I agree that we could postpone it in follow up
>>>>>>> pr.
>>>>>>> >> >>>>
>>>>>>> >> >>>> 2. Batch size configuration
>>>>>>> >> >>>>
>>>>>>> >> >>>> I'm leaning toward option 2.
>>>>>>> >> >>>>
>>>>>>> >> >>>> 3. Spark configuration
>>>>>>> >> >>>>
>>>>>>> >> >>>> I'm leaning towards using different configuration objects.
>>>>>>> >> >>>>
>>>>>>> >> >>>>
>>>>>>> >> >>>>
>>>>>>> >> >>>> On Thu, Mar 20, 2025 at 10:23 PM Péter Váry <
>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>> >> >>>>>
>>>>>>> >> >>>>> Hi Team,
>>>>>>> >> >>>>> Thanks everyone for the reviews on
>>>>>>> https://github.com/apache/iceberg/pull/12298!
>>>>>>> >> >>>>> I have addressed most of comments, but a few questions
>>>>>>> still remain which might merit a bit wider audience:
>>>>>>> >> >>>>>
>>>>>>> >> >>>>> We should decide on the expected filtering behavior when
>>>>>>> the filters are pushed down to the readers. Currently the filters are
>>>>>>> applied on a best-effort basis by the file format readers. Some readers (Avro)
>>>>>>> just skip them altogether. There was a suggestion on the PR that we 
>>>>>>> might
>>>>>>> enforce more strict requirements and the readers either reject part of 
>>>>>>> the
>>>>>>> filters, or they could apply them fully.
>>>>>>> >> >>>>> Batch sizes are currently parameters for the reader
>>>>>>> builders which could be set for non-vectorized readers too which could 
>>>>>>> be
>>>>>>> confusing.
>>>>>>> >> >>>>> Currently the Spark batch reader uses different
>>>>>>> configuration objects for ParquetBatchReadConf and OrcBatchReadConf as
>>>>>>> requested by the reviewers of the Comet PR. There was a suggestion on 
>>>>>>> the
>>>>>>> current PR to use a common configuration instead.
>>>>>>> >> >>>>>
>>>>>>> >> >>>>> I would be interested in hearing your thoughts about these
>>>>>>> topics.
>>>>>>> >> >>>>>
>>>>>>> >> >>>>> My current take:
>>>>>>> >> >>>>>
>>>>>>> >> >>>>> File format filters: I am leaning towards keeping the
>>>>>>> current lenient behavior. Especially since Bloom filters are not able
>>>>>>> to do
>>>>>>> a full filtering, and are often used as a way to filter out unwanted
>>>>>>> records. Another option would be to implement a secondary filtering 
>>>>>>> inside
>>>>>>> the file formats themselves which I think would cause extra complexity, 
>>>>>>> and
>>>>>>> possible code duplication. Whatever the decision here, I would suggest
>>>>>>> moving this out to a next PR as the current changeset is big enough as 
>>>>>>> it
>>>>>>> is.
>>>>>>> >> >>>>> Batch size configuration: Currently this is the only
>>>>>>> property which is different in the batch readers and the non-vectorized
>>>>>>> readers. I see 3 possible solutions:
>>>>>>> >> >>>>>
>>>>>>> >> >>>>> Create different builders for vectorized and non-vectorized
>>>>>>> reads - I don't think the current solution is confusing enough to be worth
>>>>>>> the
>>>>>>> extra class
>>>>>>> >> >>>>> We could put this to the reader configuration property set
>>>>>>> - This could work, but "hide" the possible configuration mode which is
>>>>>>> valid for both Parquet and ORC readers
>>>>>>> >> >>>>> We could keep things as they are now - I would choose this one,
>>>>>>> but I don't have a strong opinion here
>>>>>>> >> >>>>>
>>>>>>> >> >>>>> Spark configuration: TBH, I'm open to both solutions and
>>>>>>> happy to move to the direction the community decides on
>>>>>>> >> >>>>>
>>>>>>> >> >>>>> Thanks,
>>>>>>> >> >>>>> Peter
>>>>>>> >> >>>>>
>>>>>>> >> >>>>> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on 14 Mar 2025, Fri, 16:31):
>>>>>>> >> >>>>>>
>>>>>>> >> >>>>>> Hi Peter
>>>>>>> >> >>>>>>
>>>>>>> >> >>>>>> Thanks for the update. I will do a new pass on the PR.
>>>>>>> >> >>>>>>
>>>>>>> >> >>>>>> Regards
>>>>>>> >> >>>>>> JB
>>>>>>> >> >>>>>>
>>>>>>> >> >>>>>> On Thu, Mar 13, 2025 at 1:16 PM Péter Váry <
>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>> >> >>>>>> >
>>>>>>> >> >>>>>> > Hi Team,
>>>>>>> >> >>>>>> > I have rebased the File Format API proposal (
>>>>>>> https://github.com/apache/iceberg/pull/12298) to include the new
>>>>>>> changes needed for the Variant types. I would love to hear your 
>>>>>>> feedback,
>>>>>>> especially Dan and Ryan, as you were the most active during our
>>>>>>> discussions. If I can help in any way to make the review easier, please 
>>>>>>> let
>>>>>>> me know.
>>>>>>> >> >>>>>> > Thanks,
>>>>>>> >> >>>>>> > Peter
>>>>>>> >> >>>>>> >
>>>>>>> >> >>>>>> >> Péter Váry <peter.vary.apa...@gmail.com> wrote (on 28 Feb 2025, Fri, 17:50):
>>>>>>> >> >>>>>> >>
>>>>>>> >> >>>>>> >> Hi everyone,
>>>>>>> >> >>>>>> >> Thanks for all of the actionable, relevant feedback on
>>>>>>> the PR (https://github.com/apache/iceberg/pull/12298).
>>>>>>> >> >>>>>> >> Updated the code to address most of them. Please check
>>>>>>> if you agree with the general approach.
>>>>>>> >> >>>>>> >> If there is a consensus about the general approach, I
>>>>>>> could separate out the PR into smaller pieces so we can have an easier
>>>>>>> time
>>>>>>> to review and merge those step-by-step.
>>>>>>> >> >>>>>> >> Thanks,
>>>>>>> >> >>>>>> >> Peter
>>>>>>> >> >>>>>> >>
>>>>>>> >> >>>>>> >> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on 20 Feb 2025, Thu, 14:14):
>>>>>>> >> >>>>>> >>>
>>>>>>> >> >>>>>> >>> Hi Peter
>>>>>>> >> >>>>>> >>>
>>>>>>> >> >>>>>> >>> sorry for the late reply on this.
>>>>>>> >> >>>>>> >>>
>>>>>>> >> >>>>>> >>> I did a pass on the proposal, it's very interesting
>>>>>>> and well written.
>>>>>>> >> >>>>>> >>> I like the DataFile API and it's definitely worth
>>>>>>> discussing all together.
>>>>>>> >> >>>>>> >>>
>>>>>>> >> >>>>>> >>> Maybe we can schedule a specific meeting to discuss
>>>>>>> the DataFile API?
>>>>>>> >> >>>>>> >>>
>>>>>>> >> >>>>>> >>> Thoughts ?
>>>>>>> >> >>>>>> >>>
>>>>>>> >> >>>>>> >>> Regards
>>>>>>> >> >>>>>> >>> JB
>>>>>>> >> >>>>>> >>>
>>>>>>> >> >>>>>> >>> On Tue, Feb 11, 2025 at 5:46 PM Péter Váry <
>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>> >> >>>>>> >>> >
>>>>>>> >> >>>>>> >>> > Hi Team,
>>>>>>> >> >>>>>> >>> >
>>>>>>> >> >>>>>> >>> > As mentioned earlier on our Community Sync I am
>>>>>>> exploring the possibility to define a FileFormat API for accessing
>>>>>>> different file formats. I have put together a proposal based on my 
>>>>>>> findings.
>>>>>>> >> >>>>>> >>> >
>>>>>>> >> >>>>>> >>> > -------------------
>>>>>>> >> >>>>>> >>> > Iceberg currently supports 3 different file formats:
>>>>>>> Avro, Parquet, ORC. With the introduction of Iceberg V3 specification 
>>>>>>> many
>>>>>>> new features are added to Iceberg. Some of these features like new 
>>>>>>> column
>>>>>>> types, default values require changes at the file format level. The 
>>>>>>> changes
>>>>>>> are added by individual developers with different focus on the different
>>>>>>> file formats. As a result not all of the features are available for 
>>>>>>> every
>>>>>>> supported file format.
>>>>>>> >> >>>>>> >>> > Also there are emerging file formats like Vortex [1]
>>>>>>> or Lance [2] which either by specialization, or by applying newer 
>>>>>>> research
>>>>>>> results could provide better alternatives for certain use-cases like 
>>>>>>> random
>>>>>>> access for data, or storing ML models.
>>>>>>> >> >>>>>> >>> > -------------------
>>>>>>> >> >>>>>> >>> >
>>>>>>> >> >>>>>> >>> > Please check the detailed proposal [3] and the
>>>>>>> google document [4], and comment there or reply on the dev list if you 
>>>>>>> have
>>>>>>> any suggestions.
>>>>>>> >> >>>>>> >>> >
>>>>>>> >> >>>>>> >>> > Thanks,
>>>>>>> >> >>>>>> >>> > Peter
>>>>>>> >> >>>>>> >>> >
>>>>>>> >> >>>>>> >>> > [1] - https://github.com/spiraldb/vortex
>>>>>>> >> >>>>>> >>> > [2] - https://lancedb.github.io/lance/
>>>>>>> >> >>>>>> >>> > [3] - https://github.com/apache/iceberg/issues/12225
>>>>>>> >> >>>>>> >>> > [4] -
>>>>>>> https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds
>>>>>>> >> >>>>>> >>> >
>>>>>>>
>>>>>>
