> Wouldn't PositionDelete<InternalRow> also be an InternalRow in this example? I think that's what I'm confused about.
To answer the question directly: no - a PositionDelete<InternalRow> is not itself an InternalRow; it is a separate wrapper object holding the file path, the position, and optionally the deleted row.

With the *second approach*, the WriteBuilder doesn't need to handle PositionDelete objects directly. The conversion layer takes care of that, so the WriteBuilder only needs to work with InternalRow. With the *first approach*, we shift that responsibility to the WriteBuilder, which then has to support both InternalRow and PositionDelete<InternalRow>. In both cases, the FormatModelRegistry API will still expose the more concrete types (PositionDelete / InternalRow). However, under the *second approach*, the lower-level API only needs to handle InternalRow, which simplifies its interface.
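To make the contrast concrete, here is a rough sketch of the two WriteBuilder shapes (names are illustrative and simplified, not the exact code from the PR):

import java.util.function.Function;
import org.apache.iceberg.FileContent;
import org.apache.iceberg.Schema;
import org.apache.iceberg.deletes.PositionDelete;
import org.apache.iceberg.io.FileAppender;
import org.apache.iceberg.io.OutputFile;

// First approach (sketch): one builder, the row type is implied by the
// FileContent, so the format-level builder has to handle both InternalRow
// and PositionDelete<InternalRow>, and the cast is only checked at runtime.
interface ContentWriteBuilder {
  ContentWriteBuilder content(FileContent content);
  ContentWriteBuilder schema(Schema schema);
  <T> FileAppender<T> build();
}

// Second approach (sketch): the format-level builder only ever sees the
// engine's row type D; an engine-supplied converter flattens
// PositionDelete<D> into D before the rows reach the appender.
interface TypedWriteBuilder<D> {
  TypedWriteBuilder<D> schema(Schema schema);
  FileAppender<D> build();
}

interface FormatModelSketch<D> {
  TypedWriteBuilder<D> writeBuilder(OutputFile outputFile);
  Function<PositionDelete<D>, D> positionDeleteConverter(Schema schema);
}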
Thanks,
Peter

Russell Spitzer <russell.spit...@gmail.com> wrote (on Thu, Sep 11, 2025, 17:12):

> Wouldn't PositionDelete<InternalRow> also be an InternalRow in this example? I think that's what I'm confused about.
>
> On Thu, Sep 11, 2025 at 5:35 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>
>> Thanks, Russell, for taking a look at this!
>>
>> We need to expose four methods on the user-facing API (FormatModelRegistry):
>>
>> 1. *writeBuilder* - for writing arbitrary files without Iceberg metadata. In the Iceberg codebase, this is exposed via FlinkAppenderFactory and GenericAppenderFactory for creating FileAppender<RowData> and FileAppender<Record> only.
>> 2. *dataWriteBuilder* - for creating and collecting metadata for Iceberg DataFiles.
>> 3. *equalityDeleteWriteBuilder* - for creating and collecting metadata for Iceberg EqualityDeleteFiles.
>> 4. *positionDeleteWriteBuilder* - for creating and collecting metadata for Iceberg PositionDeleteFiles.
>>
>> We'd like to implement all four using a single WriteBuilder created by the FormatModels.
>>
>> Your suggestion is a good one - it helps formalize the requirements for the build method and also surfaces an important design question:
>>
>> *Who should be responsible for handling the differences between normal rows (InternalRow) and position deletes (PositionDelete<InternalRow>)?*
>>
>> - Should we have a more complex WriteBuilder class that can create both DataFileAppender and PositionDeleteAppender?
>> - Or should we push this responsibility to the engine-specific code, where we already have some logic (e.g., pathTransformFunc) needed by each engine to create the PositionDeleteAppender?
>>
>> Thanks,
>> Peter
>>
>> Russell Spitzer <russell.spit...@gmail.com> wrote (on Thu, Sep 11, 2025, 0:11):
>>
>>> I'm a little confused here. I think Ryan mentioned this in the comment here: https://github.com/apache/iceberg/pull/12774/files#r2254967177
>>>
>>> From my understanding there are two options:
>>>
>>> 1) We are producing FormatModels that take a generic row type D and produce writers that all take D and write files.
>>>
>>> 2) We are creating Iceberg-model-specific writers that take DataFile, PositionDeleteFile, EqualityDeleteFile, etc., and write files.
>>>
>>> The PositionDelete converter issue seems to stem from attempting to do both: model 1 (being very generic) and model 2 (wanting special code to deal with PositionDeleteFile<R> objects).
>>>
>>> It looks like the code in #12774 is mostly doing model 1, but we are trying to add in a specific converter for 2?
>>>
>>> Maybe I'm totally lost here, but I was assuming we would do something a little scala-y like:
>>>
>>> public <T> FileAppender<T> build(Class<T> type) {
>>>   if (type == DataFile.class) return (FileAppender<T>) new DataFileAppender();
>>>   if (type == DeleteFile.class) return (FileAppender<T>) new DeleteFileAppender();
>>>   // ...
>>> }
>>>
>>> So that we only register a single signature, and if a writer-specific implementation needs to do something special, it can? I'm trying to catch back up to speed on this PR, so it may help to do a quick summary of the current state and intent. (At least for me.)
>>>
>>> On Tue, Sep 9, 2025 at 3:42 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>
>>>> Hi Renjie,
>>>> Thanks for taking a look!
>>>>
>>>> Let me clarify a few points:
>>>> - The converter API is only required for writing position delete files for V2 tables.
>>>> - Currently, there are no plans to support vectorized writing via the Java API.
>>>> - Even if we decide to support vectorized writes, I don't think we would want to implement them for positional deletes, which are deprecated in the new spec.
>>>> - Also, once the positional deletes which contain the deleted rows are deprecated (as planned), the conversion of position deletes carrying only the file name and position would be trivial, even for vectorized writes (see the sketch below).
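>>>> For illustration (a sketch only, not code from the PR; the Spark classes are just an example), a path/pos-only conversion collapses to a stateless two-field mapping:
>>>>
>>>> import java.util.function.Function;
>>>> import org.apache.iceberg.deletes.PositionDelete;
>>>> import org.apache.spark.sql.catalyst.InternalRow;
>>>> import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
>>>> import org.apache.spark.unsafe.types.UTF8String;
>>>>
>>>> // Without the row payload, no engine-specific state is needed.
>>>> Function<PositionDelete<InternalRow>, InternalRow> converter =
>>>>     delete -> new GenericInternalRow(new Object[] {
>>>>         UTF8String.fromString(delete.path().toString()), // file_path
>>>>         delete.pos()                                     // pos
>>>>     });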
>>>> So from my perspective, the converter method exists purely for backward compatibility, and we intend to remove it as soon as possible. Sacrificing good practices for the sake of a deprecated feature doesn't seem worthwhile to me.
>>>>
>>>> Thanks,
>>>> Peter
>>>>
>>>> Renjie Liu <liurenjie2...@gmail.com> wrote (on Mon, Sep 8, 2025, 12:34):
>>>>
>>>>> Hi, Peter:
>>>>>
>>>>> I would vote for the first approach. In spite of the compromises described, the API is still cleaner. Also, I think there are some problems with the converter API. For example, for vectorized implementations such as Comet, which accept columnar batches rather than rows, the converter method would make things more complicated.
>>>>>
>>>>> On Sat, Aug 30, 2025 at 2:49 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>
>>>>>> I've initiated a discussion thread regarding the deprecation of Position Deletes containing row data. You can follow it here: https://lists.apache.org/thread/8jw6pb2vq3ghmdqf1yvy8n5n6gg1fq5s
>>>>>>
>>>>>> We can proceed with the discussion about the native reader/writer deprecation once we have decided on the final API, as the chosen design may influence our approach.
>>>>>>
>>>>>> Since then, one more question has come up - hopefully the last:
>>>>>> *How should we handle Position Delete Writers?*
>>>>>> The File Format API should return builders for either rows or PositionDelete objects. Currently the method `WriteBuilder.createWriterFunc(Function<MessageType, ParquetValueWriter<?>>)` defines the accepted input parameters for the writer. Users are responsible for ensuring that the writer function and the return type of `WriteBuilder.build()` are compatible. In the new API, we no longer expose writer functions. We still expose FileContent, since writer configurations vary by content type, but we don't expose the types.
>>>>>>
>>>>>> There are two proposals for handling types for the WriteBuilders:
>>>>>>
>>>>>> 1. *Implicit Type Definition via FileContent* - the builder parameter for FileContent would implicitly define the input type for the writer returned by build(), or
>>>>>> 2. *Engine level conversion* - engines would convert PositionDelete objects to their native types.
>>>>>>
>>>>>> In code:
>>>>>>
>>>>>> - In the 1st proposal, FormatModel.writeBuilder(OutputFile outputFile) can return anything:
>>>>>>
>>>>>> WriteBuilder builder =
>>>>>>     FormatModelRegistry.writeBuilder(PARQUET, InternalRow.class, outputFile);
>>>>>> FileAppender<InternalRow> appender = builder
>>>>>>     .schema(table.schema())
>>>>>>     .content(FileContent.DATA)
>>>>>>     ....
>>>>>>     .build();
>>>>>>
>>>>>> // Exposed, but FormatModelRegistry.positionDeleteWriteBuilder should be used instead
>>>>>> WriteBuilder builder =
>>>>>>     FormatModelRegistry.writeBuilder(PARQUET, InternalRow.class, outputFile);
>>>>>> FileAppender<PositionDelete<InternalRow>> appender = builder
>>>>>>     .schema(table.schema())
>>>>>>     .content(FileContent.POSITION_DELETES)
>>>>>>     ....
>>>>>>     .build();
>>>>>>
>>>>>> - In the 2nd proposal, the FormatModel needs another method:
>>>>>>
>>>>>> Function<PositionDelete<D>, D> positionDeleteConverter(Schema schema);
>>>>>>
>>>>>> Example implementation:
>>>>>>
>>>>>> return delete -> {
>>>>>>   deleteRecord.update(0, UTF8String.fromString(delete.path().toString()));
>>>>>>   deleteRecord.update(1, delete.pos());
>>>>>>   deleteRecord.update(2, delete.row());
>>>>>>   return deleteRecord;
>>>>>> };
>>>>>>
>>>>>> // Content is only used for writer property configuration
>>>>>> WriteBuilder<InternalRow> builder = sparkFormatModel.writeBuilder(outputFile);
>>>>>> FileAppender<InternalRow> appender = builder
>>>>>>     .schema(table.schema())
>>>>>>     .content(FileContent.DATA)
>>>>>>     ....
>>>>>>     .build();
>>>>>>
>>>>>> Drawbacks:
>>>>>>
>>>>>> - Proposal 1:
>>>>>>   - Type checking for the FileAppenders occurs only at runtime, so user errors surface late (see the sketch below).
>>>>>>   - The File Format specification must clearly specify which builder type corresponds to which file content parameter - generics would offer better clarity.
>>>>>>   - Inconsistent patterns between WriteBuilder and ReadBuilder, as the latter can define output types via generics.
>>>>>> - Proposal 2:
>>>>>>   - Requires FormatModels to implement a converter method to transform PositionDelete<InternalRow> into InternalRow.
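>>>>>> To spell out the runtime-type-checking drawback of the 1st proposal (a sketch, reusing the illustrative API from the snippets above):
>>>>>>
>>>>>> // This compiles, because build() yields whatever appender type the
>>>>>> // caller asks for - but with FileContent.POSITION_DELETES the writer
>>>>>> // underneath expects PositionDelete<InternalRow>, so the mismatch
>>>>>> // only surfaces as a runtime exception on the first add() call.
>>>>>> FileAppender<InternalRow> appender =
>>>>>>     FormatModelRegistry.writeBuilder(PARQUET, InternalRow.class, outputFile)
>>>>>>         .schema(table.schema())
>>>>>>         .content(FileContent.POSITION_DELETES)
>>>>>>         .build();
>>>>>> appender.add(internalRow); // fails at runtime, not at compile time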
>>>>>> Since we deprecated writing position delete files in the V3 spec, this extra method in the 2nd proposal will be deprecated too. As a result, in the long run, we will have a nice, clean API.
>>>>>> OTOH, if we accept the compromise described in the 1st proposal, the consequences of our decision will remain in the API even after the deprecated functions are removed.
>>>>>>
>>>>>> Looking forward to your thoughts.
>>>>>> Thanks, Peter
>>>>>>
>>>>>> On Thu, Aug 14, 2025, 14:12 Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Team,
>>>>>>>
>>>>>>> During yesterday's community sync, we discussed the current state of the File Format API proposal and identified two key questions that require input from the broader community:
>>>>>>>
>>>>>>> *1. Dropping support for Position Delete files with Row Data*
>>>>>>>
>>>>>>> The current Iceberg V2 spec [1] defines two types of position delete files:
>>>>>>>
>>>>>>> - Files that store only the file name and row position.
>>>>>>> - Files that also store the deleted row data.
>>>>>>>
>>>>>>> Although this feature is defined in the spec and some tests exist in the Iceberg codebase, we're not aware of any actual implementation using the second type (with row data). Supporting V2 table writing via the new File Format API would be simpler if we dropped support for this feature. If you know of any use case or reason to retain support for position deletes with row data, please let us know.
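>>>>>>> In code terms (a sketch; the file path and the deletedRow variable are illustrative), the only difference between the two types is whether the row payload is populated:
>>>>>>>
>>>>>>> PositionDelete<Record> delete = PositionDelete.create();
>>>>>>> // Type 1: file name + row position only.
>>>>>>> delete.set("s3://bucket/data/file-a.parquet", 42L, null);
>>>>>>> // Type 2: additionally carries the deleted row itself.
>>>>>>> delete.set("s3://bucket/data/file-a.parquet", 42L, deletedRow);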
>>>>>>> *2. Deprecating Native File Format Readers/Writers in the API*
>>>>>>>
>>>>>>> The current API contains format-specific readers/writers for Parquet, Avro, and ORC. With the introduction of the InternalData and File Format APIs, Iceberg users can now write files using:
>>>>>>>
>>>>>>> - the InternalData API for metadata files (manifest, manifest list, partition stats), and
>>>>>>> - the File Format API for data and delete files.
>>>>>>>
>>>>>>> I propose we deprecate the original format-specific writers and guide users to the new APIs based on the target file type. If you're aware of any use cases that still require the original format-specific writers, please share them.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Peter
>>>>>>>
>>>>>>> [1] - Position Delete File Spec: https://iceberg.apache.org/spec/?h=delete#position-delete-files
>>>>>>>
>>>>>>> Péter Váry <peter.vary.apa...@gmail.com> wrote (on Tue, Jul 22, 2025, 16:09):
>>>>>>>
>>>>>>>> I have also put together a solution where the engine-specific format transformation is separated from the writer, and the engines need to take care of it separately.
>>>>>>>> This is somewhat complicated on the implementation side (see RowDataTransformer (https://github.com/apache/iceberg/pull/12298/files#diff-562fa4cc369c908a157f59a9235fd3f389096451e7901686fba37c87b53dee08) and InternalRowTransformer (https://github.com/apache/iceberg/pull/12298/files#diff-546f9dc30e3207d1d2bc0a2722976b55f5a04dcf85a22855e4f400500c317140)), but it simplifies the API.
>>>>>>>>
>>>>>>>> @rdblue: Please check the proposed solution. I think this is what you have suggested.
>>>>>>>>
>>>>>>>> Péter Váry <peter.vary.apa...@gmail.com> wrote (on Mon, Jun 30, 2025, 18:42):
>>>>>>>>
>>>>>>>>> During the PR review [1], we began exploring what we could use as an intermediate layer to reduce the need for engines and file formats to implement the full matrix of file format - object model conversions.
>>>>>>>>>
>>>>>>>>> To support this discussion, I've created and run a set of performance benchmarks and compiled a document outlining the potential benefits and trade-offs [2].
>>>>>>>>>
>>>>>>>>> Feedback is welcome; feel free to comment on the document, the PR, or directly in this thread.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>> [1] - PR discussion - https://github.com/apache/iceberg/pull/12774#discussion_r2093626096
>>>>>>>>> [2] - File Format and engine object model transformation performance - https://docs.google.com/document/d/1GdA8IowKMtS3QVdm8s-0X-ZRYetcHv2bhQ9mrSd3fd4
>>>>>>>>>
>>>>>>>>> Péter Váry <peter.vary.apa...@gmail.com> wrote (on Wed, May 7, 2025, 13:15):
>>>>>>>>>
>>>>>>>>>> Hi everyone,
>>>>>>>>>> The proposed API part is reviewed and ready to go. See: https://github.com/apache/iceberg/pull/12774
>>>>>>>>>> Thanks to everyone who reviewed it already!
>>>>>>>>>>
>>>>>>>>>> Many of you wanted to review, but I know that the time constraints are there for everyone. I still very much would like to hear your voices, so I will not merge the PR this week. Please review it if you can.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Peter
>>>>>>>>>>
>>>>>>>>>> Péter Váry <peter.vary.apa...@gmail.com> wrote (on Wed, Apr 16, 2025, 7:02):
>>>>>>>>>>
>>>>>>>>>>> Hi Renjie,
>>>>>>>>>>> The first one for the proposed new API is here: https://github.com/apache/iceberg/pull/12774
>>>>>>>>>>> Thanks, Peter
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Apr 16, 2025, 05:40 Renjie Liu <liurenjie2...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi, Peter:
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the effort. I totally agree with splitting them into smaller PRs to move forward.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm quite interested in this topic, so please ping me on those split PRs and I'll help review.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Apr 14, 2025 at 11:22 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Peter
>>>>>>>>>>>>>
>>>>>>>>>>>>> Awesome! Thank you so much!
>>>>>>>>>>>>> I will do a new pass.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards
>>>>>>>>>>>>> JB
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Apr 11, 2025 at 3:48 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Hi JB,
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Separated out the proposed interfaces to a new PR: https://github.com/apache/iceberg/pull/12774.
>>>>>>>>>>>>> > Reviewers can check that out if they are only interested in how the new API would look.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Thanks,
>>>>>>>>>>>>> > Peter
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Thu, Apr 10, 2025, 18:25):
>>>>>>>>>>>>> >>
>>>>>>>>>>>>> >> Hi Peter
>>>>>>>>>>>>> >>
>>>>>>>>>>>>> >> Thanks for the ping about the PR.
>>>>>>>>>>>>> >>
>>>>>>>>>>>>> >> Maybe, to facilitate the review and move forward faster, we should split the PR into smaller PRs:
>>>>>>>>>>>>> >> - one with the interfaces (ReadBuilder, AppenderBuilder, ObjectModel, AppenderBuilder, DataWriterBuilder, ...)
>>>>>>>>>>>>> >> - one for each file format provider (Parquet, Avro, ORC)
>>>>>>>>>>>>> >>
>>>>>>>>>>>>> >> Thoughts? I can help on the split if needed.
>>>>>>>>>>>>> >>
>>>>>>>>>>>>> >> Regards
>>>>>>>>>>>>> >> JB
>>>>>>>>>>>>> >>
>>>>>>>>>>>>> >> On Thu, Apr 10, 2025 at 5:16 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>> >> >
>>>>>>>>>>>>> >> > Since the 1.9.0 release candidate has been created, I would like to resurrect this PR: https://github.com/apache/iceberg/pull/12298, to ensure that we have as long a testing period as possible for it.
>>>>>>>>>>>>> >> >
>>>>>>>>>>>>> >> > To recap, here is what the PR does after the review rounds:
>>>>>>>>>>>>> >> >
>>>>>>>>>>>>> >> > Created 3 interface classes which are implemented by the file formats:
>>>>>>>>>>>>> >> > - ReadBuilder - builder for reading data from data files
>>>>>>>>>>>>> >> > - AppenderBuilder - builder for writing data to data files
>>>>>>>>>>>>> >> > - ObjectModel - provides ReadBuilders and AppenderBuilders for a specific data file format and object model pair
>>>>>>>>>>>>> >> >
>>>>>>>>>>>>> >> > Updated the Parquet, Avro, and ORC implementations for these interfaces, and deprecated the old reader/writer APIs.
>>>>>>>>>>>>> >> >
>>>>>>>>>>>>> >> > Created interface classes which will be used by the actual readers/writers of the data files:
>>>>>>>>>>>>> >> > - AppenderBuilder - builder for writing a file
>>>>>>>>>>>>> >> > - DataWriterBuilder - builder for generating a data file
>>>>>>>>>>>>> >> > - PositionDeleteWriterBuilder - builder for generating a position delete file
>>>>>>>>>>>>> >> > - EqualityDeleteWriterBuilder - builder for generating an equality delete file
>>>>>>>>>>>>> >> > - No ReadBuilder here - the file format reader builder is reused
>>>>>>>>>>>>> >> >
>>>>>>>>>>>>> >> > Created a WriterBuilder class which implements the interfaces above (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) based on a provided file format specific AppenderBuilder.
>>>>>>>>>>>>> >> >
>>>>>>>>>>>>> >> > Created an ObjectModelRegistry which stores the available ObjectModels; engines and users can request the readers (ReadBuilder) and writers (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) from it.
>>>>>>>>>>>>> >> >
>>>>>>>>>>>>> >> > Created the appropriate ObjectModels:
>>>>>>>>>>>>> >> > - GenericObjectModels - for reading and writing Iceberg Records
>>>>>>>>>>>>> >> > - SparkObjectModels - for reading (vectorized and non-vectorized) and writing Spark InternalRow/ColumnarBatch objects
>>>>>>>>>>>>> >> > - FlinkObjectModels - for reading and writing Flink RowData objects
>>>>>>>>>>>>> >> > - An Arrow object model is also registered for vectorized reads of Parquet files into Arrow ColumnarBatch objects
>>>>>>>>>>>>> >> >
>>>>>>>>>>>>> >> > Updated the production code where the reading and writing happens to use the ObjectModelRegistry and the new reader/writer interfaces to access data files.
>>>>>>>>>>>>> >> >
>>>>>>>>>>>>> >> > Kept the testing code intact to ensure that the new API/code is not breaking anything.
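>>>>>>>>>>>>> >> > As a usage sketch (illustrative only - the method names here are hypothetical simplifications, please check the PR for the exact signatures):
>>>>>>>>>>>>> >> >
>>>>>>>>>>>>> >> > // Hypothetical: engines ask the registry for a builder that pairs a
>>>>>>>>>>>>> >> > // file format with their object model, instead of calling the
>>>>>>>>>>>>> >> > // Parquet/Avro/ORC classes directly.
>>>>>>>>>>>>> >> > ReadBuilder readBuilder =
>>>>>>>>>>>>> >> >     ObjectModelRegistry.readBuilder(FileFormat.PARQUET, "spark", inputFile);
>>>>>>>>>>>>> >> > CloseableIterable<InternalRow> rows =
>>>>>>>>>>>>> >> >     readBuilder.project(requiredSchema).build();
>>>>>>>>>>>>> >> >
>>>>>>>>>>>>> >> > // The writer side follows the same pattern.
>>>>>>>>>>>>> >> > AppenderBuilder appenderBuilder =
>>>>>>>>>>>>> >> >     ObjectModelRegistry.appenderBuilder(FileFormat.PARQUET, "spark", outputFile);
>>>>>>>>>>>>> >> > FileAppender<InternalRow> appender =
>>>>>>>>>>>>> >> >     appenderBuilder.schema(table.schema()).build();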
>>>>>>>>>>>>> >> >
>>>>>>>>>>>>> >> > The original change was not small, and it grew substantially during the review rounds. So if you have questions, or if I can do anything to make the review easier, don't hesitate to ask. I am happy to do anything to move this forward.
>>>>>>>>>>>>> >> >
>>>>>>>>>>>>> >> > Thanks,
>>>>>>>>>>>>> >> > Peter
>>>>>>>>>>>>> >> >
>>>>>>>>>>>>> >> > Péter Váry <peter.vary.apa...@gmail.com> wrote (on Wed, Mar 26, 2025, 14:54):
>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>> >> >> Hi everyone,
>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>> >> >> I have updated the File Format API PR (https://github.com/apache/iceberg/pull/12298) based on the answers and review comments.
>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>> >> >> I would like to merge this only after the 1.9.0 release, so we have more time to find and solve any issues before this goes into a release for the users.
>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>> >> >> For this I have updated the deprecation comments accordingly.
>>>>>>>>>>>>> >> >> I would like to ask you to review the PR, so we iron out any possible requested changes and are ready to merge as soon as possible after the 1.9.0 release.
>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>> >> >> Thanks,
>>>>>>>>>>>>> >> >> Peter
>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>> >> >> Péter Váry <peter.vary.apa...@gmail.com> wrote (on Fri, Mar 21, 2025, 14:32):
>>>>>>>>>>>>> >> >>>
>>>>>>>>>>>>> >> >>> Hi Renjie,
>>>>>>>>>>>>> >> >>>
>>>>>>>>>>>>> >> >>> > 1. File format filters
>>>>>>>>>>>>> >> >>> >
>>>>>>>>>>>>> >> >>> > Do the filters include filter expressions from both the user query and the delete filter?
>>>>>>>>>>>>> >> >>>
>>>>>>>>>>>>> >> >>> The current discussion is about the filters from the user query.
>>>>>>>>>>>>> >> >>>
>>>>>>>>>>>>> >> >>> About the delete filter:
>>>>>>>>>>>>> >> >>> Based on the suggestions on the PR, I have moved the delete filter out of the main API. Created a `SupportsDeleteFilter` interface for it, which would allow pushing the filter down to the Parquet vectorized readers in Spark, as this is the only place where we have currently implemented this feature.
>>>>>>>>>>>>> >> >>>
>>>>>>>>>>>>> >> >>> Renjie Liu <liurenjie2...@gmail.com> wrote (on Fri, Mar 21, 2025, 14:11):
>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>> >> >>>> Hi, Peter:
>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>> >> >>>> Thanks for the effort on this.
>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>> >> >>>> 1. File format filters
>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>> >> >>>> Do the filters include filter expressions from both the user query and the delete filter?
>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>> >> >>>> For filters from the user query, I agree with you that we should keep the current behavior.
>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>> >> >>>> For delete filters associated with data files, at first I thought file format readers should not care about them. But now I realize that maybe we need to push them down to the file reader as well; this is useful when the `IS_DELETED` metadata column is not necessary and we could use these filters (position deletes, etc.) to further prune data.
>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>> >> >>>> But anyway, I agree that we could postpone it to a follow-up PR.
>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>> >> >>>> 2. Batch size configuration
>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>> >> >>>> I'm leaning toward option 2.
>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>> >> >>>> 3. Spark configuration
>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>> >> >>>> I'm leaning towards using different configuration objects.
>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>> >> >>>> On Thu, Mar 20, 2025 at 10:23 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>> >> >>>>> Hi Team,
>>>>>>>>>>>>> >> >>>>> Thanks everyone for the reviews on https://github.com/apache/iceberg/pull/12298!
>>>>>>>>>>>>> >> >>>>> I have addressed most of the comments, but a few questions remain which might merit a wider audience:
>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>> >> >>>>> 1. We should decide on the expected filtering behavior when the filters are pushed down to the readers. Currently the filters are applied as best effort by the file format readers. Some readers (Avro) just skip them altogether. There was a suggestion on the PR that we might enforce stricter requirements: the readers would either reject part of the filters, or apply them fully.
>>>>>>>>>>>>> >> >>>>> 2. Batch sizes are currently parameters on the reader builders, which can be set for non-vectorized readers too, and that could be confusing.
>>>>>>>>>>>>> >> >>>>> 3. Currently the Spark batch reader uses different configuration objects, ParquetBatchReadConf and OrcBatchReadConf, as requested by the reviewers of the Comet PR. There was a suggestion on the current PR to use a common configuration instead.
>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>> >> >>>>> I would be interested in hearing your thoughts about these topics.
>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>> >> >>>>> My current take:
>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>> >> >>>>> File format filters: I am leaning towards keeping the current lenient behavior, especially since Bloom filters are not able to do full filtering and are often used as a way to filter out unwanted records. Another option would be to implement secondary filtering inside the file formats themselves, which I think would cause extra complexity and possible code duplication. Whatever the decision here, I would suggest moving this out to a follow-up PR, as the current changeset is big enough as it is.
>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>> >> >>>>> Batch size configuration: Currently this is the only property which differs between the batch readers and the non-vectorized readers. I see 3 possible solutions:
>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>> >> >>>>> - Create different builders for vectorized and non-vectorized reads - I don't think the current solution is confusing enough to be worth the extra class.
>>>>>>>>>>>>> >> >>>>> - We could put this into the reader configuration property set (see the sketch below) - this could work, but it would "hide" a configuration option which is valid for both the Parquet and ORC readers.
>>>>>>>>>>>>> >> >>>>> - We could keep things as they are now - I would choose this one, but I don't have a strong opinion here.
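>>>>>>>>>>>>> >> >>>>> To illustrate the second option (a sketch; recordsPerBatch and the property key mirror the existing Parquet reader options, but treat the exact names as illustrative):
>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>> >> >>>>> // Today: a dedicated builder method, visible even for row-based reads.
>>>>>>>>>>>>> >> >>>>> readBuilder.recordsPerBatch(4096);
>>>>>>>>>>>>> >> >>>>> // Option 2: the batch size would travel in the generic property set,
>>>>>>>>>>>>> >> >>>>> // and only vectorized readers would look at it.
>>>>>>>>>>>>> >> >>>>> readBuilder.set("read.parquet.vectorization.batch-size", "4096");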
>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>> >> >>>>> Spark configuration: TBH, I'm open to both solutions and happy to move in the direction the community decides on.
>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>> >> >>>>> Thanks,
>>>>>>>>>>>>> >> >>>>> Peter
>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>> >> >>>>> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Fri, Mar 14, 2025, 16:31):
>>>>>>>>>>>>> >> >>>>>>
>>>>>>>>>>>>> >> >>>>>> Hi Peter
>>>>>>>>>>>>> >> >>>>>>
>>>>>>>>>>>>> >> >>>>>> Thanks for the update. I will do a new pass on the PR.
>>>>>>>>>>>>> >> >>>>>>
>>>>>>>>>>>>> >> >>>>>> Regards
>>>>>>>>>>>>> >> >>>>>> JB
>>>>>>>>>>>>> >> >>>>>>
>>>>>>>>>>>>> >> >>>>>> On Thu, Mar 13, 2025 at 1:16 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>> >> >>>>>> >
>>>>>>>>>>>>> >> >>>>>> > Hi Team,
>>>>>>>>>>>>> >> >>>>>> > I have rebased the File Format API proposal (https://github.com/apache/iceberg/pull/12298) to include the new changes needed for the Variant types. I would love to hear your feedback, especially Dan and Ryan, as you were the most active during our discussions. If I can help in any way to make the review easier, please let me know.
>>>>>>>>>>>>> >> >>>>>> > Thanks,
>>>>>>>>>>>>> >> >>>>>> > Peter
>>>>>>>>>>>>> >> >>>>>> >
>>>>>>>>>>>>> >> >>>>>> > Péter Váry <peter.vary.apa...@gmail.com> wrote (on Fri, Feb 28, 2025, 17:50):
>>>>>>>>>>>>> >> >>>>>> >>
>>>>>>>>>>>>> >> >>>>>> >> Hi everyone,
>>>>>>>>>>>>> >> >>>>>> >> Thanks for all of the actionable, relevant feedback on the PR (https://github.com/apache/iceberg/pull/12298).
>>>>>>>>>>>>> >> >>>>>> >> Updated the code to address most of it. Please check whether you agree with the general approach.
>>>>>>>>>>>>> >> >>>>>> >> If there is a consensus about the general approach, I could separate the PR into smaller pieces so we can have an easier time reviewing and merging those step-by-step.
>>>>>>>>>>>>> >> >>>>>> >> Thanks,
>>>>>>>>>>>>> >> >>>>>> >> Peter
>>>>>>>>>>>>> >> >>>>>> >>
>>>>>>>>>>>>> >> >>>>>> >> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Thu, Feb 20, 2025, 14:14):
>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>> >> >>>>>> >>> Hi Peter
>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>> >> >>>>>> >>> Sorry for the late reply on this.
>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>> >> >>>>>> >>> I did a pass on the proposal; it's very interesting and well written.
>>>>>>>>>>>>> >> >>>>>> >>> I like the DataFile API, and it is definitely worth discussing all together.
>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>> >> >>>>>> >>> Maybe we can schedule a specific meeting to discuss the DataFile API?
>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>> >> >>>>>> >>> Thoughts?
>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>> >> >>>>>> >>> Regards
>>>>>>>>>>>>> >> >>>>>> >>> JB
>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>> >> >>>>>> >>> On Tue, Feb 11, 2025 at 5:46 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>> >> >>>>>> >>> > Hi Team,
>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>> >> >>>>>> >>> > As mentioned earlier on our community sync, I am exploring the possibility of defining a FileFormat API for accessing different file formats. I have put together a proposal based on my findings.
>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>> >> >>>>>> >>> > -------------------
>>>>>>>>>>>>> >> >>>>>> >>> > Iceberg currently supports 3 different file formats: Avro, Parquet, and ORC. With the introduction of the Iceberg V3 specification, many new features are being added to Iceberg. Some of these features, like new column types and default values, require changes at the file format level. The changes are added by individual developers with a different focus on the different file formats. As a result, not all of the features are available for every supported file format.
>>>>>>>>>>>>> >> >>>>>> >>> > Also, there are emerging file formats like Vortex [1] or Lance [2] which, either by specialization or by applying newer research results, could provide better alternatives for certain use-cases, like random access to data, or storing ML models.
>>>>>>>>>>>>> >> >>>>>> >>> > -------------------
>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>> >> >>>>>> >>> > Please check the detailed proposal [3] and the Google document [4], and comment there, or reply on the dev list if you have any suggestions.
>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>> >> >>>>>> >>> > Thanks,
>>>>>>>>>>>>> >> >>>>>> >>> > Peter
>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>> >> >>>>>> >>> > [1] - https://github.com/spiraldb/vortex
>>>>>>>>>>>>> >> >>>>>> >>> > [2] - https://lancedb.github.io/lance/
>>>>>>>>>>>>> >> >>>>>> >>> > [3] - https://github.com/apache/iceberg/issues/12225
>>>>>>>>>>>>> >> >>>>>> >>> > [4] - https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds