Hi Steven,

Thanks for checking this out!
Created commits which contain only the API changes:

1. Everything is stored on the same level as in the current API: https://github.com/apache/iceberg/commit/8d612e074dcb8ee6d5ae354e329d0b78e3138c86
2. Simplified the API to push everything down to the FileGroup level: https://github.com/apache/iceberg/commit/26e4704bba08a05d629c61d7eb785bb4e66ce8df

This is the diff between the 2 options, if you would like to see it: https://github.com/apache/iceberg/commit/967b0e5a9b52ee0e303578fccf1d1d538e4689d9

I prefer the 2nd option, which has fewer classes, and I consider pushing the outputSpecId down to the RewriteFileGroup a fair trade for it. I would even prefer to merge the RewritePositionDeleteFiles#FileGroupInfo and RewriteFileGroup#FileGroupInfo if the community would not find that too invasive.

If we agree on the general approach, I could create a PR for the API refactor first, and we can move forward from there. What do you think?

Thanks,
Peter

On Thu, Feb 6, 2025 at 5:50 PM Steven Wu <stevenz...@gmail.com> wrote:

> To be honest, it is hard to visualize the interface/structure discussion in this format. More details (in a doc or PR) can be helpful.
>
> Regarding "data organization", I feel we can probably have one common metadata class for all action types. We don't need both RewriteFilePlanContext and RewritePositionDeletePlanContext. It is kind of like the "TableScanContext" in the core module. It could be just a superset of all needed fields.
>
> On Wed, Feb 5, 2025 at 2:31 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>
>> Hi Steven,
>>
>> Thanks for chiming in!
>>
>> The decision points I have collected:
>>
>> - Data organization
>>   1. Plan + Group
>>   2. Group only
>> - Parameter handling
>>   1. All strings
>>   2. Typed params
>> - Engine specific parameters for the executor
>>   1. Common set calculated by the planner
>>   2. Engine gets all user params and recalculates what is needed
>>
>> If I understand correctly, your answer to the "Data organization" question is to use multiple objects (like PlanContext and Group). Do you have a strong opinion about the other questions?
>>
>> I am a bit concerned about the number of different classes for this API. Currently (before this change) we have RewriteFileGroup/RewritePositionDeletesGroup. If we introduce RewriteFilePlan/RewritePositionDeletePlan and also RewriteFilePlanContext/RewritePositionDeletePlanContext, we will have many interfaces/classes with similar implementations. This will make the code hard to understand. If we push all of the differences down to the group level, then we might be better off pushing the writeMaxFileSize to the group level as well. This way we can get rid of the PlanContext altogether and only have a single generic Plan object.
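>>
>> To illustrate, a rough sketch of what I mean - all of the names here are placeholders, not the actual proposal:
>>
>>   import java.util.List;
>>   import java.util.Map;
>>   import org.apache.iceberg.FileScanTask;
>>   import org.apache.iceberg.StructLike;
>>
>>   // Sketch only: a single generic Plan, with the per-group values
>>   // (like writeMaxFileSize and outputSpecId) pushed down to the group.
>>   interface RewritePlan<G extends RewriteGroup> {
>>     List<G> groups();       // every group carries all the data its executor needs
>>     int totalGroupCount();  // plan-level statistics stay here
>>     Map<StructLike, Integer> groupCountByPartition();
>>   }
>>
>>   interface RewriteGroup {
>>     List<FileScanTask> tasks();  // the files to rewrite together
>>     long maxOutputFileSize();    // pushed down from the plan level
>>     int outputSpecId();          // pushed down; only used by the data file rewrite
>>   }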
>> Thanks,
>> Peter
>>
>> On Wed, Feb 5, 2025 at 12:18 AM Steven Wu <stevenz...@gmail.com> wrote:
>>
>>> At a high level, it makes sense to separate out the planning and execution to promote reusing the planning code across engines.
>>>
>>> Just to add a 4th class to Russell's list:
>>> 1) RewriteGroup: A container that holds all the files that are meant to be compacted, along with information about them
>>> 2) Rewriter: An engine specific class which knows how to take a RewriteGroup and generate new files. I think this should be independent of the planner below
>>> 3) Planner: A non-engine specific class which knows how to generate RewriteGroups given a set of parameters
>>> 4) RewritePlan: A non-engine specific class for the planner output. It contains a list of RewriteGroups and a common metadata class, `PlanContext`.
>>>
>>> Rewriter can take in `RewriteGroup` (a list of files to be compacted) and `PlanContext` (common metadata).
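>>>
>>> In code, I imagine something roughly like this - a sketch with made-up names, just to show the shape:
>>>
>>>   import java.util.List;
>>>   import java.util.Map;
>>>   import org.apache.iceberg.DataFile;
>>>   import org.apache.iceberg.FileScanTask;
>>>   import org.apache.iceberg.Table;
>>>
>>>   // 4) RewritePlan: non-engine specific planner output
>>>   interface RewritePlan {
>>>     PlanContext context();        // common metadata shared by all groups
>>>     List<RewriteGroup> groups();  // the units of work for the executors
>>>   }
>>>
>>>   // common metadata: a superset of what all action types need
>>>   interface PlanContext {
>>>     long targetFileSizeBytes();
>>>   }
>>>
>>>   // 1) RewriteGroup: the files that are compacted together
>>>   interface RewriteGroup {
>>>     List<FileScanTask> tasks();
>>>   }
>>>
>>>   // 3) Planner: non-engine specific, generates the groups from parameters
>>>   interface Planner {
>>>     RewritePlan plan(Table table, Map<String, String> options);
>>>   }
>>>
>>>   // 2) Rewriter: engine specific, independent of the planner
>>>   interface Rewriter {
>>>     List<DataFile> rewrite(RewriteGroup group, PlanContext context);
>>>   }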
>>>
>>> On Sat, Feb 1, 2025 at 3:06 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>
>>>> We probably still have to support it as long as we have V2 table support, right?
>>>>
>>>> On Fri, Jan 31, 2025 at 9:13 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>
>>>>> We could simplify the API a bit if we omit DeleteFileRewrite. Since Anton's work on the Puffin delete vectors, this will become obsolete anyway, and focusing on data file rewriting would allow us to remove some generics from the API.
>>>>>
>>>>> WDYT?
>>>>>
>>>>> On Tue, Jan 21, 2025 at 5:11 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> To bump this back up: I think this is a pretty important change to the core library, so it's necessary that we get more folks involved in this discussion.
>>>>>>
>>>>>> I agree that Rewrite Data Files needs to be broken up and realigned if we want to be able to reuse the code in Flink.
>>>>>>
>>>>>> I think I prefer that we essentially have three classes:
>>>>>> 1) RewriteGroup: A container that holds all the files that are meant to be compacted, along with information about them
>>>>>> 2) Rewriter: An engine specific class which knows how to take a RewriteGroup and generate new files. I think this should be independent of the planner below
>>>>>> 3) Planner: A non-engine specific class which knows how to generate RewriteGroups given a set of parameters
>>>>>>
>>>>>> On Tue, Jan 14, 2025 at 7:08 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Team,
>>>>>>>
>>>>>>> There is ongoing work to bring Flink Table Maintenance to Iceberg [1]. We already merged the main infrastructure and are currently working on implementing the data file rewrite [2]. During the implementation we found that part of the compaction planning implemented for Spark compaction could, and should, be reused in Flink as well. Created a PR [3] to bring those changes to core Iceberg.
>>>>>>>
>>>>>>> The main changes in the API:
>>>>>>>
>>>>>>> - We need to separate the compaction planning from the rewrite execution
>>>>>>>   - The planning would collect the files to be compacted and organize them into compaction tasks/groups. This could be reused (in the same way as the query planning)
>>>>>>>   - The rewrite would actually execute the rewrite. This needs to contain engine specific code, so we need separate implementations for the separate engines
>>>>>>> - We need to decide on the new compaction planning API
>>>>>>>
>>>>>>> The planning currently generates the data for multiple levels:
>>>>>>>
>>>>>>> 1. Plan level
>>>>>>>    - Statistics about the plan:
>>>>>>>      - Total group count
>>>>>>>      - Group count in a partition
>>>>>>>    - Target file size
>>>>>>>    - Output specification id - only relevant for the data rewrite plan
>>>>>>> 2. Group level
>>>>>>>    - General group info:
>>>>>>>      - Global index
>>>>>>>      - Partition index
>>>>>>>      - Partition value
>>>>>>>    - List of tasks to read the data
>>>>>>>    - Split size - reader input split size when rewriting (Spark specific)
>>>>>>>    - Number of expected output files - used to calculate shuffling partition numbers (Spark specific)
>>>>>>>
>>>>>>> I see the following decision points:
>>>>>>>
>>>>>>> - Data organization:
>>>>>>>   1. The plan is the 'result' - everything below it is organized only based on the multiplicity of the data. If some value applies to every group, then that value belongs to the 'global' plan variables. If a value is different for every group, then that value belongs to the group (current code)
>>>>>>>   2. The group should contain all the information required for a single job. The job (executor) only receives a single group, and every other bit of information is global. The drawback is that some information is duplicated, but it is cleaner on the executor side
>>>>>>> - Parameter handling (see the sketch after the links below):
>>>>>>>   1. Use string maps, like we do with FileRewriter.options - this allows for a more generic API, which will be more stable
>>>>>>>   2. Use typed, named parameters - when the API changes, users might have breaking code, but they can easily spot the changes
>>>>>>> - Engine specific parameter handling:
>>>>>>>   1. We generate a common set of parameters
>>>>>>>   2. Engines get the whole compaction configuration, and can have their own parameter generators
>>>>>>>
>>>>>>> Currently I am leaning towards:
>>>>>>>
>>>>>>> - Data organization - 2 - the group should contain all information
>>>>>>> - Parameter handling - 2 - specific types and named parameters
>>>>>>> - Engine specific parameters - 1 - create a common set of parameters
>>>>>>>
>>>>>>> Your thoughts?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Peter
>>>>>>>
>>>>>>> [1] - https://github.com/apache/iceberg/issues/10264
>>>>>>> [2] - https://github.com/apache/iceberg/pull/11497
>>>>>>> [3] - https://github.com/apache/iceberg/pull/11513
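>>>>>>>
>>>>>>> To make the parameter handling trade-off concrete, a small sketch - all of the names here are hypothetical, not part of the proposal:
>>>>>>>
>>>>>>>   import java.util.Map;
>>>>>>>
>>>>>>>   interface RewritePlan {}  // stand-in for the planner output
>>>>>>>
>>>>>>>   // Option 1: string map - generic and stable, but a typo in a key
>>>>>>>   // only shows up at runtime
>>>>>>>   interface StringOptionsPlanner {
>>>>>>>     RewritePlan plan(Map<String, String> options);
>>>>>>>   }
>>>>>>>
>>>>>>>   // Option 2: typed, named parameters - an API change breaks user
>>>>>>>   // code, but the compiler points at every place that needs fixing
>>>>>>>   interface TypedPlanner {
>>>>>>>     RewritePlan plan(PlanParameters parameters);
>>>>>>>   }
>>>>>>>
>>>>>>>   class PlanParameters {
>>>>>>>     private final long targetFileSizeBytes;
>>>>>>>     private final boolean rewriteAll;
>>>>>>>
>>>>>>>     PlanParameters(long targetFileSizeBytes, boolean rewriteAll) {
>>>>>>>       this.targetFileSizeBytes = targetFileSizeBytes;
>>>>>>>       this.rewriteAll = rewriteAll;
>>>>>>>     }
>>>>>>>
>>>>>>>     long targetFileSizeBytes() { return targetFileSizeBytes; }
>>>>>>>     boolean rewriteAll() { return rewriteAll; }
>>>>>>>   }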