Thanks for all the feedback! Created the PR for the API discussion: https://github.com/apache/iceberg/pull/12306
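To make the context-sharing point in the thread below concrete, here is a minimal sketch of a generic plan object that carries a single shared execution context, so the data-file and position-delete actions would not each keep their own copy. All class names and fields here are illustrative stand-ins based on the discussion, not the actual Iceberg API.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: RewriteExecutionContext holds the plan-level metadata
// (group counts per partition, totals) that every executor needs.
class RewriteExecutionContext {
  private final Map<String, Integer> groupsPerPartition;
  private final int totalGroupCount;

  RewriteExecutionContext(Map<String, Integer> groupsPerPartition) {
    this.groupsPerPartition = groupsPerPartition;
    this.totalGroupCount =
        groupsPerPartition.values().stream().mapToInt(Integer::intValue).sum();
  }

  int totalGroupCount() {
    return totalGroupCount;
  }

  int groupsInPartition(String partition) {
    return groupsPerPartition.getOrDefault(partition, 0);
  }
}

// A single generic Plan captures the context once; both the data-file and the
// position-delete actions could consume it instead of duplicating the class.
class RewritePlan<G> {
  private final RewriteExecutionContext context;
  private final List<G> groups;

  RewritePlan(RewriteExecutionContext context, List<G> groups) {
    this.context = context;
    this.groups = groups;
  }

  RewriteExecutionContext context() {
    return context;
  }

  List<G> groups() {
    return groups;
  }
}
```

With this shape, the executor receives the `Plan` and reads the shared metadata from `plan.context()`, which is the "no separate PlanInfo needed" option discussed below.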
Thanks,
Peter

Steven Wu <stevenz...@gmail.com> wrote (on Thu, Feb 13, 2025, 22:28):

> Looking at "RewriteDataFilesSparkAction" from your PR #11513, I am fine
> that the RewriteExecutionContext is captured in the `Plan` object. My
> earlier point is that we need to pass that common metadata/context to the
> executor. We don't have to define a separate `PlanInfo` for that purpose if
> it is captured in the `Plan` already.
>
> [image: image.png]
>
> My other point is that this context/plan info may not need to be
> duplicated. E.g., currently the "RewriteExecutionContext" looks duplicated
> in the "RewriteDataFilesSparkAction" and
> "RewritePositionDeleteFilesSparkAction" classes.
>
> On Fri, Feb 7, 2025 at 4:56 AM Péter Váry <peter.vary.apa...@gmail.com>
> wrote:
>
>> Hi Steven,
>>
>> Thanks for checking this out!
>>
>> Created commits which contain only the API changes:
>>
>> 1. Everything is stored on the same level as in the current API:
>> https://github.com/apache/iceberg/commit/8d612e074dcb8ee6d5ae354e329d0b78e3138c86
>> 2. Simplified the API to push everything down to the file group level:
>> https://github.com/apache/iceberg/commit/26e4704bba08a05d629c61d7eb785bb4e66ce8df
>>
>> This is the diff between the 2 options, if you would like to see it:
>> https://github.com/apache/iceberg/commit/967b0e5a9b52ee0e303578fccf1d1d538e4689d9
>>
>> I prefer the 2nd option, which has fewer classes, and consider pushing
>> down the outputSpecId to the RewriteFileGroup a fair trade for it.
>> I would even prefer to merge RewritePositionDeleteFiles#FileGroupInfo
>> and RewriteFileGroup#FileGroupInfo if the community would not find that
>> too invasive.
>>
>> If we agree on the general approach, I could create a PR for the API
>> refactor first, and we can move forward from there.
>> What do you think?
>>
>> Thanks,
>> Peter
>>
>> Steven Wu <stevenz...@gmail.com> wrote (on Thu, Feb 6, 2025, 17:50):
>>
>>> To be honest, it is hard to visualize the interface/structure discussion
>>> in this format. More details (in a doc or PR) would be helpful.
>>>
>>> Regarding "data organization", I feel we can probably have one common
>>> metadata class for all action types. We don't need both
>>> RewriteFilePlanContext and RewritePositionDeletePlanContext. It is kind
>>> of like the "TableScanContext" in the core module. It could be just a
>>> superset of all needed fields.
>>>
>>> On Wed, Feb 5, 2025 at 2:31 AM Péter Váry <peter.vary.apa...@gmail.com>
>>> wrote:
>>>
>>>> Hi Steven,
>>>>
>>>> Thanks for chiming in!
>>>>
>>>> The decision points I have collected:
>>>>
>>>> - Data organization
>>>>   1. Plan + Group
>>>>   2. Group only
>>>> - Parameter handling
>>>>   1. All strings
>>>>   2. Typed params
>>>> - Engine specific parameters for the executor
>>>>   1. Common set calculated by the planner
>>>>   2. Engine gets all user params and recalculates what is needed
>>>>
>>>> If I understand correctly, your answer for the "Data organization"
>>>> question is to use multiple objects (like PlanContext and Group).
>>>> Do you have a strong opinion about the other questions?
>>>>
>>>> I am a bit concerned about the number of different classes for this
>>>> API. Currently (before this change) we have
>>>> RewriteFileGroup/RewritePositionDeletesGroup. If we introduce
>>>> RewriteFilePlan/RewritePositionDeletePlan and also
>>>> RewriteFilePlanContext/RewritePositionDeletePlanContext, we will have
>>>> many interfaces/classes with similar implementations. This will make
>>>> the code hard to understand. If we push all of the differences down to
>>>> the group level, then we might be better off pushing the
>>>> writeMaxFileSize to the group level as well. This way we can get rid of
>>>> the PlanContext altogether, and only have a single generic Plan object.
>>>>
>>>> Thanks,
>>>> Peter
>>>>
>>>> Steven Wu <stevenz...@gmail.com> wrote (on Wed, Feb 5, 2025, 0:18):
>>>>
>>>>> At a high level, it makes sense to separate out the planning and
>>>>> execution to promote reusing the planning code across engines.
>>>>>
>>>>> Just to add a 4th class to Russell's list:
>>>>> 1) RewriteGroup: a container that holds all the files that are meant
>>>>> to be compacted, along with information about them
>>>>> 2) Rewriter: an engine-specific class which knows how to take a
>>>>> RewriteGroup and generate new files; I think this should be
>>>>> independent of the planner below
>>>>> 3) Planner: a non-engine-specific class which knows how to generate
>>>>> RewriteGroups given a set of parameters
>>>>> 4) RewritePlan: a non-engine-specific class for the planner output. It
>>>>> contains a list of RewriteGroups and a common metadata class
>>>>> `PlanContext`.
>>>>>
>>>>> The Rewriter can take in `RewriteGroup` (a list of files to be
>>>>> compacted) and `PlanContext` (common metadata).
>>>>>
>>>>> On Sat, Feb 1, 2025 at 3:06 AM Russell Spitzer
>>>>> <russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> We probably still have to support it as long as we have V2 table
>>>>>> support, right?
>>>>>>
>>>>>> On Fri, Jan 31, 2025 at 9:13 AM Péter Váry
>>>>>> <peter.vary.apa...@gmail.com> wrote:
>>>>>>
>>>>>>> We could simplify the API a bit if we omit DeleteFileRewrite.
>>>>>>> Since Anton's work on the Puffin delete vectors, this will become
>>>>>>> obsolete anyway, and focusing on data file rewriting would allow us
>>>>>>> to remove some generics from the API.
>>>>>>>
>>>>>>> WDYT?
>>>>>>>
>>>>>>> Russell Spitzer <russell.spit...@gmail.com> wrote (on Tue, Jan 21,
>>>>>>> 2025, 17:11):
>>>>>>>
>>>>>>>> To bump this back up, I think this is a pretty important change to
>>>>>>>> the core library, so it's necessary that we get more folks involved
>>>>>>>> in this discussion.
>>>>>>>>
>>>>>>>> I agree that the Rewrite Data Files code needs to be broken up and
>>>>>>>> realigned if we want to be able to reuse the code in Flink.
>>>>>>>>
>>>>>>>> I think I prefer that we essentially have three classes:
>>>>>>>> 1) RewriteGroup: a container that holds all the files that are
>>>>>>>> meant to be compacted, along with information about them
>>>>>>>> 2) Rewriter: an engine-specific class which knows how to take a
>>>>>>>> RewriteGroup and generate new files; I think this should be
>>>>>>>> independent of the planner below
>>>>>>>> 3) Planner: a non-engine-specific class which knows how to generate
>>>>>>>> RewriteGroups given a set of parameters
>>>>>>>>
>>>>>>>> On Tue, Jan 14, 2025 at 7:08 AM Péter Váry
>>>>>>>> <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Team,
>>>>>>>>>
>>>>>>>>> There is ongoing work to bring Flink Table Maintenance to Iceberg
>>>>>>>>> [1]. We already merged the main infrastructure and are currently
>>>>>>>>> working on implementing the data file rewrite [2]. During the
>>>>>>>>> implementation we found that part of the compaction planning
>>>>>>>>> implemented for Spark compaction could, and should, be reused in
>>>>>>>>> Flink as well. Created a PR [3] to bring those changes to core
>>>>>>>>> Iceberg.
>>>>>>>>>
>>>>>>>>> The main changes in the API:
>>>>>>>>>
>>>>>>>>> - We need to separate the compaction planning from the rewrite
>>>>>>>>>   execution:
>>>>>>>>>   - The planning would collect the files to be compacted and
>>>>>>>>>     organize them into compaction tasks/groups. This could be
>>>>>>>>>     reused (in the same way as the query planning).
>>>>>>>>>   - The rewrite would actually execute the rewrite. This needs to
>>>>>>>>>     contain engine-specific code, so we need separate
>>>>>>>>>     implementations for the separate engines.
>>>>>>>>> - We need to decide on the new compaction planning API.
>>>>>>>>>
>>>>>>>>> The planning currently generates the data for multiple levels:
>>>>>>>>>
>>>>>>>>> 1. Plan level
>>>>>>>>>    - Statistics about the plan:
>>>>>>>>>      - Total group count
>>>>>>>>>      - Group count in a partition
>>>>>>>>>    - Target file size
>>>>>>>>>    - Output specification id - only relevant in case of the data
>>>>>>>>>      rewrite plan
>>>>>>>>> 2. Group level
>>>>>>>>>    - General group info
>>>>>>>>>      - Global index
>>>>>>>>>      - Partition index
>>>>>>>>>      - Partition value
>>>>>>>>>    - List of tasks to read the data
>>>>>>>>>    - Split size - reader input split size when rewriting (Spark
>>>>>>>>>      specific)
>>>>>>>>>    - Number of expected output files - used to calculate shuffling
>>>>>>>>>      partition numbers (Spark specific)
>>>>>>>>>
>>>>>>>>> I see the following decision points:
>>>>>>>>>
>>>>>>>>> - Data organization:
>>>>>>>>>   1. The plan is the 'result' - everything below that is only
>>>>>>>>>      organized based on the multiplicity of the data. So if some
>>>>>>>>>      value applies to every group, then that value belongs to the
>>>>>>>>>      'global' plan variables. If a value is different for every
>>>>>>>>>      group, then that value belongs to the group (current code).
>>>>>>>>>   2. The group should contain all the information required for a
>>>>>>>>>      single job. So the job (executor) only receives a single
>>>>>>>>>      group, and every other bit of information is global. The
>>>>>>>>>      drawback is that some information is duplicated, but it is
>>>>>>>>>      cleaner on the executor side.
>>>>>>>>> - Parameter handling:
>>>>>>>>>   1. Use string maps, like we do with FileRewriter.options - this
>>>>>>>>>      allows for a more generic API which will be more stable.
>>>>>>>>>   2. Use typed, named parameters - when the API changes, the users
>>>>>>>>>      might have breaking code, but could easily spot the changes.
>>>>>>>>> - Engine specific parameter handling:
>>>>>>>>>   1. We generate a common set of parameters.
>>>>>>>>>   2. Engines get the whole compaction configuration, and can have
>>>>>>>>>      their own parameter generators.
>>>>>>>>>
>>>>>>>>> Currently I am leaning towards:
>>>>>>>>>
>>>>>>>>> - Data organization - 2 - the group should contain all the
>>>>>>>>>   information
>>>>>>>>> - Parameter handling - 2 - specific types and named parameters
>>>>>>>>> - Engine specific parameters - 1 - create a common set of
>>>>>>>>>   parameters
>>>>>>>>>
>>>>>>>>> Your thoughts?
>>>>>>>>> Thanks,
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>> [1] - https://github.com/apache/iceberg/issues/10264
>>>>>>>>> [2] - https://github.com/apache/iceberg/pull/11497
>>>>>>>>> [3] - https://github.com/apache/iceberg/pull/11513
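The planner/rewriter split proposed in the thread above can be sketched as follows. This is a toy illustration of the three-class separation (RewriteGroup / Planner / Rewriter) using typed parameters; the class names and the naive file binning are assumptions for illustration only, and the real Iceberg classes in PR #11513 differ in shape and detail.

```java
import java.util.ArrayList;
import java.util.List;

// 1) RewriteGroup: a container for the files to compact plus the per-group
//    metadata (here only the global index; a real group would also carry
//    partition info and expected output file counts).
class RewriteGroup {
  final int globalIndex;
  final List<String> files; // stand-in for the engine-agnostic scan tasks

  RewriteGroup(int globalIndex, List<String> files) {
    this.globalIndex = globalIndex;
    this.files = files;
  }
}

// 2) Planner: engine-agnostic; organizes candidate files into compaction
//    groups. The toy strategy just bins files by count; real planning looks
//    at file sizes and partition boundaries.
class RewritePlanner {
  private final int maxFilesPerGroup; // typed, named parameter (option 2)

  RewritePlanner(int maxFilesPerGroup) {
    this.maxFilesPerGroup = maxFilesPerGroup;
  }

  List<RewriteGroup> plan(List<String> candidateFiles) {
    List<RewriteGroup> groups = new ArrayList<>();
    for (int i = 0; i < candidateFiles.size(); i += maxFilesPerGroup) {
      int end = Math.min(i + maxFilesPerGroup, candidateFiles.size());
      groups.add(new RewriteGroup(groups.size(), candidateFiles.subList(i, end)));
    }
    return groups;
  }
}

// 3) Rewriter: engine-specific; each engine (Spark, Flink) supplies its own
//    implementation that turns one group into new files.
interface Rewriter {
  List<String> rewrite(RewriteGroup group);
}
```

With data organization option 2, each `RewriteGroup` carries everything a single executor job needs, so an engine's `Rewriter` can run on one group in isolation without consulting any plan-level state.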