Thanks for all the feedback! Created the PR for the API discussion: https://github.com/apache/iceberg/pull/12306
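To make the context-sharing point in the thread below concrete, here is a minimal sketch of a generic plan object that carries a single shared execution context, so the data-file and position-delete actions would not each keep their own copy. All class names and fields here are illustrative stand-ins based on the discussion, not the actual Iceberg API.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: RewriteExecutionContext holds the plan-level metadata
// (group counts per partition, totals) that every executor needs.
class RewriteExecutionContext {
  private final Map<String, Integer> groupsPerPartition;
  private final int totalGroupCount;

  RewriteExecutionContext(Map<String, Integer> groupsPerPartition) {
    this.groupsPerPartition = groupsPerPartition;
    this.totalGroupCount =
        groupsPerPartition.values().stream().mapToInt(Integer::intValue).sum();
  }

  int totalGroupCount() {
    return totalGroupCount;
  }

  int groupsInPartition(String partition) {
    return groupsPerPartition.getOrDefault(partition, 0);
  }
}

// A single generic Plan captures the context once; both the data-file and the
// position-delete actions could consume it instead of duplicating the class.
class RewritePlan<G> {
  private final RewriteExecutionContext context;
  private final List<G> groups;

  RewritePlan(RewriteExecutionContext context, List<G> groups) {
    this.context = context;
    this.groups = groups;
  }

  RewriteExecutionContext context() {
    return context;
  }

  List<G> groups() {
    return groups;
  }
}
```

With this shape, the executor receives the `Plan` and reads the shared metadata from `plan.context()`, which is the "no separate PlanInfo needed" option discussed below.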
Thanks,
Peter

Steven Wu <stevenz...@gmail.com> wrote (on Thu, Feb 13, 2025, 22:28):

> Looking at "RewriteDataFilesSparkAction" from your PR #11513, I am fine
> that the RewriteExecutionContext is captured in the `Plan` object. My
> earlier point is that we need to pass that common metadata/context to the
> executor. We don't have to define a separate `PlanInfo` for that purpose if
> it is captured in the `Plan` already.
>
> [image: image.png]
>
> My other point is that this context/plan info may not need to be
> duplicated. E.g., currently the "RewriteExecutionContext" looks duplicated
> in the "RewriteDataFilesSparkAction" and
> "RewritePositionDeleteFilesSparkAction" classes.
>
> On Fri, Feb 7, 2025 at 4:56 AM Péter Váry <peter.vary.apa...@gmail.com>
> wrote:
>
>> Hi Steven,
>>
>> Thanks for checking this out!
>>
>> Created commits which contain only the API changes:
>>
>> 1. Everything is stored on the same level as in the current API:
>> https://github.com/apache/iceberg/commit/8d612e074dcb8ee6d5ae354e329d0b78e3138c86
>> 2. Simplified the API to push everything down to the file group level:
>> https://github.com/apache/iceberg/commit/26e4704bba08a05d629c61d7eb785bb4e66ce8df
>>
>> This is the diff between the 2 options, if you would like to see it:
>> https://github.com/apache/iceberg/commit/967b0e5a9b52ee0e303578fccf1d1d538e4689d9
>>
>> I prefer the 2nd option, which has fewer classes, and consider pushing
>> down the outputSpecId to the RewriteFileGroup a fair trade for it.
>> I would even prefer to merge RewritePositionDeleteFiles#FileGroupInfo
>> and RewriteFileGroup#FileGroupInfo if the community would not find that
>> too invasive.
>>
>> If we agree on the general approach, I could create a PR for the API
>> refactor first, and we can move forward from there.
>> What do you think?
>>
>> Thanks,
>> Peter
>>
>> Steven Wu <stevenz...@gmail.com> wrote (on Thu, Feb 6, 2025, 17:50):
>>
>>> To be honest, it is hard to visualize the interface/structure discussion
>>> in this format. More details (in a doc or PR) would be helpful.
>>>
>>> Regarding "data organization", I feel we can probably have one common
>>> metadata class for all action types. We don't need both
>>> RewriteFilePlanContext and RewritePositionDeletePlanContext. It is kind
>>> of like the "TableScanContext" in the core module. It could be just a
>>> superset of all needed fields.
>>>
>>> On Wed, Feb 5, 2025 at 2:31 AM Péter Váry <peter.vary.apa...@gmail.com>
>>> wrote:
>>>
>>>> Hi Steven,
>>>>
>>>> Thanks for chiming in!
>>>>
>>>> The decision points I have collected:
>>>>
>>>> - Data organization
>>>>   1. Plan + Group
>>>>   2. Group only
>>>> - Parameter handling
>>>>   1. All strings
>>>>   2. Typed params
>>>> - Engine specific parameters for the executor
>>>>   1. Common set calculated by the planner
>>>>   2. Engine gets all user params and recalculates what is needed
>>>>
>>>> If I understand correctly, your answer for the "Data organization"
>>>> question is to use multiple objects (like PlanContext and Group).
>>>> Do you have a strong opinion about the other questions?
>>>>
>>>> I am a bit concerned about the number of different classes for this
>>>> API. Currently (before this change) we have
>>>> RewriteFileGroup/RewritePositionDeletesGroup. If we introduce
>>>> RewriteFilePlan/RewritePositionDeletePlan and also
>>>> RewriteFilePlanContext/RewritePositionDeletePlanContext, we will have
>>>> many interfaces/classes with similar implementations. This will make
>>>> the code hard to understand. If we push all of the differences down to
>>>> the group level, then we might be better off pushing the
>>>> writeMaxFileSize to the group level as well. This way we can get rid of
>>>> the PlanContext altogether, and only have a single generic Plan object.
>>>>
>>>> Thanks,
>>>> Peter
>>>>
>>>> Steven Wu <stevenz...@gmail.com> wrote (on Wed, Feb 5, 2025, 0:18):
>>>>
>>>>> At a high level, it makes sense to separate out the planning and
>>>>> execution to promote reusing the planning code across engines.
>>>>>
>>>>> Just to add a 4th class to Russell's list:
>>>>> 1) RewriteGroup: a container that holds all the files that are meant
>>>>> to be compacted, along with information about them
>>>>> 2) Rewriter: an engine-specific class which knows how to take a
>>>>> RewriteGroup and generate new files; I think this should be
>>>>> independent of the planner below
>>>>> 3) Planner: a non-engine-specific class which knows how to generate
>>>>> RewriteGroups given a set of parameters
>>>>> 4) RewritePlan: a non-engine-specific class for the planner output. It
>>>>> contains a list of RewriteGroups and a common metadata class
>>>>> `PlanContext`.
>>>>>
>>>>> The Rewriter can take in `RewriteGroup` (a list of files to be
>>>>> compacted) and `PlanContext` (common metadata).
>>>>>
>>>>> On Sat, Feb 1, 2025 at 3:06 AM Russell Spitzer
>>>>> <russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> We probably still have to support it as long as we have V2 table
>>>>>> support, right?
>>>>>>
>>>>>> On Fri, Jan 31, 2025 at 9:13 AM Péter Váry
>>>>>> <peter.vary.apa...@gmail.com> wrote:
>>>>>>
>>>>>>> We could simplify the API a bit if we omit DeleteFileRewrite.
>>>>>>> Since Anton's work on the Puffin delete vectors, this will become
>>>>>>> obsolete anyway, and focusing on data file rewriting would allow us
>>>>>>> to remove some generics from the API.
>>>>>>>
>>>>>>> WDYT?
>>>>>>>
>>>>>>> Russell Spitzer <russell.spit...@gmail.com> wrote (on Tue, Jan 21,
>>>>>>> 2025, 17:11):
>>>>>>>
>>>>>>>> To bump this back up, I think this is a pretty important change to
>>>>>>>> the core library, so it's necessary that we get more folks involved
>>>>>>>> in this discussion.
>>>>>>>>
>>>>>>>> I agree that the Rewrite Data Files code needs to be broken up and
>>>>>>>> realigned if we want to be able to reuse the code in Flink.
>>>>>>>>
>>>>>>>> I think I prefer that we essentially have three classes:
>>>>>>>> 1) RewriteGroup: a container that holds all the files that are
>>>>>>>> meant to be compacted, along with information about them
>>>>>>>> 2) Rewriter: an engine-specific class which knows how to take a
>>>>>>>> RewriteGroup and generate new files; I think this should be
>>>>>>>> independent of the planner below
>>>>>>>> 3) Planner: a non-engine-specific class which knows how to generate
>>>>>>>> RewriteGroups given a set of parameters
>>>>>>>>
>>>>>>>> On Tue, Jan 14, 2025 at 7:08 AM Péter Váry
>>>>>>>> <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Team,
>>>>>>>>>
>>>>>>>>> There is ongoing work to bring Flink Table Maintenance to Iceberg
>>>>>>>>> [1]. We already merged the main infrastructure and are currently
>>>>>>>>> working on implementing the data file rewrite [2]. During the
>>>>>>>>> implementation we found that part of the compaction planning
>>>>>>>>> implemented for Spark compaction could, and should, be reused in
>>>>>>>>> Flink as well. Created a PR [3] to bring those changes to core
>>>>>>>>> Iceberg.
>>>>>>>>>
>>>>>>>>> The main changes in the API:
>>>>>>>>>
>>>>>>>>> - We need to separate the compaction planning from the rewrite
>>>>>>>>>   execution:
>>>>>>>>>   - The planning would collect the files to be compacted and
>>>>>>>>>     organize them into compaction tasks/groups. This could be
>>>>>>>>>     reused (in the same way as the query planning).
>>>>>>>>>   - The rewrite would actually execute the rewrite. This needs to
>>>>>>>>>     contain engine-specific code, so we need separate
>>>>>>>>>     implementations for the separate engines.
>>>>>>>>> - We need to decide on the new compaction planning API.
>>>>>>>>>
>>>>>>>>> The planning currently generates the data for multiple levels:
>>>>>>>>>
>>>>>>>>> 1. Plan level
>>>>>>>>>    - Statistics about the plan:
>>>>>>>>>      - Total group count
>>>>>>>>>      - Group count in a partition
>>>>>>>>>    - Target file size
>>>>>>>>>    - Output specification id - only relevant in case of the data
>>>>>>>>>      rewrite plan
>>>>>>>>> 2. Group level
>>>>>>>>>    - General group info
>>>>>>>>>      - Global index
>>>>>>>>>      - Partition index
>>>>>>>>>      - Partition value
>>>>>>>>>    - List of tasks to read the data
>>>>>>>>>    - Split size - reader input split size when rewriting (Spark
>>>>>>>>>      specific)
>>>>>>>>>    - Number of expected output files - used to calculate shuffling
>>>>>>>>>      partition numbers (Spark specific)
>>>>>>>>>
>>>>>>>>> I see the following decision points:
>>>>>>>>>
>>>>>>>>> - Data organization:
>>>>>>>>>   1. The plan is the 'result' - everything below that is only
>>>>>>>>>      organized based on the multiplicity of the data. So if some
>>>>>>>>>      value applies to every group, then that value belongs to the
>>>>>>>>>      'global' plan variables. If a value is different for every
>>>>>>>>>      group, then that value belongs to the group (current code).
>>>>>>>>>   2. The group should contain all the information required for a
>>>>>>>>>      single job. So the job (executor) only receives a single
>>>>>>>>>      group, and every other bit of information is global. The
>>>>>>>>>      drawback is that some information is duplicated, but it is
>>>>>>>>>      cleaner on the executor side.
>>>>>>>>> - Parameter handling:
>>>>>>>>>   1. Use string maps, like we do with FileRewriter.options - this
>>>>>>>>>      allows for a more generic API which will be more stable.
>>>>>>>>>   2. Use typed, named parameters - when the API changes, the users
>>>>>>>>>      might have breaking code, but could easily spot the changes.
>>>>>>>>> - Engine specific parameter handling:
>>>>>>>>>   1. We generate a common set of parameters.
>>>>>>>>>   2. Engines get the whole compaction configuration, and can have
>>>>>>>>>      their own parameter generators.
>>>>>>>>>
>>>>>>>>> Currently I am leaning towards:
>>>>>>>>>
>>>>>>>>> - Data organization - 2 - the group should contain all the
>>>>>>>>>   information
>>>>>>>>> - Parameter handling - 2 - specific types and named parameters
>>>>>>>>> - Engine specific parameters - 1 - create a common set of
>>>>>>>>>   parameters
>>>>>>>>>
>>>>>>>>> Your thoughts?
>>>>>>>>> Thanks,
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>> [1] - https://github.com/apache/iceberg/issues/10264
>>>>>>>>> [2] - https://github.com/apache/iceberg/pull/11497
>>>>>>>>> [3] - https://github.com/apache/iceberg/pull/11513
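The planner/rewriter split proposed in the thread above can be sketched as follows. This is a toy illustration of the three-class separation (RewriteGroup / Planner / Rewriter) using typed parameters; the class names and the naive file binning are assumptions for illustration only, and the real Iceberg classes in PR #11513 differ in shape and detail.

```java
import java.util.ArrayList;
import java.util.List;

// 1) RewriteGroup: a container for the files to compact plus the per-group
//    metadata (here only the global index; a real group would also carry
//    partition info and expected output file counts).
class RewriteGroup {
  final int globalIndex;
  final List<String> files; // stand-in for the engine-agnostic scan tasks

  RewriteGroup(int globalIndex, List<String> files) {
    this.globalIndex = globalIndex;
    this.files = files;
  }
}

// 2) Planner: engine-agnostic; organizes candidate files into compaction
//    groups. The toy strategy just bins files by count; real planning looks
//    at file sizes and partition boundaries.
class RewritePlanner {
  private final int maxFilesPerGroup; // typed, named parameter (option 2)

  RewritePlanner(int maxFilesPerGroup) {
    this.maxFilesPerGroup = maxFilesPerGroup;
  }

  List<RewriteGroup> plan(List<String> candidateFiles) {
    List<RewriteGroup> groups = new ArrayList<>();
    for (int i = 0; i < candidateFiles.size(); i += maxFilesPerGroup) {
      int end = Math.min(i + maxFilesPerGroup, candidateFiles.size());
      groups.add(new RewriteGroup(groups.size(), candidateFiles.subList(i, end)));
    }
    return groups;
  }
}

// 3) Rewriter: engine-specific; each engine (Spark, Flink) supplies its own
//    implementation that turns one group into new files.
interface Rewriter {
  List<String> rewrite(RewriteGroup group);
}
```

With data organization option 2, each `RewriteGroup` carries everything a single executor job needs, so an engine's `Rewriter` can run on one group in isolation without consulting any plan-level state.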