Hi Steven,

Thanks for checking this out!

I created two commits which contain only the API changes:

   1. Everything is stored on the same level as in the current API:
   
https://github.com/apache/iceberg/commit/8d612e074dcb8ee6d5ae354e329d0b78e3138c86
   2. Simplified the API to push everything down to the file group level:
   
https://github.com/apache/iceberg/commit/26e4704bba08a05d629c61d7eb785bb4e66ce8df

This is the diff between the 2 options, in case you would like to see it:
https://github.com/apache/iceberg/commit/967b0e5a9b52ee0e303578fccf1d1d538e4689d9

I prefer the 2nd option: it has fewer classes, and I consider pushing the
outputSpecId down to the RewriteFileGroup a fair trade for that.
I would even prefer to merge RewritePositionDeleteFiles#FileGroupInfo
and RewriteFileGroup#FileGroupInfo, if the community would not find that
too invasive.
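
To make the merge concrete, here is a rough sketch of what a shared info
class could look like (all names and fields below are illustrative, not
the actual commit contents):

// one FileGroupInfo shared by data and position delete rewrites
public interface FileGroupInfo {
  int globalIndex();      // index of the group across the whole plan
  int partitionIndex();   // index of the group within its partition
  StructLike partition(); // partition value the group belongs to
}

// the group carries the values pushed down from the plan level
public interface RewriteGroup<T extends ScanTask> {
  FileGroupInfo info();
  List<T> tasks();        // tasks reading the files being rewritten
  long maxOutputFileSize();
  int outputSpecId();     // pushed down from the plan level
}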

If we agree on the general approach, I could create a PR for the API
refactor first, and we can move forward from there.
What do you think?

Thanks,
Peter

On Thu, Feb 6, 2025 at 5:50 PM Steven Wu <stevenz...@gmail.com> wrote:

> To be honest, it is hard to visualize the interface/structure discussion
> in this format. More details (in a doc or PR) can be helpful.
>
> Regarding "data organization", I feel we can probably use one common
> metadata class for all action types. We don't need both
> RewriteFilePlanContext and RewritePositionDeletePlanContext. It is kind of
> like the "TableScanContext" in the core module. It could be just a
> superset of all needed fields.
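>
> As a minimal sketch of this superset idea (the field list here is just an
> illustration, not a final proposal):
>
> // one plan context shared by data file and position delete rewrites
> public interface RewritePlanContext {
>   long targetFileSize();
>   int outputSpecId();  // only meaningful for data file rewrites
>   int totalGroupCount();
>   Map<StructLike, Integer> groupCountPerPartition();
> }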
>
>
> On Wed, Feb 5, 2025 at 2:31 AM Péter Váry <peter.vary.apa...@gmail.com>
> wrote:
>
>> Hi Steven,
>>
>> Thanks for chiming in!
>>
>> The decision points I have collected:
>>
>>    - Data organization
>>       1. Plan + Group
>>       2. Group only
>>    - Parameter handling (see the sketch after this list)
>>       1. All strings
>>       2. Typed params
>>    - Engine specific parameters for the executor
>>       1. Common set calculated by the planner
>>       2. Engine gets all user params and recalculates what is needed
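>>
>> To make the "Parameter handling" options concrete, a quick sketch (the
>> calls below are made up for illustration):
>>
>> // Option 1: all strings, like FileRewriter.options
>> planner.init(ImmutableMap.of(
>>     "rewrite-all", "true",
>>     "target-file-size-bytes", "134217728"));
>>
>> // Option 2: typed, named parameters
>> planner.rewriteAll(true);
>> planner.targetFileSizeBytes(134217728L);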
>>
>> If I understand correctly, your answer to the "Data organization"
>> question is to use multiple objects (like PlanContext and Group).
>> Do you have a strong opinion about the other questions?
>>
>> I am a bit concerned about the number of different classes for this API.
>> Currently (before this change) we
>> have RewriteFileGroup/RewritePositionDeletesGroup. If we
>> introduce RewriteFilePlan/RewritePositionDeletePlan and also
>> RewriteFilePlanContext/RewritePositionDeletePlanContext, we will have many
>> interfaces/classes with similar implementations. This will make it hard to
>> understand the code. If we push all of the differences down to the group
>> level, then we might be better off pushing the writeMaxFileSize to the
>> group level as well. This way we can get rid of the PlanContext
>> altogether, and only have a single generic Plan object.
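>>
>> As a sketch, the single generic Plan object could then be as simple as
>> (names illustrative):
>>
>> public interface RewritePlan<G extends RewriteGroup> {
>>   // everything else (target file size, spec id, indexes) lives in the groups
>>   CloseableIterable<G> groups();
>> }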
>>
>> Thanks,
>> Peter
>>
>> On Wed, Feb 5, 2025 at 12:18 AM Steven Wu <stevenz...@gmail.com> wrote:
>>
>>> At a high level, it makes sense to separate out the planning and
>>> execution to promote reusing the planning code across engines.
>>>
>>> Just to add a 4th class to Russell's list:
>>> 1) RewriteGroup: A Container that holds all the files that are meant to
>>> be compacted along with information about them
>>> 2) Rewriter: An engine specific class which knows how to take a
>>> RewriteGroup and generate new files; I think this should be independent of
>>> the planner below
>>> 3) Planner: A Non-Engine specific class which knows how to generate
>>> RewriteGroups given a set of parameters
>>> 4) RewritePlan: A Non-Engine specific class for the planner output. It
>>> contains a list of RewriteGroups and a common metadata class, `PlanContext`.
>>>
>>> Rewriter can take in `RewriteGroup` (a list of files to be compacted)
>>> and `PlanContext` (common metadata).
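>>>
>>> In other words, something like this (the signature is only a sketch):
>>>
>>> interface Rewriter<T extends ScanTask, F extends ContentFile<F>> {
>>>   // engine specific: reads the group's tasks and writes the new files
>>>   Set<F> rewrite(RewriteGroup<T> group, PlanContext context);
>>> }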
>>>
>>>
>>> On Sat, Feb 1, 2025 at 3:06 AM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
>>>> We probably still have to support it as long as we have V2 table
>>>> support, right?
>>>>
>>>> On Fri, Jan 31, 2025 at 9:13 AM Péter Váry <peter.vary.apa...@gmail.com>
>>>> wrote:
>>>>
>>>>> We could simplify the API a bit if we omit DeleteFileRewrite.
>>>>> Given Anton's work on the Puffin delete vectors, this will become
>>>>> obsolete anyway, and focusing on data file rewriting would allow us to
>>>>> remove some generics from the API.
>>>>>
>>>>> WDYT?
>>>>>
>>>>> On Tue, Jan 21, 2025 at 5:11 PM Russell Spitzer <russell.spit...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> To bump this back up: I think this is a pretty important change to
>>>>>> the core library, so it's necessary that we get more folks involved in
>>>>>> this discussion.
>>>>>>
>>>>>> I agree that Rewrite Data Files needs to be broken up and realigned
>>>>>> if we want to be able to reuse the code in Flink.
>>>>>>
>>>>>> I think I prefer that we essentially have three classes:
>>>>>> 1) RewriteGroup: A Container that holds all the files that are meant
>>>>>> to be compacted along with information about them
>>>>>> 2) Rewriter: An engine specific class which knows how to take a
>>>>>> RewriteGroup and generate new files; I think this should be independent
>>>>>> of the planner below
>>>>>> 3) Planner: A Non-Engine specific class which knows how to generate
>>>>>> RewriteGroups given a set of parameters
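>>>>>>
>>>>>> For example, the planner could look roughly like this (names are
>>>>>> illustrative, not a concrete proposal):
>>>>>>
>>>>>> // lives in core, no engine specific dependencies
>>>>>> interface Planner {
>>>>>>   void init(Map<String, String> options);  // user parameters
>>>>>>   List<RewriteGroup> plan(Table table);    // groups the files to compact
>>>>>> }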
>>>>>>
>>>>>> On Tue, Jan 14, 2025 at 7:08 AM Péter Váry <
>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Team,
>>>>>>>
>>>>>>> There is ongoing work to bring Flink Table Maintenance to Iceberg
>>>>>>> [1]. We already merged the main infrastructure and are currently working
>>>>>>> on implementing the data file rewrite [2]. During the implementation we
>>>>>>> found that part of the compaction planning implemented for Spark could,
>>>>>>> and should, be reused in Flink as well. I created a PR [3] to bring those
>>>>>>> changes to core Iceberg.
>>>>>>>
>>>>>>> The main changes in the API:
>>>>>>>
>>>>>>>    - We need to separate the compaction planning from the rewrite
>>>>>>>    execution
>>>>>>>       - The planning would collect the files to be compacted and
>>>>>>>       organize them into compaction tasks/groups. This could be reused
>>>>>>>       (in the same way as the query planning)
>>>>>>>       - The rewrite would actually execute the rewrite. This needs to
>>>>>>>       contain engine specific code, so we need separate implementations
>>>>>>>       for the different engines
>>>>>>>    - We need to decide on the new compaction planning API
>>>>>>>
>>>>>>> The planning currently generates data on multiple levels (sketched in
>>>>>>> code after this list):
>>>>>>>
>>>>>>>    1. Plan level
>>>>>>>       - Statistics about the plan:
>>>>>>>          - Total group count
>>>>>>>          - Group count in a partition
>>>>>>>       - Target file size
>>>>>>>       - Output specification ID - only relevant for the data
>>>>>>>       rewrite plan
>>>>>>>    2. Group level
>>>>>>>       - General group info
>>>>>>>          - Global index
>>>>>>>          - Partition index
>>>>>>>          - Partition value
>>>>>>>       - List of tasks to read the data
>>>>>>>       - Split size - reader input split size when rewriting (Spark
>>>>>>>       specific)
>>>>>>>       - Number of expected output files - used to calculate
>>>>>>>       shuffling partition numbers (Spark specific)
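>>>>>>>
>>>>>>> In code form, the two levels above look roughly like this (field
>>>>>>> names are illustrative):
>>>>>>>
>>>>>>> class Plan {
>>>>>>>   int totalGroupCount;
>>>>>>>   Map<StructLike, Integer> groupCountPerPartition;
>>>>>>>   long targetFileSize;
>>>>>>>   int outputSpecId;            // data rewrite plan only
>>>>>>>   List<Group> groups;
>>>>>>> }
>>>>>>>
>>>>>>> class Group {
>>>>>>>   int globalIndex;
>>>>>>>   int partitionIndex;
>>>>>>>   StructLike partitionValue;
>>>>>>>   List<FileScanTask> tasks;
>>>>>>>   long splitSize;              // Spark specific
>>>>>>>   int expectedOutputFiles;     // Spark specific
>>>>>>> }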
>>>>>>>
>>>>>>> I see the following decision points:
>>>>>>>
>>>>>>>    - Data organization:
>>>>>>>       1. Plan is the 'result' - everything below that is only
>>>>>>>       organized based on the multiplicity of the data. So if some value
>>>>>>>       applies to every group, then that value belongs to the 'global'
>>>>>>>       plan variables. If a value is different for every group, then that
>>>>>>>       value belongs to the group (current code)
>>>>>>>       2. The group should contain all the information which is
>>>>>>>       required for a single job. So the job (executor) only receives a
>>>>>>>       single group, and every other bit of information is global. The
>>>>>>>       drawback is that some information is duplicated, but it is cleaner
>>>>>>>       on the executor side.
>>>>>>>    - Parameter handling:
>>>>>>>       1. Use string maps, like we do with FileRewriter.options -
>>>>>>>       this allows for a more generic API, which will be more stable
>>>>>>>       2. Use typed, named parameters - when the API changes, users
>>>>>>>       might have breaking code, but could easily spot the changes
>>>>>>>    - Engine specific parameter handling:
>>>>>>>       1. We generate a common set of parameters
>>>>>>>       2. Engines get the whole compaction configuration, and can
>>>>>>>       have their own parameter generators
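>>>>>>>
>>>>>>> For the engine specific parameter question, the two options would
>>>>>>> look roughly like this (the names below are made up):
>>>>>>>
>>>>>>> // Option 1: the planner computes a common set of parameters once
>>>>>>> ExecutionContext ctx = planner.executionContext();
>>>>>>> rewriter.rewrite(group, ctx);
>>>>>>>
>>>>>>> // Option 2: the engine gets the raw options and derives what it needs
>>>>>>> rewriter.init(userOptions);
>>>>>>> rewriter.rewrite(group);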
>>>>>>>
>>>>>>> Currently I am leaning towards:
>>>>>>>
>>>>>>>    - Data organization - 2 - the group should contain all the information
>>>>>>>    - Parameter handling - 2 - specific types and named parameters
>>>>>>>    - Engine specific parameters - 1 - create a common set of
>>>>>>>    parameters
>>>>>>>
>>>>>>> Your thoughts?
>>>>>>> Thanks,
>>>>>>> Peter
>>>>>>>
>>>>>>> [1] - https://github.com/apache/iceberg/issues/10264
>>>>>>> [2] - https://github.com/apache/iceberg/pull/11497
>>>>>>> [3] - https://github.com/apache/iceberg/pull/11513
>>>>>>>
>>>>>>
