We probably still have to support it as long as we have V2 Table support, right?
On Fri, Jan 31, 2025 at 9:13 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:

> We could simplify the API a bit if we omit DeleteFileRewrite.
> Since Anton's work on the Puffin delete vectors, this will become
> obsolete anyway, and focusing on data file rewriting would allow us to
> remove some generics from the API.
>
> WDYT?
>
> On Tue, Jan 21, 2025 at 17:11, Russell Spitzer <russell.spit...@gmail.com>
> wrote:
>
>> To bump this back up: I think this is a pretty important change to the
>> core library, so it's necessary that we get more folks involved in this
>> discussion.
>>
>> I agree that Rewrite Data Files needs to be broken up and realigned
>> if we want to be able to reuse the code in Flink.
>>
>> I think I prefer that we essentially have three classes (sketched in
>> code after the quoted thread below):
>>
>> 1) RewriteGroup: a container that holds all the files that are meant
>> to be compacted, along with information about them
>> 2) Rewriter: an engine-specific class which knows how to take a
>> RewriteGroup and generate new files; I think this should be independent
>> of the planner below
>> 3) Planner: an engine-agnostic class which knows how to generate
>> RewriteGroups given a set of parameters
>>
>> On Tue, Jan 14, 2025 at 7:08 AM Péter Váry <peter.vary.apa...@gmail.com>
>> wrote:
>>
>>> Hi Team,
>>>
>>> There is ongoing work to bring Flink Table Maintenance to Iceberg [1].
>>> We have already merged the main infrastructure and are currently
>>> working on implementing the data file rewrite [2]. During the
>>> implementation we found that part of the compaction planning
>>> implemented for the Spark compaction could, and should, be reused in
>>> Flink as well. I created a PR [3] to bring those changes to core
>>> Iceberg.
>>>
>>> The main changes in the API:
>>>
>>> - We need to separate the compaction planning from the rewrite
>>>   execution.
>>>   - The planning collects the files to be compacted and organizes
>>>     them into compaction tasks/groups. This could be reused (in the
>>>     same way as the query planning).
>>>   - The rewrite actually executes the rewrite. This needs to contain
>>>     engine-specific code, so we need separate implementations for the
>>>     separate engines.
>>> - We need to decide on the new compaction planning API.
>>>
>>> The planning currently generates data at multiple levels:
>>>
>>> 1. Plan level
>>>    - Statistics about the plan:
>>>      - Total group count
>>>      - Group count in a partition
>>>    - Target file size
>>>    - Output specification ID - only relevant in the case of the data
>>>      rewrite plan
>>> 2. Group level
>>>    - General group info:
>>>      - Global index
>>>      - Partition index
>>>      - Partition value
>>>    - List of tasks to read the data
>>>    - Split size - reader input split size when rewriting (Spark
>>>      specific)
>>>    - Number of expected output files - used to calculate shuffling
>>>      partition numbers (Spark specific)
>>>
>>> I see the following decision points:
>>>
>>> - Data organization:
>>>   1. The plan is the 'result' - everything below it is organized
>>>      purely by the multiplicity of the data. If some value applies to
>>>      every group, then that value belongs to the 'global' plan
>>>      variables; if a value differs per group, then it belongs to the
>>>      group (current code).
>>>   2. The group should contain all the information required for a
>>>      single job, so the job (executor) receives only a single group,
>>>      and every other bit of information is global.
>>>      The drawback is that some information is duplicated, but it is
>>>      cleaner on the executor side.
>>> - Parameter handling:
>>>   1. Use string maps, like we do with FileRewriter.options - this
>>>      allows for a more generic API which will be more stable.
>>>   2. Use typed, named parameters - when the API changes, users might
>>>      have breaking code, but could easily spot the changes.
>>> - Engine-specific parameter handling:
>>>   1. We generate a common set of parameters.
>>>   2. Engines get the whole compaction configuration and can have
>>>      their own parameter generators.
>>>
>>> Currently I am leaning towards:
>>>
>>> - Data organization - 2 - the group should contain all the information
>>> - Parameter handling - 2 - specific types and named parameters
>>> - Engine-specific parameters - 1 - create a common set of parameters
>>>
>>> Your thoughts?
>>> Thanks,
>>> Peter
>>>
>>> [1] - https://github.com/apache/iceberg/issues/10264
>>> [2] - https://github.com/apache/iceberg/pull/11497
>>> [3] - https://github.com/apache/iceberg/pull/11513
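To make the discussion concrete, below is a minimal Java sketch of how Russell's three classes could fit together with Péter's leanings (data organization option 2, typed parameters, common parameter set). Every name and signature here is hypothetical, invented for illustration; the actual API is the one under review in PR [3].

// Hypothetical sketch only: names and signatures are illustrative and are
// NOT the API from PR [3]. It combines the Planner / RewriteGroup / Rewriter
// split with "data organization" option 2 and typed, named parameters.

import java.util.List;
import java.util.Map;

import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.StructLike;
import org.apache.iceberg.Table;

/** Engine-agnostic: organizes candidate files into rewrite groups. */
interface RewritePlanner {
  RewritePlan plan(Table table, Map<String, String> options);
}

/** Plan-level result: global statistics plus the per-job groups. */
interface RewritePlan {
  int totalGroupCount();

  List<RewriteGroup> groups();
}

/**
 * Group-level container (option 2): everything a single executor job needs
 * travels with the group, even when values repeat across groups.
 */
interface RewriteGroup {
  int globalIndex();

  int partitionIndex();

  StructLike partitionValue();

  /** Scan tasks describing the files to read for this group. */
  List<FileScanTask> fileScanTasks();

  /** Duplicated into every group so a job needs nothing from the plan. */
  long targetFileSizeBytes();

  /** Typed, named parameters (option 2), including Spark-specific hints. */
  long splitSize();

  int expectedOutputFileCount();
}

/** Engine-specific: Spark and Flink each implement their own rewriter. */
interface GroupRewriter {
  /** Rewrites one group's files; returns the locations of the new files. */
  List<String> rewrite(RewriteGroup group);
}

Under this shape, a Flink or Spark executor receives a single self-contained RewriteGroup and never touches the plan; the trade-off, as noted in the thread, is that values such as the target file size are duplicated into every group.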