We could simplify the API a bit if we omit DeleteFileRewrite.
Given Anton's work on Puffin delete vectors, it will become
obsolete anyway, and focusing on data file rewriting would allow us to
remove some generics from the API, as sketched below.
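
For illustration only (a rough sketch; I am recalling the current
FileRewriter shape from memory, so treat the signatures as approximate):

    import java.util.List;
    import java.util.Set;
    import org.apache.iceberg.ContentFile;
    import org.apache.iceberg.ContentScanTask;
    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.FileScanTask;

    // Today: generic so one interface covers data and delete files.
    interface FileRewriter<T extends ContentScanTask<F>, F extends ContentFile<F>> {
      Set<F> rewrite(List<T> group);
    }

    // With only data files in scope, the generics disappear.
    interface DataFileRewriter {
      Set<DataFile> rewrite(List<FileScanTask> group);
    }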

WDYT?

Russell Spitzer <russell.spit...@gmail.com> wrote (on Tue, Jan 21, 2025
at 17:11):

> To bump this back up: I think this is a pretty important change to the
> core library, so it's necessary that we get more folks involved in this
> discussion.
>
> I agree that the Rewrite Data Files code needs to be broken up and
> realigned if we want to be able to reuse it in Flink.
>
> I think I prefer that we essentially have three classes (see the sketch
> below):
>
> 1) RewriteGroup: a container that holds all the files that are meant to
> be compacted, along with information about them
> 2) Rewriter: an engine-specific class which knows how to take a
> RewriteGroup and generate new files; I think this should be independent
> of the planner below
> 3) Planner: a non-engine-specific class which knows how to generate
> RewriteGroups given a set of parameters
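>
> Something like this, as a rough sketch (the names and the use of
> Iceberg's FileScanTask are illustrative, not a final API):
>
>     import java.util.List;
>     import org.apache.iceberg.DataFile;
>     import org.apache.iceberg.FileScanTask;
>     import org.apache.iceberg.Table;
>
>     // 1) Container for one unit of compaction work.
>     interface RewriteGroup {
>       List<FileScanTask> tasks();   // the files to read and rewrite
>     }
>
>     // 2) Engine specific: Spark, Flink, etc. each implement this.
>     interface Rewriter {
>       List<DataFile> rewrite(RewriteGroup group);
>     }
>
>     // 3) Engine agnostic: builds groups from table state and parameters.
>     interface Planner {
>       List<RewriteGroup> plan(Table table);
>     }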
>
> On Tue, Jan 14, 2025 at 7:08 AM Péter Váry <peter.vary.apa...@gmail.com>
> wrote:
>
>> Hi Team,
>>
>> There is ongoing work to bring Flink Table Maintenance to Iceberg [1]. We
>> already merged the main infrastructure and are currently working on
>> implementing the data file rewrite [2]. During the implementation we
>> found that part of the compaction planning implemented for Spark could,
>> and should, be reused in Flink as well. I created a PR [3] to bring
>> those changes to core Iceberg.
>>
>> The main changes in the API:
>>
>>    - We need to separate the compaction planning from the rewrite
>>    execution (sketched below)
>>       - The planning would collect the files to be compacted and
>>       organize them into compaction tasks/groups. This could be reused
>>       (in the same way as the query planning)
>>       - The rewrite would actually execute the rewrite. This needs to
>>       contain engine-specific code, so we need separate implementations
>>       for the different engines
>>    - We need to decide on the new compaction planning API
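>>
>> As a rough illustration of the split (hypothetical names, only the
>> shape matters):
>>
>>     // Engine-agnostic planning, reusable by Spark and Flink alike.
>>     CompactionPlanner planner = new CompactionPlanner(table, options);
>>     List<RewriteGroup> groups = planner.plan();
>>
>>     // Engine-specific execution: a Flink job would run its own
>>     // rewriter over the very same groups.
>>     Rewriter rewriter = new SparkRewriter(spark);
>>     for (RewriteGroup group : groups) {
>>       rewriter.rewrite(group);
>>     }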
>>
>> The planning currently generates data at multiple levels (see the
>> sketch after this list):
>>
>>    1. Plan level
>>       - Statistics about the plan:
>>          - Total group count
>>          - Group count in a partition
>>       - Target file size
>>       - Output specification ID - only relevant for the data rewrite
>>       plan
>>    2. Group level
>>       - General group info
>>          - Global index
>>          - Partition index
>>          - Partition value
>>       - List of tasks to read the data
>>       - Split size - reader input split size when rewriting (Spark
>>       specific)
>>       - Number of expected output files - used to calculate the number
>>       of shuffle partitions (Spark specific)
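>>
>> In code this could look roughly like the following (the field names are
>> illustrative only):
>>
>>     import java.util.List;
>>     import java.util.Map;
>>     import org.apache.iceberg.FileScanTask;
>>     import org.apache.iceberg.StructLike;
>>
>>     // Plan level: values shared by, or aggregated over, all groups.
>>     record RewritePlan(
>>         int totalGroupCount,
>>         Map<StructLike, Integer> groupCountPerPartition,
>>         long targetFileSizeBytes,
>>         int outputSpecId,              // data rewrite plans only
>>         List<RewriteGroup> groups) {}
>>
>>     // Group level: one unit of work for an executor.
>>     record RewriteGroup(
>>         int globalIndex,
>>         int partitionIndex,
>>         StructLike partitionValue,
>>         List<FileScanTask> tasks,
>>         long splitSize,                // Spark specific
>>         int expectedOutputFiles) {}    // Spark specific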
>>
>> I see the following decision points:
>>
>>    - Data organization:
>>       1. The plan is the 'result' - everything below it is organized
>>       based on the multiplicity of the data. So if some value applies
>>       to every group, then that value belongs to the 'global' plan
>>       variables; if a value is different for every group, then it
>>       belongs to the group (current code)
>>       2. The group should contain all the information required for a
>>       single job. So the job (executor) only receives a single group,
>>       and every other bit of information is global. The drawback is
>>       that some information is duplicated, but it is cleaner on the
>>       executor side
>>    - Parameter handling (see the sketch below):
>>       1. Use string maps, like we do with FileRewriter.options - this
>>       allows for a more generic API which will be more stable
>>       2. Use typed, named parameters - when the API changes, users
>>       might have breaking code, but they can easily spot the changes
>>    - Engine-specific parameter handling:
>>       1. We generate a common set of parameters
>>       2. Engines get the whole compaction configuration and can have
>>       their own parameter generators
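>>
>> To make the parameter-handling trade-off concrete (PlannerConfig is a
>> made-up name):
>>
>>     // Option 1: generic string map - the API stays stable, but typos
>>     // and type errors only surface at runtime.
>>     Map<String, String> options =
>>         Map.of("target-file-size-bytes", "536870912");
>>
>>     // Option 2: typed, named parameters - API changes may break user
>>     // code, but the compiler points at every affected call site.
>>     PlannerConfig config =
>>         PlannerConfig.builder()
>>             .targetFileSizeBytes(536870912L)
>>             .build();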
>>
>> Currently I am leaning towards (sketched below):
>>
>>    - Data organization - 2 - the group should contain all the
>>    information
>>    - Parameter handling - 2 - specific types and named parameters
>>    - Engine-specific parameters - 1 - create a common set of parameters
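>>
>> Put together, this would make each group self-describing; roughly
>> (again only a sketch):
>>
>>     // Data organization (2): the group carries everything the executor
>>     // needs, even values that are duplicated from the plan level.
>>     record RewriteGroup(
>>         StructLike partitionValue,
>>         List<FileScanTask> tasks,
>>         long targetFileSizeBytes,   // duplicated from the plan
>>         int outputSpecId,           // duplicated from the plan
>>         long splitSize,
>>         int expectedOutputFiles) {}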
>>
>> Your thoughts?
>> Thanks,
>> Peter
>>
>> [1] - https://github.com/apache/iceberg/issues/10264
>> [2] - https://github.com/apache/iceberg/pull/11497
>> [3] - https://github.com/apache/iceberg/pull/11513
>>
>
