Re: [DISCUSS] CORE: Adding support for repartitioning old partition spec data files

Mukund Thakur Wed, 10 Jun 2026 12:19:49 -0700

Could someone please take a look at this and provide some feedback? Thanks.
I am also working on an optimized repartitioning algorithm in PR #16515.


On Wed, May 13, 2026 at 1:40 PM Mukund Thakur <[email protected]> wrote:

> Hi Everyone,
> I would like to add support for repartitioning old partition spec data
> files, as described in detail below. Please take a look at my PR
> https://github.com/apache/iceberg/pull/16190.
>
> Improvement
>
> Problem:
> How to efficiently and reliably repartition only the data files belonging
> to the old partition specification so they conform to the new partition
> specification, without unnecessarily rewriting or impacting data files
> already written using the new spec?
>
> Example:
> Suppose we have evolved the table's partition specification by adding a
> new partition field, day, on top of an existing field, month. After a few
> months, we want to repartition all the old month-partitioned data files so
> that they are also partitioned by day. Currently, if those files are already
> of the desired size, the rewrite job does not pick them up, so they remain
> partitioned by the old spec.
>
> Explored solution using existing code and feature flags:
>
> Because our use case is to rewrite data files from the old partition spec
> into the new spec, we have to use rewrite-all=true. Otherwise the rewrite
> job skips files that are already of the desired size (for example, 512 MB
> by default) and groups that contain only one file, even though those files
> still need to be rewritten to the new spec.
> Based on a suggestion by @pvary <https://github.com/pvary> on an old PR,
> #12083 (comment)
> <https://github.com/apache/iceberg/pull/12083#issuecomment-2751808447>, and
> looking at the current code, I thought we could combine rewrite-all=true
> with a filter on some column value, for example a timestamp column
> (month <= 2025-06), so that only the old data files are rewritten. To
> rewrite a huge number of data files efficiently, we also have to use
> partial-progress.enabled and partial-progress.max-commits, so that if the
> job fails halfway we don't need to start from scratch.
>
> Why won't this work?
>
> Suppose there are many files to rewrite and the job fails halfway. When we
> rerun with the same filter, it picks up the same files again, even if, say,
> 50% of them were already rewritten successfully. We could refine the filter
> after every iteration so that it matches only the old files, but that puts
> a lot of work on the end user, because currently we can't filter data files
> by spec ID.
>
> Suggested code change
>
> For the reasons above, I suggest enabling this use case with a new flag,
> rewrite-partition-spec-mismatch, used together with partial-progress.enabled.
>
> Happy to try out any other suggestion for achieving the use case.
>
>
> Thanks,
>
> Mukund
>
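
The difference between the two selection strategies discussed in the quoted message can be illustrated with a small, self-contained simulation. This is only a sketch and does not use any Iceberg API: the DataFile class, the spec IDs, the file names, and the helper functions are all hypothetical, and the behavior attributed to rewrite-partition-spec-mismatch is assumed from the proposal text (select files whose spec ID differs from the table's current spec).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a data file entry in table metadata."""
    path: str
    spec_id: int  # ID of the partition spec the file was written with
    month: str    # value of the month partition field, e.g. "2025-03"


# Assumed IDs: spec 0 partitions by month, spec 1 by month and day.
OLD_SPEC_ID, NEW_SPEC_ID = 0, 1


def select_by_filter(files, max_month):
    """Models rewrite-all=true plus a column filter: every match is selected."""
    return [f for f in files if f.month <= max_month]


def select_by_spec_mismatch(files, current_spec_id):
    """Models the proposed flag: select only files written with another spec."""
    return [f for f in files if f.spec_id != current_spec_id]


def run_until_failure(files, selected, commits_before_failure):
    """Simulates a partial-progress job that commits some files, then fails."""
    committed = {f.path for f in selected[:commits_before_failure]}
    return [
        replace(f, spec_id=NEW_SPEC_ID) if f.path in committed else f
        for f in files
    ]


# Four files written with the old spec and one written with the new spec.
table = [
    DataFile(f"old-{m}.parquet", OLD_SPEC_ID, f"2025-0{m}") for m in range(1, 5)
]
table.append(DataFile("new-7.parquet", NEW_SPEC_ID, "2025-07"))

# First run: the job rewrites two of the four selected files, then fails.
first_pass = select_by_filter(table, "2025-06")
table_after_failure = run_until_failure(table, first_pass, 2)

# Rerun: the column filter selects all four files again, while the
# spec-mismatch selection picks up only the two that are still on the old spec.
rerun_filter = select_by_filter(table_after_failure, "2025-06")
rerun_mismatch = select_by_spec_mismatch(table_after_failure, NEW_SPEC_ID)

print(len(rerun_filter), len(rerun_mismatch))  # prints "4 2"
```

In this model the filter-based rerun repeats work that was already committed, whereas selecting by spec ID resumes where the failed job stopped, without the user having to adjust the filter between runs.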
