What would the community think of exploiting tags for preventing concurrent
maintenance loop executions.

The issue:
Some maintenance tasks couldn't run parallel, like DeleteOrphanFiles vs.
ExpireSnapshots, or RewriteDataFiles vs. RewriteManifestFiles. We make
sure, not to run tasks started by a single trigger concurrently by
serializing them, but there are no loops in Flink, so we can't synchronize
tasks started by the next trigger.

In the document, I describe that we need to rely on the user to ensure that
the rate limit is high enough to prevent concurrent triggers.

Proposal:
When firing a trigger, RateLimiter could check and create an Iceberg table
tag [1] for the current table snapshot, with the name:
'__flink_maitenance'. When the execution finishes we remove this tag. The
RateLimiter keep accumulating changes, and doesn't fire new triggers until
it finds this tag on the table.
The solution relies on Flink 'forceNonParallel' to prevent concurrent
execution of placing the tag, and on Iceberg to store it.

This not uses the tags as intended, but seems like a better solution than
adding/removing table properties which would clutter the table history with
technical data.

Your thoughts? Any other suggestions/solutions would be welcome.

Thanks,
Peter

[1]
https://iceberg.apache.org/docs/latest/java-api-quickstart/#branching-and-tagging

On Thu, Mar 28, 2024, 14:44 Péter Váry <peter.vary.apa...@gmail.com> wrote:

> Hi Team,
>
> As discussed on yesterday's community sync, I am working on adding a
> possibility to the Flink Iceberg connector to run maintenance tasks on the
> Iceberg tables. This will fix the small files issues and in the long run
> help compacting the high number of positional and equality deletes created
> by Flink tasks writing CDC data to Iceberg tables without the need of Spark
> in the infrastructure.
>
> I did some planning, prototyping and currently trying out the solution on
> a larger scale.
>
> I put together a document how my current solution looks like:
>
> https://docs.google.com/document/d/16g3vR18mVBy8jbFaLjf2JwAANuYOmIwr15yDDxovdnA/edit?usp=sharing
>
> I would love to hear your thoughts and feedback on this to find a good
> final solution.
>
> Thanks,
> Peter
>

Reply via email to