priate actions (suitable for more expensive
> operations like merging equality deletes into data files, removing orphan
> files)
>
> I would rephrase this a bit differently: Offer an ability to create a
> Flink Table Maintenance service which will listen to the table changes and
>
rase this a bit differently: Offer an ability to create a Flink
Table Maintenance service which will listen to the table changes and
trigger and execute the appropriate actions. I think the main difference
here is that the whole monitoring/scheduling/executing is coupled into a
single job.
> - S
Iceberg tables.
> The `TriggerLockFactory.Lock` interface is specifically designed to allow
> the user to choose the prefered type of locking. I was trying to come up
> with a solution where the Apache Iceberg users don't need to rely on one
> more external system for the F
coordinate
>> externally.
>> [..]
>> > I agree with Ryan that an Iceberg solution is not a good choice here.
>>
>> I agree that we don't need to *rely* on locking in the Iceberg tables.
>> The `TriggerLockFactory.Lock` interface is specifically designed
. It should have a way to coordinate
> externally.
> [..]
> > I agree with Ryan that an Iceberg solution is not a good choice here.
>
> I agree that we don't need to *rely* on locking in the Iceberg tables.
> The `TriggerLockFactory.Lock` interface is specifically designed to allow
&g
actory.Lock` interface is specifically designed to allow the
> user to choose the prefered type of locking. I was trying to come up with a
> solution where the Apache Iceberg users don't need to rely on one more
> external system for the Flink Table Maintenance to work.
>
>
re the Apache Iceberg users don't need to rely on one more
external system for the Flink Table Maintenance to work.
I understand that this discussion is very similar to the HadoopCatalog
situation, where we have a hacky "solution" which is working in some cases,
but suboptimal.
instead of 'tag', and we could use the Catalogs atomic
>> change requirement to make locking atomic. My main concern with this
>> approach is that it relies on the linear history of the table and produces
>> more contention in write side. I was OK with this in the
logs atomic change
> requirement to make locking atomic. My main concern with this approach is
> that it relies on the linear history of the table and produces more
> contention in write side. I was OK with this in the very specific use-case
> with Flink Table Maintenance (few chang
talogs atomic change
requirement to make locking atomic. My main concern with this approach is
that it relies on the linear history of the table and produces more
contention in write side. I was OK with this in the very specific use-case
with Flink Table Maintenance (few changes, controlled time
t is done in the Kafka Connect sink? I think that's a
>> cleaner way to solve the problem if there is not going to be a way to fix
>> it in Flink.
>>
>> Ryan
>>
>> On Wed, Jul 31, 2024 at 7:45 AM Péter Váry
>> wrote:
>>
>>> Hi Team
aner way to solve the problem if there is not going to be a way to fix
> it in Flink.
>
> Ryan
>
> On Wed, Jul 31, 2024 at 7:45 AM Péter Váry
> wrote:
>
>> Hi Team,
>>
>> During the discussion around the Flink Table Maintenance [1], [2], I have
>> highlig
oblem if there is not going to be a way to fix
it in Flink.
Ryan
On Wed, Jul 31, 2024 at 7:45 AM Péter Váry
wrote:
> Hi Team,
>
> During the discussion around the Flink Table Maintenance [1], [2], I have
> highlighted that one of the main decision points is the way we preven
Hi Team,
During the discussion around the Flink Table Maintenance [1], [2], I have
highlighted that one of the main decision points is the way we prevent
concurrent Maintenance Tasks from happening concurrently.
At that time we did not find better solution than providing an interface
for locking
ources
>>>>>> (nodes) as the default jobs.
>>>>>>
>>>>>> 2. resources
>>>>>>> Occupying lots of resources when tasks are idle may be an issue, but
>>>>>>> to execute the tasks periodically in the job
n), which can trigger tasks by
>>>>>> executing CALL statements. In this way resource wasting can be avoided,
>>>>>> but
>>>>>> this is quite a different direction indeed.
>>>>>>
>>>>>
>>>>> CALL statements are
enance tasks by auto scaling. I'm afraid it's not a good
>>>>> idea to rely on it. As far as I know, rescaling it triggers needs to
>>>>> restart the job, which may not be acceptable by most of the users.
>>>>>
>>>>
>>>
>>>> If the table is only operated in one Flink job, we can use
>>>> OperatorCoordinator to coordinate the tasks. OperatorCoordinator may also
>>>> contact each other to ensure different kinds of tasks are executed in
>>>> expected ways
ead the small files, and create a single big file
>> 3. Updater - to collect the new files and update the Iceberg table
>>
>> Is it possible for the Update Operator to communicate with the Planner
>> Operator through the OperatorCoordinator?
>>
>>
>>> While I suppo
nism if we want to
>> coordinate the operation from different engines.
>>
>> (Am I replying to the correct mail?)
>>
>
> Yes :)
>
>>
>> Thanks,
>> Gen
>>
>> On Tue, Apr 9, 2024 at 12:29 AM Brian Olsen
>> wrote:
>>
>>>
anks,
>> Gen
>>
>> On Tue, Apr 9, 2024 at 12:29 AM Brian Olsen
>> wrote:
>>
>>> Hey Iceberg nation,
>>>
>>> I would like to share about the meeting this Wednesday to further
>>> discuss details of Péter's proposal on Flink Maintenance Tasks
asks.
>> Calendar Link: https://calendar.app.google/83HGYWXoQJ8zXuVCA
>>
>> List discussion:
>> https://lists.apache.org/thread/10mdf9zo6pn0dfq791nf4w1m7jh9k3sl
>> <https://www.google.com/url?q=https://lists.apache.org/thread/10mdf9zo6pn0dfq791nf4w1m7jh9k3sl&
.apache.org/thread/10mdf9zo6pn0dfq791nf4w1m7jh9k3sl
> <https://www.google.com/url?q=https://lists.apache.org/thread/10mdf9zo6pn0dfq791nf4w1m7jh9k3sl&sa=D&source=calendar&usd=2&usg=AOvVaw2-aePIRr6APFVHpRDipMgX>
>
> Design Doc: Flink table maintenance
> <https://www.google.com
- Technically would it be possible not to force partition cols into the PK?
I believe this is possible, but probably less performant. It is mentioned
in the docs https://iceberg.apache.org/spec/#scan-planning
>From the documentation:
"An equality delete file must be applied to a data file when a
9k3sl
<https://www.google.com/url?q=https://lists.apache.org/thread/10mdf9zo6pn0dfq791nf4w1m7jh9k3sl&sa=D&source=calendar&usd=2&usg=AOvVaw2-aePIRr6APFVHpRDipMgX>
Design Doc: Flink table maintenance
<https://www.google.com/url?q=https://docs.google.com/document/d/16g3vR18mVBy8j
Hi Peter,
Are you proposing to create a user facing locking feature in Iceberg, or
> just something something for internal use?
>
Since it's a general issue, I'm proposing to create a general user
interface first, while the implementation can be left to users. For
example, we use Airflow to sched
Hi Ajantha,
I thought about enabling post commit topology based compaction for sinks
using options, like we use for the parametrization of streaming reads [1].
I think it will be hard to do it in a user friendly way - because of the
high number of parameters -, but I think it is a possible solutio
Thanks for the proposal Peter.
I just wanted to know do we have any plans for supporting SQL syntax for
table maintenance (like CALL procedure) for pure Flink SQL users?
I didn't see any custom SQL parser plugin support in Flink. I also saw that
Branch write doesn't have SQL support (only Branch r
Hi Manu,
Just to clarify:
- Are you proposing to create a user facing locking feature in Iceberg, or
just something something for internal use?
I think we shouldn't add locking to Iceberg's user facing scope in this
stage. A fully featured locking system has many more features that we need
(prior
>
> What would the community think of exploiting tags for preventing
> concurrent maintenance loop executions.
This issue is not specific to Flink maintenance jobs. We have a service
scheduling Spark maintenance jobs by watching table commits. When we don't
check in-progress maintenance jobs for
What would the community think of exploiting tags for preventing concurrent
maintenance loop executions.
The issue:
Some maintenance tasks couldn't run parallel, like DeleteOrphanFiles vs.
ExpireSnapshots, or RewriteDataFiles vs. RewriteManifestFiles. We make
sure, not to run tasks started by a si
Hi Team,
As discussed on yesterday's community sync, I am working on adding a
possibility to the Flink Iceberg connector to run maintenance tasks on the
Iceberg tables. This will fix the small files issues and in the long run
help compacting the high number of positional and equality deletes creat
32 matches
Mail list logo