In line with what David said, after having to explain the (often subtle) issues 
around unaligned checkpoints and upgrades while teaching Flink, I would also be 
concerned about enabling it by default.

Would it be better to provide more automatic detection of situations where 
unaligned checkpoints would help, along with appropriate warnings?

— Ken

PS - and I hope I’m not banging on a lonely drum, but Fury supports schema 
evolution and is faster than the POJO serializer…so if we switched to that, we 
could in theory support evolution of checkpoints that contain on-the-wire 
records.

> On Jan 7, 2024, at 9:52 AM, David Anderson <dander...@apache.org> wrote:
> 
> Piotr, I think the situation is more nuanced than what you've described.
> 
> One concern I have is that unaligned checkpoints are somewhat less flexible
> in terms of which operational tasks can be safely performed with them --
> i.e., if you look at the table in the docs [1], aligned checkpoints support
> arbitrary job upgrades and Flink minor version upgrades, and unaligned
> checkpoints do not.
> 
> The change you propose makes the situation here more delicate, because for
> most users, most of their checkpoints will actually be aligned checkpoints
> (since their checkpoints will typically not contain any on-the-wire state),
> and so these unsupported operations would actually work -- but they could
> fail. So if a user is in the habit of doing job upgrades with checkpoints,
> is unaware of the danger posed by the change you propose, and continues
> to do these operations afterwards, their upgrades will probably continue to
> work -- until someday when they may mysteriously fail.
> 
> On a separate point, in the sentence below it seems to me it would be
> clearer to say that in the unlikely scenario you've described, the change
> would "significantly increase checkpoint sizes" -- assuming I understand
> things correctly.
> 
>> For those users [the] change to the unaligned checkpoints will
>> significantly increase state size, without any benefits.
> 
> It seems to me that the worst case would be situations where this
> increase in checkpoint size causes checkpoint failures because the
> available throughput to the checkpoint storage is insufficient to handle
> the increase in size, resulting in timeouts where it was (perhaps just
> barely) okay before.
> 
> Admittedly, this is perhaps a contrived scenario, but it is possible.
> 
> I haven't made up my mind about this proposal. Overall I'm unhappy about
> the level of complexity we've created, and am trying to figure out if this
> proposal makes things better or worse overall. At the moment I'm guessing
> it makes things better for a significant minority of users, and worse for a
> smaller minority.
> 
> [1]
> https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/ops/state/checkpoints_vs_savepoints/#capabilities-and-limitations
> 
> David
> 
> On Fri, Jan 5, 2024 at 5:42 AM Piotr Nowojski <pnowoj...@apache.org> wrote:
> 
>> Oops, fixing the topic.
>> 
>> Hi!
>>> 
>>> I would like to propose enabling unaligned checkpoints by default and
>>> simultaneously increasing the aligned checkpoint timeout from 0ms to
>>> 5s. I think this is the right change for the majority of Flink
>>> users.
>>> 
>>> For more rationale, please take a look at the short FLIP-413 [1].
>>> 
>>> What do you all think?
>>> 
>>> Best,
>>> Piotrek
>>> 
>>> 
>>> 
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-413%3A+Enable+unaligned+checkpoints+by+default
>>> 
>> 

--------------------------
Ken Krugler
http://www.scaleunlimited.com
Custom big data solutions
Flink & Pinot


