Piotr, I think the situation is more nuanced than what you've described. One concern I have is that unaligned checkpoints are somewhat less flexible in terms of which operational tasks can be safely performed with them -- i.e., if you look at the table in the docs [1], aligned checkpoints support arbitrary job upgrades and Flink minor version upgrades, and unaligned checkpoints do not.

The change you propose makes the situation here more delicate, because for most users, most of their checkpoints will actually be aligned checkpoints (since their checkpoints will typically not contain any on-the-wire state), and so these unsupported operations would usually work -- but they could fail whenever a checkpoint does contain such state. So if a user is in the habit of doing job upgrades with checkpoints, is unaware of the danger posed by the change you propose, and continues to do these operations afterwards, their upgrades will probably continue to work -- until someday when they may mysteriously fail.

On a separate point, in the sentence below it seems to me it would be clearer to say that in the unlikely scenario you've described, the change would "significantly increase checkpoint sizes" -- assuming I understand things correctly.

> For those users [the] change to the unaligned checkpoints will significantly increase state size, without any benefits.

It seems to me that the worst case would be situations where this increase in checkpoint size causes checkpoint failures because the available throughput to the checkpoint storage is insufficient to handle the increase in size, resulting in timeouts where it was (perhaps just barely) okay before. Admittedly, this is perhaps a contrived scenario, but it is possible.

I haven't made up my mind about this proposal. Overall I'm unhappy about the level of complexity we've created, and am trying to figure out if this proposal makes things better or worse overall. At the moment I'm guessing it makes things better for a significant minority of users, and worse for a smaller minority.

[1] https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/ops/state/checkpoints_vs_savepoints/#capabilities-and-limitations

David

On Fri, Jan 5, 2024 at 5:42 AM Piotr Nowojski <pnowoj...@apache.org> wrote:

> Ops, fixing the topic.
> > Hi!
> >
> > I would like to propose by default to enable unaligned checkpoints and
> > also simultaneously increase the aligned checkpoints timeout from 0ms to
> > 5s. I think this change is the right one to do for the majority of Flink
> > users.
> >
> > For more rationale please take a look into the short FLIP-413 [1].
> >
> > What do you all think?
> >
> > Best,
> > Piotrek
> >
> > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-413%3A+Enable+unaligned+checkpoints+by+default