In line with what David said, after having to explain the (often subtle) issues around unaligned checkpoints and upgrades while teaching Flink, I would also be concerned about enabling it by default.
Would it be better to provide more automatic detection of situations where unaligned checkpoints helped, along with appropriate warnings?

-- Ken

PS - and I hope I’m not banging on a lonely drum, but Fury supports schema evolution and is faster than the POJO serializer…so if we switched to that, we could in theory support evolution of checkpoints that contain on-the-wire records.

> On Jan 7, 2024, at 9:52 AM, David Anderson <dander...@apache.org> wrote:
>
> Piotr, I think the situation is more nuanced than what you've described.
>
> One concern I have is that unaligned checkpoints are somewhat less flexible
> in terms of which operational tasks can be safely performed with them --
> i.e., if you look at the table in the docs [1], aligned checkpoints support
> arbitrary job upgrades and flink minor version upgrades, and unaligned
> checkpoints do not.
>
> The change you propose makes the situation here more delicate, because for
> most users, most of their checkpoints will actually be aligned checkpoints
> (since their checkpoints will typically not contain any on-the-wire state),
> and so these unsupported operations would actually work -- but they could
> fail. So if a user is in the habit of doing job upgrades with checkpoints,
> and are unaware of the danger posed by the change you propose, and continue
> to do these operations afterwards, their upgrades will probably continue to
> work -- until someday when they may mysteriously fail.
>
> On a separate point, in the sentence below it seems to me it would be
> clearer to say that in the unlikely scenario you've described, the change
> would "significantly increase checkpoint sizes" -- assuming I understand
> things correctly.
>
>> For those users [the] change to the unaligned checkpoints will
>> significantly increase state size, without any benefits.
>
> It seems to me that the worst case would be situations where this
> increase in checkpoint size causes checkpoint failures because the
> available throughput to the checkpoint storage is insufficient to handle
> the increase in size, resulting in timeouts where it was (perhaps just
> barely) okay before.
>
> Admittedly, this is perhaps a contrived scenario, but it is possible.
>
> I haven't made up my mind about this proposal. Overall I'm unhappy about
> the level of complexity we've created, and am trying to figure out if this
> proposal makes things better or worse overall. At the moment I'm guessing
> it makes things better for a significant minority of users, and worse for a
> smaller minority.
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/ops/state/checkpoints_vs_savepoints/#capabilities-and-limitations
>
> David
>
> On Fri, Jan 5, 2024 at 5:42 AM Piotr Nowojski <pnowoj...@apache.org> wrote:
>
>> Ops, fixing the topic.
>>
>> Hi!
>>>
>>> I would like to propose by default to enable unaligned checkpoints and
>>> also simultaneously increase the aligned checkpoints timeout from 0ms to
>>> 5s. I think this change is the right one to do for the majority of Flink
>>> users.
>>>
>>> For more rationale please take a look into the short FLIP-413 [1].
>>>
>>> What do you all think?
>>>
>>> Best,
>>> Piotrek
>>>
>>>
>>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-413%3A+Enable+unaligned+checkpoints+by+default
>>>
>>

--------------------------
Ken Krugler
http://www.scaleunlimited.com
Custom big data solutions
Flink & Pinot