Piotr, I think the situation is more nuanced than what you've described.

One concern I have is that unaligned checkpoints are somewhat less flexible
in terms of which operational tasks can be safely performed with them --
i.e., if you look at the table in the docs [1], aligned checkpoints support
arbitrary job upgrades and Flink minor version upgrades, and unaligned
checkpoints do not.

The change you propose makes the situation here more delicate, because for
most users, most of their checkpoints will actually be aligned checkpoints
(since their checkpoints will typically not contain any on-the-wire state),
and so these unsupported operations would usually work -- but they could
fail. So if users are in the habit of doing job upgrades with checkpoints,
are unaware of the danger posed by the change you propose, and continue
to do these operations afterwards, their upgrades will probably continue to
work -- until someday when they may mysteriously fail.

On a separate point, in the sentence below it seems to me it would be
clearer to say that in the unlikely scenario you've described, the change
would "significantly increase checkpoint sizes" -- assuming I understand
things correctly.

> For those users [the] change to the unaligned checkpoints will
> significantly increase state size, without any benefits.

It seems to me that the worst case would be situations where this
increase in checkpoint size causes checkpoint failures because the
available throughput to the checkpoint storage is insufficient to handle
the increase in size, resulting in timeouts where it was (perhaps just
barely) okay before.

Admittedly, this is perhaps a contrived scenario, but it is possible.
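
To put rough numbers on that scenario, here is a back-of-the-envelope
sketch in Java. Everything in it is made up for illustration: the class and
method names, the 100 MiB/s of throughput to checkpoint storage, the 10
minute timeout, and both checkpoint sizes. It also pretends the upload is a
single stream at a constant rate, which real jobs are not.

```java
import java.time.Duration;

public class CheckpointTimeoutSketch {

    /** Time to upload a checkpoint at a constant rate, rounded up to whole milliseconds. */
    static Duration uploadTime(long checkpointBytes, long bytesPerSecond) {
        long millis = (checkpointBytes * 1000 + bytesPerSecond - 1) / bytesPerSecond;
        return Duration.ofMillis(millis);
    }

    /** True if the upload finishes within the checkpoint timeout. */
    static boolean completesInTime(long checkpointBytes, long bytesPerSecond, Duration timeout) {
        return uploadTime(checkpointBytes, bytesPerSecond).compareTo(timeout) <= 0;
    }

    public static void main(String[] args) {
        final long mib = 1024L * 1024L;

        // Hypothetical numbers, chosen only to illustrate the "just barely okay" case.
        long throughput = 100 * mib;                // bytes/s to checkpoint storage
        Duration timeout = Duration.ofMinutes(10);  // assumed checkpoint timeout
        long alignedSize = 55_000 * mib;            // operator state only
        long withInFlightData = 65_000 * mib;       // same state plus persisted in-flight buffers

        System.out.println("aligned:   " + uploadTime(alignedSize, throughput).getSeconds()
                + " s, ok = " + completesInTime(alignedSize, throughput, timeout));
        System.out.println("unaligned: " + uploadTime(withInFlightData, throughput).getSeconds()
                + " s, ok = " + completesInTime(withInFlightData, throughput, timeout));
    }
}
```

With these numbers the first checkpoint needs 550 s and fits inside the
timeout, while the second needs 650 s and does not.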

I haven't made up my mind about this proposal. Overall I'm unhappy about
the level of complexity we've created, and am trying to figure out if this
proposal makes things better or worse overall. At the moment I'm guessing
it makes things better for a significant minority of users, and worse for a
smaller minority.

[1]
https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/ops/state/checkpoints_vs_savepoints/#capabilities-and-limitations

David

On Fri, Jan 5, 2024 at 5:42 AM Piotr Nowojski <pnowoj...@apache.org> wrote:

> Oops, fixing the topic.
>
> Hi!
> >
> > I would like to propose by default to enable unaligned checkpoints and
> > also simultaneously increase the aligned checkpoints timeout from 0ms to
> > 5s. I think this change is the right one to do for the majority of Flink
> > users.
> >
> > For more rationale please take a look into the short FLIP-413 [1].
> >
> > What do you all think?
> >
> > Best,
> > Piotrek
> >
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-413%3A+Enable+unaligned+checkpoints+by+default
> >
>
