Re: Expected behaviour when changing operator parallelism but starting from an incremental checkpoint

Aaron Levin Fri, 13 Mar 2020 09:40:02 -0700

Hi Piotr,

Thanks for your response! I understand that checkpoints and savepoints may
be diverging (for unaligned checkpoints) but parts also seem to be
converging per FLIP-47[0]. Specifically, in FLIP-47 they state that
rescaling is "Supported but not in all cases" for checkpoints. What I'm
hoping to find is guidance or documentation on when rescaling is supported
for checkpoints, and, more importantly, if the cases where it's not
supported will result in hard or silent failures.


The context here is that we rely on the exactly-once semantics for our
Flink jobs in some important systems. In some cases when a job is in a bad
state it may not be able to take a checkpoint, but changing the job's
parallelism may resolve the issue. Therefore it's important for us to know
if deploying from a checkpoint, on purpose or by operator error, will break
the semantic guarantees of our job.

Hard failure in the cases where you cannot change parallelism would be the
desired outcome imo.

Thank you!

[0]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-47%3A+Checkpoints+vs.+Savepoints

Best,

Aaron Levin

On Fri, Mar 13, 2020 at 9:08 AM Piotr Nowojski <pi...@ververica.com> wrote:

> Hi,
>
> Generally speaking changes of parallelism is supported between checkpoints
> and savepoints. Other changes to the job’s topology, like
> adding/changing/removing operators, changing types in the job graph are
> only officially supported via savepoints.
>
> But in reality, as for now, there is no difference between checkpoints and
> savepoints, but that’s subject to change, so it’s better not to relay this
> behaviour. For example with unaligned checkpoints [1] (hopefully in 1.11),
> there will be a difference between those two concepts.
>
> Piotrek
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-76%3A+Unaligned+Checkpoints
> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-76:+Unaligned+Checkpoints>
>
> On 12 Mar 2020, at 12:16, Aaron Levin <aaronle...@stripe.com> wrote:
>
> Hi,
>
> What's the expected behaviour of:
>
> * changing an operator's parallelism
> * deploying this change from an incremental (RocksDB) checkpoint instead
> of a savepoint
>
> The flink docs[0][1] are a little unclear on what the expected behaviour
> is here. I understand that the key-space is being changed because
> parallelism is changed. I've seen instances where this happens and a job
> does not fail. But how does it treat potentially missing state for a given
> key?
>
> I know I can test this, but I'm curious what the _expected_ behaviour is?
> I.e. what behaviour can I rely on, which won't change between versions or
> releases? Do we expect the job to fail? Do we expect missing keys to just
> be considered empty?
>
> Thanks!
>
> [0]
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#what-is-a-savepoint-how-is-a-savepoint-different-from-a-checkpoint
> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/upgrading.html
>
> Aaron Levin
>
>
>

Re: Expected behaviour when changing operator parallelism but starting from an incremental checkpoint

Reply via email to