Re: Expected behaviour when changing operator parallelism but starting from an incremental checkpoint

Piotr Nowojski Mon, 16 Mar 2020 00:40:28 -0700

Hi Seth,

> Currently, all rescaling operations technically work with checkpoints. That 
> is purely by chance that the implementation supports that, and the line is 
> because the community is not committed to maintaining that functionality


Are you sure that’s the case? Support for rescaling from checkpoint is as far 
as I know, something that we want/need to have:
- if your cluster has just lost a node due to some hardware failure, without 
downscaling support your job will not be able to recover
- future planned life rescaling efforts

Also this [1] seems to contradict your statement?

Lack of support for rescaling for unaligned checkpoints will be hopefully a 
temporarily limitation of the first version and it’s on our roadmap to solve 
this in the future.

Piotrek

[1] 
https://flink.apache.org/features/2017/07/04/flink-rescalable-state.html#rescaling-stateful-stream-processing-jobs
 
<https://flink.apache.org/features/2017/07/04/flink-rescalable-state.html#rescaling-stateful-stream-processing-jobs>

> On 13 Mar 2020, at 17:44, Seth Wiesman <s...@ververica.com> wrote:
> 
> Hi Aaron, 
> 
> Currently, all rescaling operations technically work with checkpoints. That 
> is purely by chance that the implementation supports that, and the line is 
> because the community is not committed to maintaining that functionality. As 
> we add cases, such as unaligned checkpoints, which actually prevent rescaling 
> the documentation will be updated accordingly. FLIP-47 has more to do with 
> consolidating terminology and how actions are triggered and are not 
> particularly relevant to the discussion of rescaling jobs. 
> 
> On Fri, Mar 13, 2020 at 11:39 AM Aaron Levin <aaronle...@stripe.com 
> <mailto:aaronle...@stripe.com>> wrote:
> Hi Piotr,
> 
> Thanks for your response! I understand that checkpoints and savepoints may be 
> diverging (for unaligned checkpoints) but parts also seem to be converging 
> per FLIP-47[0]. Specifically, in FLIP-47 they state that rescaling is 
> "Supported but not in all cases" for checkpoints. What I'm hoping to find is 
> guidance or documentation on when rescaling is supported for checkpoints, 
> and, more importantly, if the cases where it's not supported will result in 
> hard or silent failures. 
> 
> The context here is that we rely on the exactly-once semantics for our Flink 
> jobs in some important systems. In some cases when a job is in a bad state it 
> may not be able to take a checkpoint, but changing the job's parallelism may 
> resolve the issue. Therefore it's important for us to know if deploying from 
> a checkpoint, on purpose or by operator error, will break the semantic 
> guarantees of our job.
> 
> Hard failure in the cases where you cannot change parallelism would be the 
> desired outcome imo.
> 
> Thank you!
> 
> [0] 
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-47%3A+Checkpoints+vs.+Savepoints
>  
> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-47%3A+Checkpoints+vs.+Savepoints>
> 
> Best,
> 
> Aaron Levin
> 
> On Fri, Mar 13, 2020 at 9:08 AM Piotr Nowojski <pi...@ververica.com 
> <mailto:pi...@ververica.com>> wrote:
> Hi,
> 
> Generally speaking changes of parallelism is supported between checkpoints 
> and savepoints. Other changes to the job’s topology, like 
> adding/changing/removing operators, changing types in the job graph are only 
> officially supported via savepoints.
> 
> But in reality, as for now, there is no difference between checkpoints and 
> savepoints, but that’s subject to change, so it’s better not to relay this 
> behaviour. For example with unaligned checkpoints [1] (hopefully in 1.11), 
> there will be a difference between those two concepts.
> 
> Piotrek
> 
> [1] 
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-76%3A+Unaligned+Checkpoints
>  
> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-76:+Unaligned+Checkpoints>
> 
>> On 12 Mar 2020, at 12:16, Aaron Levin <aaronle...@stripe.com 
>> <mailto:aaronle...@stripe.com>> wrote:
>> 
>> Hi,
>> 
>> What's the expected behaviour of:
>> 
>> * changing an operator's parallelism
>> * deploying this change from an incremental (RocksDB) checkpoint instead of 
>> a savepoint
>> 
>> The flink docs[0][1] are a little unclear on what the expected behaviour is 
>> here. I understand that the key-space is being changed because parallelism 
>> is changed. I've seen instances where this happens and a job does not fail. 
>> But how does it treat potentially missing state for a given key? 
>> 
>> I know I can test this, but I'm curious what the _expected_ behaviour is? 
>> I.e. what behaviour can I rely on, which won't change between versions or 
>> releases? Do we expect the job to fail? Do we expect missing keys to just be 
>> considered empty? 
>> 
>> Thanks!
>> 
>> [0] 
>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#what-is-a-savepoint-how-is-a-savepoint-different-from-a-checkpoint
>>  
>> <https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#what-is-a-savepoint-how-is-a-savepoint-different-from-a-checkpoint>
>> [1] 
>> https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/upgrading.html
>>  
>> <https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/upgrading.html>
>> 
>> Aaron Levin
> 
> 
> 
> -- 
> Seth Wiesman | Solutions Architect
> +1 314 387 1463
> 
>  <https://www.ververica.com/>
> Follow us @VervericaData
> --
> Join Flink Forward <https://flink-forward.org/> - The Apache Flink Conference
> Stream Processing | Event Driven | Real Time
> --
> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
> --
> Ververica GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen

Re: Expected behaviour when changing operator parallelism but starting from an incremental checkpoint

Reply via email to