Hi, I'm working on a streaming application using Flink. Several steps in the processing are state-full (I use custom Windows and state-full operators ).
Now if during a normal run an worker fails the checkpointing system will be used to recover. But what if the entire application is stopped (deliberately) or stops/fails because of a problem? At this moment I have three main reasons/causes for doing this: 1) The application just dies because of a bug on my side or a problem like for example this (which I'm actually confronted with): *Failed to Update HDFS Delegation Token for long running application in HA mode * https://issues.apache.org/jira/browse/HDFS-9276 2) I need to rebalance my application (i.e. stop, change parallelism, start) 3) I need a new version of my software to be deployed. (i.e. I fixed a bug, changed the topology and need to continue) I assume the solution will be in some part be specific for my application. The question is what features exist in Flink to support such a clean "continue where I left of" scenario? -- Best regards / Met vriendelijke groeten, Niels Basjes