Hi all,

Please allow me to raise some points, in combination with FLIP-45 [1], for discussion. Please don't be confused if some of them are inconsistent with, or even opposite to, the current proposals in FLIP-47 (of which I am a co-author): as Kostas pointed out, the discussion is still in progress and hasn't reached a consensus, but we all agreed to move it to the public list to collect more feedback.
FLIP-45 and FLIP-47 both touch on cleaning up the checkpoint and savepoint concepts, but in two different ways. Below is my understanding of their differences and pros/cons:

* FLIP-45 proposes to map the concepts of Flink checkpoint and savepoint to database checkpoint and backup, and furthermore to map the periodic system-triggered checkpoint to a fuzzy [2] checkpoint and the stop-with-checkpoint to a sharp [3] checkpoint. It also raises the question of whether we should introduce a Flink concept corresponding to the database snapshot, for which IMHO FLIP-47 could be a good starting point for discussion.
   - Pros
      - No change from the user's perspective, both conceptually and physically, thus no additional education cost. (The semantic corrections are mainly for developers to understand.)
      - Mapping the concepts to a mature system (database) could help make them clear, and also make it easier to implement and explain db-like functions in the future, such as FLIP-43 [4] and streaming ledger [5]
   - Cons
      - Less beneficial for developers with no database experience (they need to learn database concepts to understand Flink's)
      - One may argue that Flink is Flink (a stream processing engine), not a database

* FLIP-47 proposes to unify the concepts of Flink checkpoint and savepoint into snapshot, with a unified command.
   - Pros
      - Pure Flink concepts, no additional cost to learn/compare concepts in other systems
      - Unified semantics from the developer's perspective
   - Cons
      - Detectable change from the user's perspective: users need to re-map the existing checkpoint/savepoint use cases to new commands
         - Currently: checkpoint for failover, savepoint for upgrade/state-migration/switch-backend/import-export/blue-red-deployment
         - Future: every use case maps to the newly introduced command, for example (the command format is just pseudo):
            - Command format:
               - take snapshot [mode] [format]
               - mode: full(default), incremental
               - format: UNIFIED, DEFAULT(default, backend-specific)
            - Use cases:
               - Resume after stop/cancel: take snapshot incremental DEFAULT
               - Upgrade: take snapshot full DEFAULT
               - State migration: take snapshot full DEFAULT
               - Switch backend: take snapshot full UNIFIED
               - Blue/red deployment: take snapshot incremental DEFAULT
      - No new functionality supplied, but requires user action

And please correct me or add to this if I've stated anything wrong or missed anything @Kostas @Aljoscha @Konstantin. Thanks!

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-45%3A+Reinforce+Job+Stop+Semantic
[2] https://dev.mysql.com/doc/refman/8.0/en/glossary.html#glos_fuzzy_checkpointing
[3] https://dev.mysql.com/doc/refman/8.0/en/glossary.html#glos_sharp_checkpoint
[4] https://cwiki.apache.org/confluence/display/FLINK/FLIP-43%3A+State+Processor+API
[5] https://github.com/dataArtisans/da-streamingledger

Best Regards,
Yu

On Wed, 10 Jul 2019 at 15:02, Congxian Qiu <qcx978132...@gmail.com> wrote:

> Hi Kostas
>
> Thanks for bringing this up. Currently, there are indeed some overlaps
> between checkpoint and savepoint that confuse users. I think the
> FLIP's proposal can give users a clearer description.
>
> About the FLIP, I have a question about "Deleting or moving a snapshot
> must be done by Flink": it seems that we will support MOVE/DELETE of a
> stopped job's snapshot.
> What should the user do when he/she wants to DELETE/MOVE
> a stopped job's snapshot?
>
> Best,
> Congxian
>
>
> Becket Qin <becket....@gmail.com> wrote on Wed, Jul 10, 2019 at 9:33 AM:
>
> > Hi Kostas,
> >
> > It makes a lot of sense to just have one underlying mechanism (snapshot)
> > to save the state of a Flink job. And we can use that mechanism in
> > different scenarios, including checkpoint and user-triggered savepoint.
> >
> > To facilitate the discussion, maybe it is useful to clarify a few design
> > goals, for example:
> >
> > 1. One unified snapshot format that supports
> >    - both incremental and global state saving
> >    - rescaling on recovery
> >    - compatibility check / migration across different Flink versions?
> > 2. The snapshot can easily be managed by users.
> >
> >
> > And I have two questions regarding the FLIP.
> >
> > 1. What are the side-effects when taking a snapshot? Do you mean that
> > taking a snapshot may trigger some action other than saving the state of
> > the job? Technically speaking, taking a snapshot should be a "read-only"
> > action on the Flink job. So I assume that by side-effects you meant it
> > is no longer read-only. If so, can you be more specific about which
> > side-effects you are referring to?
> >
> > 2. In the rejected alternative, you mentioned a scenario of A/B testing.
> > It seems that if execution A and execution B run with different
> > configurations after the savepoint, the history of the two jobs will
> > always be different after that, right?
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Mon, Jul 8, 2019 at 9:53 PM Kostas Kloudas <kklou...@gmail.com> wrote:
> >
> > > Hi Devs,
> > >
> > > Currently there are a number of efforts around checkpoints/savepoints,
> > > as reflected by the number of FLIPs. From a quick look, FLIP-34,
> > > FLIP-41, FLIP-43, and FLIP-45 are all directly related to these topics.
> > > This reflects the importance of these two notions/features to the users
> > > of the framework.
> > >
> > > Although many efforts are centred around these notions, their semantics
> > > and the interplay between them are not always clearly defined. This
> > > makes them difficult to explain to the users (all the different
> > > combinations of state-backends, formats and tradeoffs), and in some
> > > cases it may have negative effects on the users (e.g. the
> > > already-fixed-some-time-ago issue of savepoints not being considered
> > > for recovery although they committed side-effects).
> > >
> > > FLIP-47 [1] and the related document [2] aim at starting a discussion
> > > around the semantics of savepoints/checkpoints and their interplay, and
> > > to some extent at helping us define the future steps concerning these
> > > notions. As an example, should we work towards bringing them closer, or
> > > moving them further apart?
> > >
> > > This is not a complete proposal (by no means), as many of the practical
> > > implications can only be fleshed out after we agree on the basic
> > > semantics and the general frame around these notions. To that end,
> > > there are no concrete implementation steps and the FLIP is going to be
> > > updated as the discussion continues.
> > >
> > > I am really looking forward to your opinions on the topic.
> > >
> > > Cheers,
> > > Kostas
> > >
> > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-47%3A+Checkpoints+vs.+Savepoints
> > > [2] https://docs.google.com/document/d/1_1FF8D3u0tT_zHWtB-hUKCP_arVsxlmjwmJ-TvZd4fs/edit?usp=sharing
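
For readers who want to experiment with the pseudo command format from the top-most message in this thread (take snapshot [mode] [format]), the use-case mapping can be written down as a small lookup table. This is only an illustrative sketch in Python, not a Flink API or a proposed CLI: the names `Mode`, `Format`, `SnapshotCommand`, `USE_CASES` and `command_for` are hypothetical, and the defaults (full, DEFAULT) are taken from the pseudo format as stated in that message.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    FULL = "full"                # default mode in the pseudo format
    INCREMENTAL = "incremental"


class Format(Enum):
    DEFAULT = "DEFAULT"          # backend-specific, default format
    UNIFIED = "UNIFIED"          # backend-independent format


@dataclass(frozen=True)
class SnapshotCommand:
    """Hypothetical model of the pseudo 'take snapshot' command."""
    mode: Mode = Mode.FULL
    fmt: Format = Format.DEFAULT

    def render(self) -> str:
        # Render the command in the pseudo syntax used in the thread.
        return f"take snapshot {self.mode.value} {self.fmt.value}"


# Use-case mapping exactly as listed in the message (keys lower-cased).
USE_CASES = {
    "resume after stop/cancel": SnapshotCommand(Mode.INCREMENTAL, Format.DEFAULT),
    "upgrade": SnapshotCommand(Mode.FULL, Format.DEFAULT),
    "state migration": SnapshotCommand(Mode.FULL, Format.DEFAULT),
    "switch backend": SnapshotCommand(Mode.FULL, Format.UNIFIED),
    "blue/red deployment": SnapshotCommand(Mode.INCREMENTAL, Format.DEFAULT),
}


def command_for(use_case: str) -> str:
    """Look up the pseudo command for a use case (case-insensitive)."""
    return USE_CASES[use_case.lower()].render()


if __name__ == "__main__":
    for name in USE_CASES:
        print(f"{name}: {command_for(name)}")
```

Written this way, the table makes the "Cons" point visible: five existing use cases collapse onto only three distinct mode/format combinations, so users would need to learn which combination replaces each checkpoint or savepoint workflow they use today.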