Re: [FLIP-47] Savepoints vs Checkpoints

Yu Li Wed, 10 Jul 2019 02:05:52 -0700

Hi all,

Please allow me to raise some points that bring FLIP-45 [1] into the
discussion, and please don't be confused if some of them are inconsistent
with, or even opposite to, the current proposals in FLIP-47 (of which I am
a co-author). As Kostas pointed out, the discussion is still in progress
and hasn't reached a consensus, but we all agreed to move it to the public
list to collect more feedback.


FLIP-45 and FLIP-47 both touch on cleaning up the checkpoint and savepoint
concepts, but in two different ways. Below is my understanding of their
differences and pros/cons:

* FLIP-45 proposes to map the concepts of Flink checkpoint and savepoint to
database checkpoint and backup. Furthermore, it maps the periodic
system-triggered checkpoint to the fuzzy [2] checkpoint and the
stop-with-checkpoint to the sharp [3] checkpoint. It also raises the
question of whether we should introduce a Flink concept corresponding to
the database snapshot, for which IMHO FLIP-47 could be a good starting
point for discussion.

   - Pros
      - No change from the user's perspective, either conceptually or
      physically, thus no additional education cost. (The semantic
      corrections are mainly for developers to understand.)
      - Mapping the concepts to a mature system (database) could help make
      them clear, and would make it easier to implement and explain db-like
      functions in the future, such as FLIP-43 [4] and streaming ledger [5]
   - Cons
      - Less beneficial for developers with no database experience (they
      need to learn database concepts to understand Flink's)
      - One may argue that Flink is Flink (a stream processing engine), not
      a database


* FLIP-47 proposes to unify the concepts of Flink checkpoint and savepoint
into a single snapshot concept, with a unified command.

   - Pros
      - Pure Flink concepts, no additional cost of learning/comparing
      concepts from other systems
      - Unified semantics from the developer's perspective
   - Cons
      - Detectable change from the user's perspective: users need to re-map
      the existing checkpoint/savepoint use cases to new commands
         - Currently: checkpoint for failover, savepoint for
         upgrade/state-migration/switch-backend/import-export/blue-red-deployment
         - Future: every use case maps to the newly introduced command, for
         example (the command format is just pseudo):
            - Command format:
               - take snapshot [mode] [format]
               - mode: full (default), incremental
               - format: UNIFIED, DEFAULT (default, backend specific)
            - Use cases:
               - Resume after stop/cancel: take snapshot incremental DEFAULT
               - Upgrade: take snapshot full DEFAULT
               - State migration: take snapshot full DEFAULT
               - Switch backend: take snapshot full UNIFIED
               - Blue/red deployment: take snapshot incremental DEFAULT


      - No new functionality is supplied, yet user action is required
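
To make the pseudo command for FLIP-47 a bit more concrete, here is a minimal sketch in Python of how "take snapshot [mode] [format]" could be parsed and how the use cases listed in this mail map onto it. Everything named here (`SnapshotCommand`, `parse_take_snapshot`, `USE_CASES`) is hypothetical and only for illustration; none of it is an existing Flink API or part of the FLIP.

```python
from dataclasses import dataclass

# Values taken from the pseudo command format in this mail.
MODES = ("full", "incremental")    # "full" is the default
FORMATS = ("DEFAULT", "UNIFIED")   # "DEFAULT" is the backend-specific default


@dataclass(frozen=True)
class SnapshotCommand:
    """Hypothetical parsed form of `take snapshot [mode] [format]`."""
    mode: str = "full"
    fmt: str = "DEFAULT"


def parse_take_snapshot(command: str) -> SnapshotCommand:
    """Parse the pseudo command, filling in defaults for omitted arguments."""
    tokens = command.split()
    if tokens[:2] != ["take", "snapshot"]:
        raise ValueError(f"not a 'take snapshot' command: {command!r}")
    args = tokens[2:]
    mode, fmt = "full", "DEFAULT"
    # Both arguments are optional, but mode must come before format.
    if args and args[0] in MODES:
        mode = args.pop(0)
    if args and args[0] in FORMATS:
        fmt = args.pop(0)
    if args:
        raise ValueError(f"unexpected arguments: {args}")
    return SnapshotCommand(mode, fmt)


# Use-case mapping exactly as listed in this mail.
USE_CASES = {
    "resume after stop/cancel": "take snapshot incremental DEFAULT",
    "upgrade": "take snapshot full DEFAULT",
    "state migration": "take snapshot full DEFAULT",
    "switch backend": "take snapshot full UNIFIED",
    "blue/red deployment": "take snapshot incremental DEFAULT",
}

if __name__ == "__main__":
    for use_case, command in USE_CASES.items():
        print(use_case, "->", parse_take_snapshot(command))
```

Note that under this mapping only "switch backend" needs the UNIFIED format, and three of the five use cases reduce to the same full/DEFAULT or incremental/DEFAULT pair, which illustrates the re-mapping effort users would face.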

And please correct me or add anything I've missed or stated wrongly,
@Kostas @Aljoscha @Konstantin. Thanks!

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-45%3A+Reinforce+Job+Stop+Semantic
[2]
https://dev.mysql.com/doc/refman/8.0/en/glossary.html#glos_fuzzy_checkpointing
[3]
https://dev.mysql.com/doc/refman/8.0/en/glossary.html#glos_sharp_checkpoint
[4]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-43%3A+State+Processor+API
[5] https://github.com/dataArtisans/da-streamingledger

Best Regards,
Yu


On Wed, 10 Jul 2019 at 15:02, Congxian Qiu <qcx978132...@gmail.com> wrote:

> Hi Kostas
>
> Thanks for bringing this up. Currently, there are indeed some overlaps
> between checkpoint and savepoint that can confuse users. I think the
> FLIP's proposal can give users a clearer description.
>
> About the FLIP, I have a question about "Deleting or moving a snapshot
> must be done by Flink". It seems that we will support MOVE/DELETE for a
> stopped job's snapshot. What should the user do when he/she wants to
> DELETE/MOVE a stopped job's snapshot?
>
> Best,
> Congxian
>
>
> Becket Qin <becket....@gmail.com> wrote on Wed, Jul 10, 2019 at 9:33 AM:
>
> > Hi Kostas,
> >
> > It makes a lot of sense to just have one underlying mechanism (snapshot)
> > to save the state of a Flink job. And we can use that mechanism in
> > different scenarios, including checkpoint and user-triggered savepoint.
> >
> > To facilitate the discussion, maybe it is useful to clarify a few design
> > goals, for example:
> >
> > 1. one unified snapshot format that supports
> >      - both incremental and global state saving
> >      - rescaling on recovery
> >      - compatibility check / migration across different Flink versions?
> > 2. The snapshot can easily be managed by users.
> >
> >
> > And I have two questions regarding the FLIP.
> >
> > 1. What are the side-effects when taking a snapshot? Do you mean that
> > taking a snapshot may trigger some action other than saving the state of
> > the job? Technically speaking, taking a snapshot should be a "read-only"
> > action on the Flink job. So I assume that by side-effects you meant it's
> > no longer read-only. If so, can you be more specific about which
> > side-effects you are referring to?
> >
> > 2. In the rejected alternative, you mentioned a scenario of A/B testing.
> > It seems that if execution A and execution B run different configurations
> > after the savepoint, the history of the two jobs will always be different
> > after that, right?
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Mon, Jul 8, 2019 at 9:53 PM Kostas Kloudas <kklou...@gmail.com> wrote:
> >
> > > Hi Devs,
> > >
> > > Currently there are a number of efforts around checkpoints/savepoints,
> > > as reflected by the number of FLIPs. From a quick look, FLIP-34, FLIP-41,
> > > FLIP-43, and FLIP-45 are all directly related to these topics. This
> > > reflects the importance of these two notions/features to the users of
> > > the framework.
> > >
> > > Although many efforts are centred around these notions, their semantics
> > > and the interplay between them are not always clearly defined. This makes
> > > them difficult to explain to the users (all the different combinations of
> > > state-backends, formats and tradeoffs) and in some cases it may have
> > > negative effects on the users (e.g. the already-fixed-some-time-ago issue
> > > of savepoints not being considered for recovery although they committed
> > > side-effects).
> > >
> > > FLIP-47 [1] and the related document [2] aim at starting a discussion
> > > around the semantics of savepoints/checkpoints and their interplay, and
> > > to some extent help us fix the future steps concerning these notions. As
> > > an example, should we work towards bringing them closer, or moving them
> > > further apart?
> > >
> > > This is not a complete proposal (by no means), as many of the practical
> > > implications can only be fleshed out after we agree on the basic
> > > semantics and the general frame around these notions. To that end, there
> > > are no concrete implementation steps and the FLIP is going to be updated
> > > as the discussion continues.
> > >
> > > I am really looking forward to your opinions on the topic.
> > >
> > > Cheers,
> > > Kostas
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-47%3A+Checkpoints+vs.+Savepoints
> > > [2]
> > > https://docs.google.com/document/d/1_1FF8D3u0tT_zHWtB-hUKCP_arVsxlmjwmJ-TvZd4fs/edit?usp=sharing
