Hey Niels!

Stephan gave a very good summary of the current state of things. What do you 
think of the outlined stop-with-savepoint approach?

Regarding the broken links: I’ve fixed various broken links in the master docs 
yesterday. If you encounter something again, feel free to post it to the ML or 
open a JIRA for it.

– Ufuk

> On 25 Jan 2016, at 16:21, Stephan Ewen <se...@apache.org> wrote:
> 
> Hi Niels!
> 
> There is a slight mismatch between your thoughts and the current design, but 
> not much.
> 
> What you describe (at the start of the job, the latest checkpoint is 
> automatically loaded) is basically what the high-availability setup does if 
> the master dies. The new master loads all jobs and continues them from the 
> latest checkpoint.
> If you run an HA setup, and you stop/restart your jobs not by stopping the 
> jobs, but by killing the cluster, you should get that behavior.
> 
> Once a job is properly stopped and you start a new job, there is no way for 
> Flink to tell that this is in fact the same job and that it should resume from 
> where the previous one stopped. Also, "same" should be a fuzzy "same", to allow 
> for slight changes in the job (bug fixes). Savepoints let you put the 
> persistent part of the job somewhere, to tell a new job where to pick up from.
>   - Makes it work in non-HA setups
>   - Allows you to keep multiple savepoints (like "versions", say one per day 
> or so) to roll back to
>   - Can have multiple versions of the same job resuming from one savepoint 
> (what-if or A/B tests, or seamless version upgrades)
> 
> 
> There is something on the roadmap that would make your use case very easy: 
> "StopWithSavepoint"
> 
> There is an open pull request to cleanly stop() a streaming program. The next 
> enhancement is to stop it and let it draw a savepoint as part of that. Then 
> you can simply script a stop/start like that:
> 
> # stop with savepoint
> bin/flink stop -s <random-directory> jobid
> 
> # resume
> bin/flink run -s <random-directory> job
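Once that lands, the round trip could be wrapped in a small script. A sketch, under the assumption (purely hypothetical, since the feature is still a roadmap item) that the stop command would print the savepoint location on stdout in a line like "Savepoint stored in <path>":

```shell
# Hypothetical wrapper around the planned stop-with-savepoint workflow.
# The "Savepoint stored in <path>" output format is an assumption, not
# an actual interface.

extract_savepoint_path() {
    # Pull the path out of the (assumed) stop output; print nothing otherwise.
    sed -n 's/^Savepoint stored in //p'
}

# Intended round trip (commented out -- needs a cluster and the
# not-yet-implemented stop -s flag):
#   SAVEPOINT=$(bin/flink stop -s /tmp/savepoints "$JOBID" | extract_savepoint_path)
#   bin/flink run -s "$SAVEPOINT" my-job.jar

# Quick check against a fake output line:
echo "Savepoint stored in /tmp/savepoints/savepoint-123" | extract_savepoint_path
```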
> 
> 
> Hope that helps,
> Stephan
> 
> 
> On Fri, Jan 22, 2016 at 3:06 PM, Niels Basjes <ni...@basjes.nl> wrote:
> Hi,
> 
> @Max: Thanks for the new URL. I noticed that a lot (in fact almost all) of 
> links in the new manuals lead to 404 errors. Maybe you should run an 
> automated test to find them all.
> 
> I did a bit of reading about savepoints and saw that they are in fact 
> described as "Allow to trigger checkpoints manually".
> 
> Let me sketch what I think I need:
> 1) I need recovery of the topology in case of partial failure (i.e. a single 
> node dies).
> 2) I need recovery of the topology in case of full topology failure (i.e. 
> Hadoop security tokens cause the entire thing to die, or I need to deploy a 
> fixed version of my software).
> 
> Now what I understand is that the checkpoints are managed by Flink and as 
> such allow me to run the topology without any manual actions. These are 
> cleaned automatically when no longer needed.
> These savepoints, however, appear to need external 'intervention'; they are 
> intended as 'manual'. So in addition to my topology I need something extra 
> that periodically (e.g. every minute) fires a command to persist a checkpoint 
> into a savepoint and to clean up the 'old' ones.
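The cleanup half of such an external loop is easy to sketch. Assuming each triggered savepoint lands in its own subdirectory of a common root, with names that sort chronologically (that layout is my assumption, not something Flink prescribes), keeping only the newest N might look like:

```shell
# Hypothetical retention step for a periodic-savepoint loop: keep the
# $KEEP newest savepoint directories under $SP_ROOT, remove the rest.
# Assumes directory names sort chronologically (e.g. timestamps).

cleanup_old_savepoints() {
    # List names newest-first, skip the first $KEEP, delete what remains.
    ls -1 "$SP_ROOT" | sort -r | tail -n +$((KEEP + 1)) | while read -r d; do
        rm -rf "$SP_ROOT/$d"
    done
}
```

The savepoint-triggering command itself would go in the same loop; a count-based policy is just one option, and a per-day scheme like Stephan mentions would filter on the date instead.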
> 
> What I want is something that works roughly as follows:
> 1) I configure everything (i.e. assign IDs, configure the checkpoint 
> directory, etc.)
> 2) The framework saves and cleans the checkpoints automatically when the 
> topology is running.
> 3) I simply start the topology without any special options.
> 
> My idea is essentially that at startup of a topology the system looks at the 
> configured checkpoint persistence and recovers the most recent one.
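Until something like that exists in Flink itself, the lookup could live in a small launcher script. A sketch under the same assumption as before, namely that savepoints sit in subdirectories of one root whose names sort chronologically (a convention of mine, not Flink's):

```shell
# Sketch of "resume from the most recent savepoint" as a launcher script.
# Assumes savepoint directory names under $SP_ROOT sort chronologically.

latest_savepoint() {
    # Print the newest savepoint directory name, or nothing if there is none.
    ls -1 "$SP_ROOT" | sort | tail -n 1
}

# Intended usage (commented out -- needs a running cluster):
#   LATEST=$(latest_savepoint)
#   if [ -n "$LATEST" ]; then
#       bin/flink run -s "$SP_ROOT/$LATEST" my-job.jar
#   else
#       bin/flink run my-job.jar    # first start: nothing to resume from
#   fi
```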
> 
> Apparently there is a mismatch between what I think is useful and what has 
> been implemented so far. 
> Am I missing something or should I submit this as a Jira ticket for a later 
> version?
> 
> Niels Basjes
> 
> 
> On Mon, Jan 18, 2016 at 12:13 PM, Maximilian Michels <m...@apache.org> wrote:
> The documentation layout changed in the master. Then new URL:
> https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/savepoints.html
> 
> On Thu, Jan 14, 2016 at 2:21 PM, Niels Basjes <ni...@basjes.nl> wrote:
> > Yes, that is exactly the type of solution I was looking for.
> >
> > I'll dive into this.
> > Thanks guys!
> >
> > Niels
> >
> > On Thu, Jan 14, 2016 at 11:55 AM, Ufuk Celebi <u...@apache.org> wrote:
> >>
> >> Hey Niels,
> >>
> >> as Gabor wrote, this feature has been merged to the master branch
> >> recently.
> >>
> >> The docs are online here:
> >> https://ci.apache.org/projects/flink/flink-docs-master/apis/savepoints.html
> >>
> >> Feel free to report back your experience with it if you give it a try.
> >>
> >> – Ufuk
> >>
> >> > On 14 Jan 2016, at 11:09, Gábor Gévay <gga...@gmail.com> wrote:
> >> >
> >> > Hello,
> >> >
> >> > You are probably looking for this feature:
> >> > https://issues.apache.org/jira/browse/FLINK-2976
> >> >
> >> > Best,
> >> > Gábor
> >> >
> >> >
> >> >
> >> >
> >> > 2016-01-14 11:05 GMT+01:00 Niels Basjes <ni...@basjes.nl>:
> >> >> Hi,
> >> >>
> >> >> I'm working on a streaming application using Flink.
> >> >> Several steps in the processing are stateful (I use custom Windows
> >> >> and stateful operators).
> >> >>
> >> >> Now if during a normal run a worker fails, the checkpointing system
> >> >> will be used to recover.
> >> >>
> >> >> But what if the entire application is stopped (deliberately) or
> >> >> stops/fails
> >> >> because of a problem?
> >> >>
> >> >> At this moment I have three main reasons/causes for doing this:
> >> >> 1) The application just dies because of a bug on my side or a problem
> >> >> like
> >> >> for example this (which I'm actually confronted with):  Failed to
> >> >> Update
> >> >> HDFS Delegation Token for long running application in HA mode
> >> >> https://issues.apache.org/jira/browse/HDFS-9276
> >> >> 2) I need to rebalance my application (i.e. stop, change parallelism,
> >> >> start)
> >> >> 3) I need a new version of my software to be deployed. (i.e. I fixed a
> >> >> bug,
> >> >> changed the topology and need to continue)
> >> >>
> >> >> I assume the solution will in some part be specific to my
> >> >> application.
> >> >> The question is what features exist in Flink to support such a clean
> >> >> "continue where I left off" scenario?
> >> >>
> >> >> --
> >> >> Best regards / Met vriendelijke groeten,
> >> >>
> >> >> Niels Basjes
> >>
> >
> >
> >
> > --
> > Best regards / Met vriendelijke groeten,
> >
> > Niels Basjes
> 
> 
> 
> -- 
> Best regards / Met vriendelijke groeten,
> 
> Niels Basjes
> 
