Hey Niels! Stephan gave a very good summary of the current state of things. What do you think of the outlined stop with savepoint method?
Regarding the broken links: I’ve fixed various broken links in the master docs yesterday. If you encounter something again, feel free to post it to the ML or open a JIRA for it. – Ufuk > On 25 Jan 2016, at 16:21, Stephan Ewen <se...@apache.org> wrote: > > Hi Niels! > > There is a slight mismatch between your thoughts and the current design, but > not much. > > What you describe (at the start of the job, the latest checkpoint is > automatically loaded) is basically what the high-availability setup does if > the master dies. The new master loads all jobs and continues them from the > latest checkpoint. > If you run an HA setup, and you stop/restart your jobs not by stopping the > jobs, but by killing the cluster, you should get that behavior. > > Once a job is properly stopped, and you start a new job, there is no way for > Flink to tell that this is in fact the same job and it should resume from > where the recently stopped. Also, "same" should be a fuzzy "same", to allow > for slight changes in the job (bug fixes). Safepoints let you put the > persistent part of the job somewhere, to tell a new job where to pick up from. > - Makes it work in non-HA setups > - Allows you to keep multiple savepoint (like "versions", say one per day > or so) to roll back to > - Can have multiple versions of the same jobs resuming from one savepoint > (what-if or A/B tests, or seamless version upgrades) > > > There is something on the roadmap that would make your use case very easy: > "StopWithSavepoint" > > There is an open pull request to cleanly stop() a streaming program. The next > enhancement is to stop it and let it draw a savepoint as part of that. Then > you can simply script a stop/start like that: > > # stop with savepoint > bin/flink stop -s <random-directory> jobid > > # resume > bin/flink run -s <random-directory> job > > > Hope that helps, > Stephan > > > On Fri, Jan 22, 2016 at 3:06 PM, Niels Basjes <ni...@basjes.nl> wrote: > Hi, > > @Max: Thanks for the new URL. I noticed that a lot (in fact almost all) of > links in the new manuals lead to 404 errors. Maybe you should run an > automated test to find them all. > > I did a bit of reading about the savepoints and that in fact they are written > as "Allow to trigger checkpoints manually". > > Let me sketch what I think I need: > 1) I need recovery of the topology in case of partial failure (i.e. a single > node dies). > 2) I need recovery of the topology in case of full topology failure (i.e. > Hadoop security tokens cause the entire thing to die, or I need to deploy a > fixed version of my software). > > Now what I understand is that the checkpoints are managed by Flink and as > such allow me to run the topology without any manual actions. These are > cleaned automatically when no longer needed. > These savepoints however appear to need external 'intervention'; they are > intended as 'manual'. So in addition to my topology I need something extra > that periodically (i.e. every minute) fires a command to persist a checkpoint > into a savepoint and to cleanup the 'old' ones. > > What I want is something that works roughly as follows: > 1) I configure everything (i.e. assign Ids configure the checkpoint > directory, etc.) > 2) The framework saves and cleans the checkpoints automatically when the > topology is running. > 3) I simply start the topology without any special options. > > My idea is essentially that at the startup of a topology the system looks at > the configured checkpoint persistance and recovers the most recent one. > > Apparently there is a mismatch between what I think is useful and what has > been implemented so far. > Am I missing something or should I submit this as a Jira ticket for a later > version? > > Niels Basjes > > > > > > > On Mon, Jan 18, 2016 at 12:13 PM, Maximilian Michels <m...@apache.org> wrote: > The documentation layout changed in the master. Then new URL: > https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/savepoints.html > > On Thu, Jan 14, 2016 at 2:21 PM, Niels Basjes <ni...@basjes.nl> wrote: > > Yes, that is exactly the type of solution I was looking for. > > > > I'll dive into this. > > Thanks guys! > > > > Niels > > > > On Thu, Jan 14, 2016 at 11:55 AM, Ufuk Celebi <u...@apache.org> wrote: > >> > >> Hey Niels, > >> > >> as Gabor wrote, this feature has been merged to the master branch > >> recently. > >> > >> The docs are online here: > >> https://ci.apache.org/projects/flink/flink-docs-master/apis/savepoints.html > >> > >> Feel free to report back your experience with it if you give it a try. > >> > >> – Ufuk > >> > >> > On 14 Jan 2016, at 11:09, Gábor Gévay <gga...@gmail.com> wrote: > >> > > >> > Hello, > >> > > >> > You are probably looking for this feature: > >> > https://issues.apache.org/jira/browse/FLINK-2976 > >> > > >> > Best, > >> > Gábor > >> > > >> > > >> > > >> > > >> > 2016-01-14 11:05 GMT+01:00 Niels Basjes <ni...@basjes.nl>: > >> >> Hi, > >> >> > >> >> I'm working on a streaming application using Flink. > >> >> Several steps in the processing are state-full (I use custom Windows > >> >> and > >> >> state-full operators ). > >> >> > >> >> Now if during a normal run an worker fails the checkpointing system > >> >> will be > >> >> used to recover. > >> >> > >> >> But what if the entire application is stopped (deliberately) or > >> >> stops/fails > >> >> because of a problem? > >> >> > >> >> At this moment I have three main reasons/causes for doing this: > >> >> 1) The application just dies because of a bug on my side or a problem > >> >> like > >> >> for example this (which I'm actually confronted with): Failed to > >> >> Update > >> >> HDFS Delegation Token for long running application in HA mode > >> >> https://issues.apache.org/jira/browse/HDFS-9276 > >> >> 2) I need to rebalance my application (i.e. stop, change parallelism, > >> >> start) > >> >> 3) I need a new version of my software to be deployed. (i.e. I fixed a > >> >> bug, > >> >> changed the topology and need to continue) > >> >> > >> >> I assume the solution will be in some part be specific for my > >> >> application. > >> >> The question is what features exist in Flink to support such a clean > >> >> "continue where I left of" scenario? > >> >> > >> >> -- > >> >> Best regards / Met vriendelijke groeten, > >> >> > >> >> Niels Basjes > >> > > > > > > > > -- > > Best regards / Met vriendelijke groeten, > > > > Niels Basjes > > > > -- > Best regards / Met vriendelijke groeten, > > Niels Basjes >