Re: Redeployements and state

Niels Basjes Fri, 22 Jan 2016 06:07:45 -0800

Hi,

@Max: Thanks for the new URL. I noticed that a lot (in fact almost all) of
the links in the new manuals lead to 404 errors. Maybe you should run an
automated test to find them all.

I did a bit of reading about savepoints and found that they are in fact
described as "Allow to trigger checkpoints manually".

Let me sketch what I think I need:
1) I need recovery of the topology in case of partial failure (e.g. a
single node dies).
2) I need recovery of the topology in case of full topology failure (e.g.
Hadoop security tokens cause the entire thing to die, or I need to deploy a
fixed version of my software).

Now what I understand is that the checkpoints are managed by Flink and as
such allow me to run the topology without any manual actions. These are
cleaned automatically when no longer needed.
These savepoints, however, appear to need external 'intervention'; they are
intended as 'manual'. So in addition to my topology I need something extra
that periodically (e.g. every minute) fires a command to persist a
checkpoint into a savepoint and to clean up the 'old' ones.
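
To make that 'something extra' concrete, here is a minimal sketch of such an
external helper. It rests on assumptions: the CLI forms `bin/flink savepoint
<jobID>` and `bin/flink savepoint -d <path>` are how I read the savepoint
docs, and `extract_path` is a hypothetical callback that pulls the savepoint
path out of the CLI output (I am not assuming any particular output format).
The names `rotate`, `run_forever` and `FLINK_BIN` are mine, not Flink's.

```python
import subprocess
import time
from typing import Callable, List, Tuple

FLINK_BIN = "bin/flink"  # assumption: location of the Flink CLI


def trigger_command(job_id: str) -> List[str]:
    # Assumed CLI form for triggering a savepoint of a running job.
    return [FLINK_BIN, "savepoint", job_id]


def dispose_command(savepoint_path: str) -> List[str]:
    # Assumed CLI form for disposing of a savepoint that is no longer needed.
    return [FLINK_BIN, "savepoint", "-d", savepoint_path]


def rotate(history: List[str], new_path: str, keep: int) -> Tuple[List[str], List[str]]:
    """Append new_path to history and return (kept, expired), oldest first."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    combined = history + [new_path]
    return combined[-keep:], combined[:-keep]


def run_forever(job_id: str,
                extract_path: Callable[[str], str],
                interval_s: float = 60.0,
                keep: int = 3) -> None:
    """Trigger a savepoint every interval_s seconds and dispose of old ones."""
    history: List[str] = []
    while True:
        result = subprocess.run(trigger_command(job_id), check=True,
                                capture_output=True, text=True)
        # extract_path is hypothetical: it must map CLI output to a path.
        history, expired = rotate(history, extract_path(result.stdout), keep)
        for path in expired:
            subprocess.run(dispose_command(path), check=True)
        time.sleep(interval_s)
```

The point of the sketch is mainly how much moving machinery sits outside the
topology: a scheduler, a retention policy and output parsing, all of which
have to be deployed and monitored separately from the job itself.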

What I want is something that works roughly as follows:
1) I configure everything (i.e. assign IDs, configure the checkpoint
directory, etc.)
2) The framework saves and cleans the checkpoints automatically when the
topology is running.
3) I simply start the topology without any special options.

My idea is essentially that at the startup of a topology the system looks
at the configured checkpoint persistence and recovers the most recent one.
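
As a rough illustration of the startup behaviour I have in mind (this is the
behaviour I am asking for, not something Flink does today as far as I can
tell), a sketch under several assumptions: the checkpoint directory is on a
local filesystem rather than HDFS, each entry in it is one complete
checkpoint, modification time is a good enough proxy for "most recent", and
`bin/flink run -s <path>` is the way to resume from a stored state. The
function names are hypothetical.

```python
import os
from typing import List, Optional


def latest_checkpoint(checkpoint_dir: str) -> Optional[str]:
    """Return the most recently modified entry in checkpoint_dir, if any."""
    try:
        entries = [os.path.join(checkpoint_dir, name)
                   for name in os.listdir(checkpoint_dir)]
    except FileNotFoundError:
        return None  # nothing configured yet: treat as a fresh start
    if not entries:
        return None
    return max(entries, key=os.path.getmtime)


def startup_command(jar: str, checkpoint_dir: str) -> List[str]:
    """Build the start command, resuming from the latest state when present."""
    command = ["bin/flink", "run"]  # assumption: CLI entry point
    latest = latest_checkpoint(checkpoint_dir)
    if latest is not None:
        command += ["-s", latest]  # assumption: resume flag
    return command + [jar]
```

With an empty or missing directory this degrades to a plain start, which is
what step 3 above needs: the same command works for the first deployment
and for every restart after it.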

Apparently there is a mismatch between what I think is useful and what has
been implemented so far.
Am I missing something or should I submit this as a Jira ticket for a later
version?

Niels Basjes

On Mon, Jan 18, 2016 at 12:13 PM, Maximilian Michels <m...@apache.org> wrote:

> The documentation layout changed in the master. The new URL:
>
> https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/savepoints.html
>
> On Thu, Jan 14, 2016 at 2:21 PM, Niels Basjes <ni...@basjes.nl> wrote:
> > Yes, that is exactly the type of solution I was looking for.
> >
> > I'll dive into this.
> > Thanks guys!
> >
> > Niels
> >
> > On Thu, Jan 14, 2016 at 11:55 AM, Ufuk Celebi <u...@apache.org> wrote:
> >>
> >> Hey Niels,
> >>
> >> as Gabor wrote, this feature has been merged to the master branch
> >> recently.
> >>
> >> The docs are online here:
> >>
> >> https://ci.apache.org/projects/flink/flink-docs-master/apis/savepoints.html
> >>
> >> Feel free to report back your experience with it if you give it a try.
> >>
> >> – Ufuk
> >>
> >> > On 14 Jan 2016, at 11:09, Gábor Gévay <gga...@gmail.com> wrote:
> >> >
> >> > Hello,
> >> >
> >> > You are probably looking for this feature:
> >> > https://issues.apache.org/jira/browse/FLINK-2976
> >> >
> >> > Best,
> >> > Gábor
> >> >
> >> >
> >> >
> >> >
> >> > 2016-01-14 11:05 GMT+01:00 Niels Basjes <ni...@basjes.nl>:
> >> >> Hi,
> >> >>
> >> >> I'm working on a streaming application using Flink.
> >> >> Several steps in the processing are stateful (I use custom Windows and
> >> >> stateful operators).
> >> >>
> >> >> Now if a worker fails during a normal run, the checkpointing system will
> >> >> be used to recover.
> >> >>
> >> >> But what if the entire application is stopped (deliberately) or
> >> >> stops/fails because of a problem?
> >> >>
> >> >> At this moment I have three main reasons/causes for doing this:
> >> >> 1) The application just dies because of a bug on my side or a problem
> >> >> like, for example, this one (which I'm actually confronted with): Failed
> >> >> to Update HDFS Delegation Token for long running application in HA mode
> >> >> https://issues.apache.org/jira/browse/HDFS-9276
> >> >> 2) I need to rebalance my application (i.e. stop, change parallelism,
> >> >> start)
> >> >> 3) I need a new version of my software to be deployed (e.g. I fixed a
> >> >> bug, changed the topology and need to continue).
> >> >>
> >> >> I assume the solution will in some part be specific to my application.
> >> >> The question is: what features exist in Flink to support such a clean
> >> >> "continue where I left off" scenario?
> >> >>
> >> >> --
> >> >> Best regards / Met vriendelijke groeten,
> >> >>
> >> >> Niels Basjes
> >>
> >
> >
> >
> > --
> > Best regards / Met vriendelijke groeten,
> >
> > Niels Basjes
>

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes
