Re: Force enabling checkpoints for iterative streaming jobs

Gyula Fóra Wed, 10 Jun 2015 05:31:18 -0700

Max suggested that I add this feature slightly hidden to the execution
config instance.


The problem then is that I either make a public field in the config or once
again add a method.

Any ideas?

Aljoscha Krettek <aljos...@apache.org> ezt írta (időpont: 2015. jún. 10.,
Sze, 14:07):

> Thanks :D, now I see. It makes sense because we don't have another way
> of keeping the cluster state synced/distributed across parallel
> instances of the operators.
>
> On Wed, Jun 10, 2015 at 12:52 PM, Gyula Fóra <gyula.f...@gmail.com> wrote:
> > Here is an example for you:
> >
> > Parallel streaming kmeans, the state we keep is the current cluster
> > centers, and we use iterations to sync the centers across parallel
> > instances.
> > We can afford lost model updated in the loop but we need the checkpoint
> the
> > models.
> >
> >
> https://github.com/gyfora/stream-clustering/blob/master/src/main/scala/stream/clustering/StreamClustering.scala
> >
> > (checkpointing is not turned on but you will get the point)
> >
> >
> >
> > Gyula Fóra <gyula.f...@gmail.com> ezt írta (időpont: 2015. jún. 10.,
> Sze,
> > 12:47):
> >
> >> You are right, to have consistent results we would need to persist the
> >> records.
> >>
> >> But since we cannot do that right now, we can still checkpoint all
> >> operator states and understand that inflight records in the loop are
> lost
> >> on failure.
> >>
> >> This is acceptable for most the use-cases that we have developed so far
> >> for iterations (machine learning, graph updates, etc.) What is not
> >> acceptable is to not have checkpointing at all.
> >>
> >> Aljoscha Krettek <aljos...@apache.org> ezt írta (időpont: 2015. jún.
> 10.,
> >> Sze, 12:43):
> >>
> >>> The elements that are in-flight in an iteration are also state of the
> >>> job. I'm wondering whether the state inside iterations still makes
> >>> sense without these in-flight elements. But I also don't know the King
> >>> use-case, that's why I though an example could be helpful.
> >>>
> >>> On Wed, Jun 10, 2015 at 12:37 PM, Gyula Fóra <gyula.f...@gmail.com>
> >>> wrote:
> >>> > I don't understand the question, I vote for checkpointing all state
> in
> >>> the
> >>> > job, even inside iterations (its more of a loop).
> >>> >
> >>> > Aljoscha Krettek <aljos...@apache.org> ezt írta (időpont: 2015. jún.
> >>> 10.,
> >>> > Sze, 12:34):
> >>> >
> >>> >> I don't understand why having the state inside an iteration but not
> >>> >> the elements that correspond to this state or created this state is
> >>> >> desirable. Maybe an example could help understand this better?
> >>> >>
> >>> >> On Wed, Jun 10, 2015 at 11:27 AM, Gyula Fóra <gyula.f...@gmail.com>
> >>> wrote:
> >>> >> > The other tests verify that the checkpointing algorithm runs
> >>> properly.
> >>> >> That
> >>> >> > also ensures that it runs for iterations because a loop is just an
> >>> extra
> >>> >> > source and sink in the jobgraph (so it is the same for the
> >>> algorithm).
> >>> >> >
> >>> >> > Fabian Hueske <fhue...@gmail.com> ezt írta (időpont: 2015. jún.
> 10.,
> >>> >> Sze,
> >>> >> > 11:19):
> >>> >> >
> >>> >> >> Without going into the details, how well tested is this feature?
> >>> The PR
> >>> >> >> only extends one test by a few lines.
> >>> >> >>
> >>> >> >> Is that really enough to ensure that
> >>> >> >> 1) the change does not cause trouble
> >>> >> >> 2) is working as expected
> >>> >> >>
> >>> >> >> If this feature should go into the release, it must be thoroughly
> >>> >> checked
> >>> >> >> and we must take the time for that.
> >>> >> >> Including code and hoping for the best because time is scarce is
> >>> not an
> >>> >> >> option IMO.
> >>> >> >>
> >>> >> >> Fabian
> >>> >> >>
> >>> >> >>
> >>> >> >> 2015-06-10 11:05 GMT+02:00 Gyula Fóra <gyula.f...@gmail.com>:
> >>> >> >>
> >>> >> >> > And also I would like to remind everyone that any fault
> tolerance
> >>> we
> >>> >> >> > provide is only as good as the fault tolerance of the master
> node.
> >>> >> Which
> >>> >> >> is
> >>> >> >> > non existent at the moment.
> >>> >> >> >
> >>> >> >> > So I don't see a reason why a user should not be able to choose
> >>> >> whether
> >>> >> >> he
> >>> >> >> > wants state checkpoints for iterations as well.
> >>> >> >> >
> >>> >> >> > In any case this will be used by King for instance, so making
> it
> >>> part
> >>> >> of
> >>> >> >> > the release would save a lot of work for everyone.
> >>> >> >> >
> >>> >> >> > Paris Carbone <par...@kth.se> ezt írta (időpont: 2015. jún.
> 10.,
> >>> Sze,
> >>> >> >> > 10:29):
> >>> >> >> >
> >>> >> >> > >
> >>> >> >> > > To continue Gyula's point, for consistent snapshots we need
> to
> >>> >> persist
> >>> >> >> > the
> >>> >> >> > > records in transit within the loop  and also slightly change
> the
> >>> >> >> current
> >>> >> >> > > protocol since it works only for DAGs. Before going into that
> >>> >> direction
> >>> >> >> > > though I would propose we first see whether there is a nice
> way
> >>> to
> >>> >> make
> >>> >> >> > > iterations more structured.
> >>> >> >> > >
> >>> >> >> > > Paris
> >>> >> >> > > ________________________________________
> >>> >> >> > > From: Gyula Fóra <gyula.f...@gmail.com>
> >>> >> >> > > Sent: Wednesday, June 10, 2015 10:19 AM
> >>> >> >> > > To: dev@flink.apache.org
> >>> >> >> > > Subject: Re: Force enabling checkpoints for iterative
> streaming
> >>> jobs
> >>> >> >> > >
> >>> >> >> > > I disagree. Not having checkpointed operators inside the
> >>> iteration
> >>> >> >> still
> >>> >> >> > > breaks the guarantees.
> >>> >> >> > >
> >>> >> >> > > It is not about the states it is about the loop itself.
> >>> >> >> > > On Wed, Jun 10, 2015 at 10:12 AM Aljoscha Krettek <
> >>> >> aljos...@apache.org
> >>> >> >> >
> >>> >> >> > > wrote:
> >>> >> >> > >
> >>> >> >> > > > This is the answer I gave on the PR (we should have one
> place
> >>> for
> >>> >> >> > > > discussing this, though):
> >>> >> >> > > >
> >>> >> >> > > > I would be against merging this in the current form. What I
> >>> >> propose
> >>> >> >> is
> >>> >> >> > > > to analyse the topology to verify that there are no
> >>> checkpointed
> >>> >> >> > > > operators inside iterations. Operators before and after
> >>> iterations
> >>> >> >> can
> >>> >> >> > > > be checkpointed and we can safely allow the user to enable
> >>> >> >> > > > checkpointing.
> >>> >> >> > > >
> >>> >> >> > > > If we have the code to analyse which operators are inside
> >>> >> iterations
> >>> >> >> > > > we could also disallow windows inside iterations. I think
> >>> windows
> >>> >> >> > > > inside iterations don't make sense since elements in
> different
> >>> >> >> > > > "iterations" would end up in the same window. Maybe I'm
> wrong
> >>> here
> >>> >> >> > > > though, then please correct me.
> >>> >> >> > > >
> >>> >> >> > > > On Wed, Jun 10, 2015 at 10:08 AM, Márton Balassi
> >>> >> >> > > > <balassi.mar...@gmail.com> wrote:
> >>> >> >> > > > > I agree that for the sake of the above mentioned use
> cases
> >>> it is
> >>> >> >> > > > reasonable
> >>> >> >> > > > > to add this to the release with the right documentation,
> for
> >>> >> >> machine
> >>> >> >> > > > > learning potentially loosing one round of feedback data
> >>> should
> >>> >> not
> >>> >> >> > > > matter.
> >>> >> >> > > > >
> >>> >> >> > > > > Let us not block prominent users until the next release
> on
> >>> this.
> >>> >> >> > > > >
> >>> >> >> > > > > On Wed, Jun 10, 2015 at 8:09 AM, Gyula Fóra <
> >>> >> gyula.f...@gmail.com>
> >>> >> >> > > > wrote:
> >>> >> >> > > > >
> >>> >> >> > > > >> As for people currently suffering from it:
> >>> >> >> > > > >>
> >>> >> >> > > > >> An application King is developing requires iterations,
> and
> >>> they
> >>> >> >> need
> >>> >> >> > > > >> checkpoints. Practically all SAMOA programs would need
> >>> this.
> >>> >> >> > > > >>
> >>> >> >> > > > >> It is very likely that the state interfaces will be
> changed
> >>> >> after
> >>> >> >> > the
> >>> >> >> > > > >> release, so this is not something that we can just add
> >>> later. I
> >>> >> >> > don't
> >>> >> >> > > > see a
> >>> >> >> > > > >> reason why we should not add it, as it is clearly
> >>> documented.
> >>> >> In
> >>> >> >> > this
> >>> >> >> > > > >> actual case not having guarantees at all means people
> will
> >>> >> never
> >>> >> >> use
> >>> >> >> > > it
> >>> >> >> > > > in
> >>> >> >> > > > >> any production system. Having limited guarantees means
> >>> that it
> >>> >> >> will
> >>> >> >> > > > depend
> >>> >> >> > > > >> on the application.
> >>> >> >> > > > >>
> >>> >> >> > > > >> On Wed, Jun 10, 2015 at 12:53 AM, Ufuk Celebi <
> >>> u...@apache.org>
> >>> >> >> > wrote:
> >>> >> >> > > > >>
> >>> >> >> > > > >> > Hey Gyula,
> >>> >> >> > > > >> >
> >>> >> >> > > > >> > I understand your reasoning, but I don't think its
> worth
> >>> to
> >>> >> rush
> >>> >> >> > > this
> >>> >> >> > > > >> into
> >>> >> >> > > > >> > the release.
> >>> >> >> > > > >> >
> >>> >> >> > > > >> > As you've said, we cannot give precise guarantees. But
> >>> this
> >>> >> is
> >>> >> >> > > > arguably
> >>> >> >> > > > >> > one of the key requirements for any fault tolerance
> >>> >> mechanism.
> >>> >> >> > > > Therefore
> >>> >> >> > > > >> I
> >>> >> >> > > > >> > disagree that this is better than not having anything
> at
> >>> >> all. I
> >>> >> >> > > think
> >>> >> >> > > > it
> >>> >> >> > > > >> > will already go a long way to have the non-iterative
> case
> >>> >> >> working
> >>> >> >> > > > >> reliably.
> >>> >> >> > > > >> >
> >>> >> >> > > > >> > And as far as I know there are no users really
> suffering
> >>> from
> >>> >> >> this
> >>> >> >> > > at
> >>> >> >> > > > the
> >>> >> >> > > > >> > moment (in the sense that someone has complained on
> the
> >>> >> mailing
> >>> >> >> > > list).
> >>> >> >> > > > >> >
> >>> >> >> > > > >> > Hence, I vote to postpone this.
> >>> >> >> > > > >> >
> >>> >> >> > > > >> > – Ufuk
> >>> >> >> > > > >> >
> >>> >> >> > > > >> > On 10 Jun 2015, at 00:19, Gyula Fóra <
> gyf...@apache.org>
> >>> >> wrote:
> >>> >> >> > > > >> >
> >>> >> >> > > > >> > > Hey all,
> >>> >> >> > > > >> > >
> >>> >> >> > > > >> > > It is currently impossible to enable state
> >>> checkpointing
> >>> >> for
> >>> >> >> > > > iterative
> >>> >> >> > > > >> > > jobs, because en exception is thrown when creating
> the
> >>> >> >> jobgraph.
> >>> >> >> > > > This
> >>> >> >> > > > >> > > behaviour is motivated by the lack of precise
> >>> guarantees
> >>> >> that
> >>> >> >> we
> >>> >> >> > > can
> >>> >> >> > > > >> give
> >>> >> >> > > > >> > > with the current fault-tolerance implementations for
> >>> cyclic
> >>> >> >> > > graphs.
> >>> >> >> > > > >> > >
> >>> >> >> > > > >> > > This PR <https://github.com/apache/flink/pull/812>
> >>> adds an
> >>> >> >> > > optional
> >>> >> >> > > > >> > flag to
> >>> >> >> > > > >> > > force checkpoints even in case of iterations. The
> >>> algorithm
> >>> >> >> will
> >>> >> >> > > > take
> >>> >> >> > > > >> > > checkpoints periodically as before, but records in
> >>> transit
> >>> >> >> > inside
> >>> >> >> > > > the
> >>> >> >> > > > >> > loop
> >>> >> >> > > > >> > > will be lost.
> >>> >> >> > > > >> > >
> >>> >> >> > > > >> > > However even this guarantee is enough for most
> >>> applications
> >>> >> >> > > (Machine
> >>> >> >> > > > >> > > Learning for instance) and certainly much better
> than
> >>> not
> >>> >> >> having
> >>> >> >> > > > >> anything
> >>> >> >> > > > >> > > at all.
> >>> >> >> > > > >> > >
> >>> >> >> > > > >> > >
> >>> >> >> > > > >> > > I suggest we add this to the 0.9 release as
> currently
> >>> many
> >>> >> >> > > > applications
> >>> >> >> > > > >> > > suffer from this limitation (SAMOA, ML pipelines,
> graph
> >>> >> >> > streaming
> >>> >> >> > > > etc.)
> >>> >> >> > > > >> > >
> >>> >> >> > > > >> > >
> >>> >> >> > > > >> > > Cheers,
> >>> >> >> > > > >> > >
> >>> >> >> > > > >> > > Gyula
> >>> >> >> > > > >> >
> >>> >> >> > > > >> >
> >>> >> >> > > > >>
> >>> >> >> > > >
> >>> >> >> > >
> >>> >> >> >
> >>> >> >>
> >>> >>
> >>>
> >>
>

Re: Force enabling checkpoints for iterative streaming jobs

Reply via email to