Re: Force enabling checkpoints for iterative streaming jobs

Aljoscha Krettek Wed, 10 Jun 2015 05:08:06 -0700

Thanks :D, now I see. It makes sense because we don't have another way
of keeping the cluster state synced/distributed across parallel
instances of the operators.
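
To make the trade-off concrete, here is a minimal Python sketch of the behaviour discussed below. It is a toy model, not Flink code: the LoopJob class and all of its methods are hypothetical names made up for illustration. It assumes the checkpoint persists only operator state (the cluster centers) and that records travelling through the feedback loop are dropped on recovery.

```python
from collections import deque


class LoopJob:
    """Toy model (hypothetical): operator state plus a feedback-loop queue."""

    def __init__(self, centers):
        self.centers = list(centers)  # operator state, e.g. cluster centers
        self.in_flight = deque()      # records currently inside the loop
        self._snapshot = None         # last checkpointed operator state

    def emit_update(self, index, delta):
        # Send a model update around the loop.
        self.in_flight.append((index, delta))

    def apply_next(self):
        # Consume one record from the loop and fold it into the state.
        index, delta = self.in_flight.popleft()
        self.centers[index] += delta

    def checkpoint(self):
        # Only operator state is persisted; the loop queue is not.
        self._snapshot = list(self.centers)

    def recover(self):
        if self._snapshot is None:
            raise RuntimeError("no checkpoint taken yet")
        self.centers = list(self._snapshot)
        self.in_flight.clear()  # in-flight loop records are lost


job = LoopJob([0.0, 10.0])
job.emit_update(0, 1.0)
job.apply_next()          # update applied before the checkpoint
job.emit_update(1, -2.0)  # still in flight when the checkpoint is taken
job.checkpoint()
job.apply_next()          # applied after the checkpoint
job.recover()             # state rolls back; the -2.0 update is gone
```

After recover() the centers are back at their checkpointed values and the -2.0 update never reappears, which is the "lost model update" that the k-means example tolerates.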


On Wed, Jun 10, 2015 at 12:52 PM, Gyula Fóra <[email protected]> wrote:
> Here is an example for you:
>
> Parallel streaming kmeans, the state we keep is the current cluster
> centers, and we use iterations to sync the centers across parallel
> instances.
> We can afford to lose model updates in the loop, but we need to checkpoint
> the models.
>
> https://github.com/gyfora/stream-clustering/blob/master/src/main/scala/stream/clustering/StreamClustering.scala
>
> (checkpointing is not turned on but you will get the point)
>
>
>
> Gyula Fóra <[email protected]> wrote (on Wed, Jun 10, 2015, 12:47):
>
>> You are right, to have consistent results we would need to persist the
>> records.
>>
>> But since we cannot do that right now, we can still checkpoint all
>> operator states and understand that inflight records in the loop are lost
>> on failure.
>>
>> This is acceptable for most of the use cases that we have developed so far
>> for iterations (machine learning, graph updates, etc.). What is not
>> acceptable is to not have checkpointing at all.
>>
>> Aljoscha Krettek <[email protected]> wrote (on Wed, Jun 10, 2015, 12:43):
>>
>>> The elements that are in-flight in an iteration are also state of the
>>> job. I'm wondering whether the state inside iterations still makes
>>> sense without these in-flight elements. But I also don't know the King
>>> use-case, that's why I thought an example could be helpful.
>>>
>>> On Wed, Jun 10, 2015 at 12:37 PM, Gyula Fóra <[email protected]>
>>> wrote:
>>> > I don't understand the question. I vote for checkpointing all state in
>>> > the job, even inside iterations (it's more of a loop).
>>> >
>>> > Aljoscha Krettek <[email protected]> wrote (on Wed, Jun 10, 2015, 12:34):
>>> >
>>> >> I don't understand why having the state inside an iteration but not
>>> >> the elements that correspond to this state or created this state is
>>> >> desirable. Maybe an example could help understand this better?
>>> >>
>>> >> On Wed, Jun 10, 2015 at 11:27 AM, Gyula Fóra <[email protected]> wrote:
>>> >> > The other tests verify that the checkpointing algorithm runs properly.
>>> >> > That also ensures that it runs for iterations because a loop is just an
>>> >> > extra source and sink in the jobgraph (so it is the same for the
>>> >> > algorithm).
>>> >> >
>>> >> > Fabian Hueske <[email protected]> wrote (on Wed, Jun 10, 2015, 11:19):
>>> >> >
>>> >> >> Without going into the details, how well tested is this feature? The PR
>>> >> >> only extends one test by a few lines.
>>> >> >>
>>> >> >> Is that really enough to ensure that
>>> >> >> 1) the change does not cause trouble
>>> >> >> 2) is working as expected
>>> >> >>
>>> >> >> If this feature should go into the release, it must be thoroughly checked
>>> >> >> and we must take the time for that.
>>> >> >> Including code and hoping for the best because time is scarce is not an
>>> >> >> option IMO.
>>> >> >>
>>> >> >> Fabian
>>> >> >>
>>> >> >>
>>> >> >> 2015-06-10 11:05 GMT+02:00 Gyula Fóra <[email protected]>:
>>> >> >>
>>> >> >> > And also I would like to remind everyone that any fault tolerance we
>>> >> >> > provide is only as good as the fault tolerance of the master node, which
>>> >> >> > is non-existent at the moment.
>>> >> >> >
>>> >> >> > So I don't see a reason why a user should not be able to choose whether
>>> >> >> > he wants state checkpoints for iterations as well.
>>> >> >> >
>>> >> >> > In any case this will be used by King for instance, so making it part of
>>> >> >> > the release would save a lot of work for everyone.
>>> >> >> >
>>> >> >> > Paris Carbone <[email protected]> wrote (on Wed, Jun 10, 2015, 10:29):
>>> >> >> >
>>> >> >> > >
>>> >> >> > > To continue Gyula's point, for consistent snapshots we need to persist
>>> >> >> > > the records in transit within the loop and also slightly change the
>>> >> >> > > current protocol, since it works only for DAGs. Before going in that
>>> >> >> > > direction, though, I would propose we first see whether there is a nice
>>> >> >> > > way to make iterations more structured.
>>> >> >> > >
>>> >> >> > > Paris
>>> >> >> > > ________________________________________
>>> >> >> > > From: Gyula Fóra <[email protected]>
>>> >> >> > > Sent: Wednesday, June 10, 2015 10:19 AM
>>> >> >> > > To: [email protected]
>>> >> >> > > Subject: Re: Force enabling checkpoints for iterative streaming jobs
>>> >> >> > >
>>> >> >> > > I disagree. Not having checkpointed operators inside the iteration still
>>> >> >> > > breaks the guarantees.
>>> >> >> > >
>>> >> >> > > It is not about the states; it is about the loop itself.
>>> >> >> > > On Wed, Jun 10, 2015 at 10:12 AM Aljoscha Krettek <[email protected]>
>>> >> >> > > wrote:
>>> >> >> > >
>>> >> >> > > > This is the answer I gave on the PR (we should have one place for
>>> >> >> > > > discussing this, though):
>>> >> >> > > >
>>> >> >> > > > I would be against merging this in the current form. What I propose is
>>> >> >> > > > to analyse the topology to verify that there are no checkpointed
>>> >> >> > > > operators inside iterations. Operators before and after iterations can
>>> >> >> > > > be checkpointed and we can safely allow the user to enable
>>> >> >> > > > checkpointing.
>>> >> >> > > >
>>> >> >> > > > If we have the code to analyse which operators are inside iterations
>>> >> >> > > > we could also disallow windows inside iterations. I think windows
>>> >> >> > > > inside iterations don't make sense since elements in different
>>> >> >> > > > "iterations" would end up in the same window. Maybe I'm wrong here
>>> >> >> > > > though, then please correct me.
>>> >> >> > > >
>>> >> >> > > > On Wed, Jun 10, 2015 at 10:08 AM, Márton Balassi
>>> >> >> > > > <[email protected]> wrote:
>>> >> >> > > > > I agree that for the sake of the above mentioned use cases it is
>>> >> >> > > > > reasonable to add this to the release with the right documentation; for
>>> >> >> > > > > machine learning, potentially losing one round of feedback data should
>>> >> >> > > > > not matter.
>>> >> >> > > > >
>>> >> >> > > > > Let us not block prominent users until the next release on this.
>>> >> >> > > > >
>>> >> >> > > > > On Wed, Jun 10, 2015 at 8:09 AM, Gyula Fóra <[email protected]> wrote:
>>> >> >> > > > >
>>> >> >> > > > >> As for people currently suffering from it:
>>> >> >> > > > >>
>>> >> >> > > > >> An application King is developing requires iterations, and they need
>>> >> >> > > > >> checkpoints. Practically all SAMOA programs would need this.
>>> >> >> > > > >>
>>> >> >> > > > >> It is very likely that the state interfaces will be changed after the
>>> >> >> > > > >> release, so this is not something that we can just add later. I don't
>>> >> >> > > > >> see a reason why we should not add it, as it is clearly documented. In
>>> >> >> > > > >> this actual case not having guarantees at all means people will never
>>> >> >> > > > >> use it in any production system. Having limited guarantees means that it
>>> >> >> > > > >> will depend on the application.
>>> >> >> > > > >>
>>> >> >> > > > >> On Wed, Jun 10, 2015 at 12:53 AM, Ufuk Celebi <[email protected]> wrote:
>>> >> >> > > > >>
>>> >> >> > > > >> > Hey Gyula,
>>> >> >> > > > >> >
>>> >> >> > > > >> > I understand your reasoning, but I don't think it's worth rushing this
>>> >> >> > > > >> > into the release.
>>> >> >> > > > >> >
>>> >> >> > > > >> > As you've said, we cannot give precise guarantees. But this is arguably
>>> >> >> > > > >> > one of the key requirements for any fault tolerance mechanism. Therefore
>>> >> >> > > > >> > I disagree that this is better than not having anything at all. I think
>>> >> >> > > > >> > it will already go a long way to have the non-iterative case working
>>> >> >> > > > >> > reliably.
>>> >> >> > > > >> >
>>> >> >> > > > >> > And as far as I know there are no users really suffering from this at
>>> >> >> > > > >> > the moment (in the sense that someone has complained on the mailing
>>> >> >> > > > >> > list).
>>> >> >> > > > >> >
>>> >> >> > > > >> > Hence, I vote to postpone this.
>>> >> >> > > > >> >
>>> >> >> > > > >> > – Ufuk
>>> >> >> > > > >> >
>>> >> >> > > > >> > On 10 Jun 2015, at 00:19, Gyula Fóra <[email protected]> wrote:
>>> >> >> > > > >> >
>>> >> >> > > > >> > > Hey all,
>>> >> >> > > > >> > >
>>> >> >> > > > >> > > It is currently impossible to enable state checkpointing for iterative
>>> >> >> > > > >> > > jobs, because an exception is thrown when creating the jobgraph. This
>>> >> >> > > > >> > > behaviour is motivated by the lack of precise guarantees that we can
>>> >> >> > > > >> > > give with the current fault-tolerance implementations for cyclic graphs.
>>> >> >> > > > >> > >
>>> >> >> > > > >> > > This PR <https://github.com/apache/flink/pull/812> adds an optional
>>> >> >> > > > >> > > flag to force checkpoints even in case of iterations. The algorithm will
>>> >> >> > > > >> > > take checkpoints periodically as before, but records in transit inside
>>> >> >> > > > >> > > the loop will be lost.
>>> >> >> > > > >> > >
>>> >> >> > > > >> > > However, even this guarantee is enough for most applications (Machine
>>> >> >> > > > >> > > Learning for instance) and certainly much better than not having
>>> >> >> > > > >> > > anything at all.
>>> >> >> > > > >> > >
>>> >> >> > > > >> > > I suggest we add this to the 0.9 release as currently many applications
>>> >> >> > > > >> > > suffer from this limitation (SAMOA, ML pipelines, graph streaming etc.)
>>> >> >> > > > >> > >
>>> >> >> > > > >> > > Cheers,
>>> >> >> > > > >> > >
>>> >> >> > > > >> > > Gyula
>>> >> >> > > > >> >
>>> >> >> > > > >> >
>>> >> >> > > > >>
>>> >> >> > > >
>>> >> >> > >
>>> >> >> >
>>> >> >>
>>> >>
>>>
>>
