Re: Force enabling checkpoints for iterative streaming jobs

Aljoscha Krettek Wed, 10 Jun 2015 03:35:15 -0700

I don't understand why it is desirable to have the state inside an
iteration but not the elements that correspond to or created this
state. Maybe an example could help me understand this better?


On Wed, Jun 10, 2015 at 11:27 AM, Gyula Fóra <[email protected]> wrote:
> The other tests verify that the checkpointing algorithm runs properly. That
> also ensures that it runs for iterations because a loop is just an extra
> source and sink in the jobgraph (so it is the same for the algorithm).
>
> Fabian Hueske <[email protected]> wrote (on Wed, Jun 10, 2015,
> 11:19):
>
>> Without going into the details, how well tested is this feature? The PR
>> only extends one test by a few lines.
>>
>> Is that really enough to ensure that
>> 1) the change does not cause trouble
>> 2) it works as expected?
>>
>> If this feature should go into the release, it must be thoroughly checked
>> and we must take the time for that.
>> Including code and hoping for the best because time is scarce is not an
>> option IMO.
>>
>> Fabian
>>
>>
>> 2015-06-10 11:05 GMT+02:00 Gyula Fóra <[email protected]>:
>>
>> > And also I would like to remind everyone that any fault tolerance we
>> > provide is only as good as the fault tolerance of the master node, which
>> > is nonexistent at the moment.
>> >
>> > So I don't see a reason why a user should not be able to choose whether
>> > he wants state checkpoints for iterations as well.
>> >
>> > In any case this will be used by King for instance, so making it part of
>> > the release would save a lot of work for everyone.
>> >
>> > Paris Carbone <[email protected]> wrote (on Wed, Jun 10, 2015,
>> > 10:29):
>> >
>> > >
>> > > To continue Gyula's point, for consistent snapshots we need to persist
>> > > the records in transit within the loop and also slightly change the
>> > > current protocol, since it works only for DAGs. Before going in that
>> > > direction, though, I would propose we first see whether there is a nice
>> > > way to make iterations more structured.
>> > >
>> > > Paris
>> > > ________________________________________
>> > > From: Gyula Fóra <[email protected]>
>> > > Sent: Wednesday, June 10, 2015 10:19 AM
>> > > To: [email protected]
>> > > Subject: Re: Force enabling checkpoints for iterative streaming jobs
>> > >
>> > > I disagree. Not having checkpointed operators inside the iteration
>> > > still breaks the guarantees.
>> > >
>> > > It is not about the states; it is about the loop itself.
>> > > On Wed, Jun 10, 2015 at 10:12 AM Aljoscha Krettek <[email protected]>
>> > > wrote:
>> > >
>> > > > This is the answer I gave on the PR (we should have one place for
>> > > > discussing this, though):
>> > > >
>> > > > I would be against merging this in the current form. What I propose
>> > > > is to analyse the topology to verify that there are no checkpointed
>> > > > operators inside iterations. Operators before and after iterations
>> > > > can be checkpointed and we can safely allow the user to enable
>> > > > checkpointing.
>> > > >
>> > > > If we have the code to analyse which operators are inside iterations
>> > > > we could also disallow windows inside iterations. I think windows
>> > > > inside iterations don't make sense since elements in different
>> > > > "iterations" would end up in the same window. Maybe I'm wrong here
>> > > > though, then please correct me.
>> > > >
>> > > > On Wed, Jun 10, 2015 at 10:08 AM, Márton Balassi
>> > > > <[email protected]> wrote:
>> > > > > I agree that for the sake of the above-mentioned use cases it is
>> > > > > reasonable to add this to the release with the right documentation.
>> > > > > For machine learning, potentially losing one round of feedback data
>> > > > > should not matter.
>> > > > >
>> > > > > Let us not block prominent users until the next release on this.
>> > > > >
>> > > > > On Wed, Jun 10, 2015 at 8:09 AM, Gyula Fóra <[email protected]> wrote:
>> > > > >
>> > > > >> As for people currently suffering from it:
>> > > > >>
>> > > > >> An application King is developing requires iterations, and they
>> > > > >> need checkpoints. Practically all SAMOA programs would need this.
>> > > > >>
>> > > > >> It is very likely that the state interfaces will be changed after
>> > > > >> the release, so this is not something that we can just add later. I
>> > > > >> don't see a reason why we should not add it, as it is clearly
>> > > > >> documented. In this actual case, not having guarantees at all means
>> > > > >> people will never use it in any production system. Having limited
>> > > > >> guarantees means that it will depend on the application.
>> > > > >>
>> > > > >> On Wed, Jun 10, 2015 at 12:53 AM, Ufuk Celebi <[email protected]> wrote:
>> > > > >>
>> > > > >> > Hey Gyula,
>> > > > >> >
>> > > > >> > I understand your reasoning, but I don't think it's worth rushing
>> > > > >> > this into the release.
>> > > > >> >
>> > > > >> > As you've said, we cannot give precise guarantees. But this is
>> > > > >> > arguably one of the key requirements for any fault tolerance
>> > > > >> > mechanism. Therefore I disagree that this is better than not having
>> > > > >> > anything at all. I think it will already go a long way to have the
>> > > > >> > non-iterative case working reliably.
>> > > > >> >
>> > > > >> > And as far as I know there are no users really suffering from this
>> > > > >> > at the moment (in the sense that someone has complained on the
>> > > > >> > mailing list).
>> > > > >> >
>> > > > >> > Hence, I vote to postpone this.
>> > > > >> >
>> > > > >> > – Ufuk
>> > > > >> >
>> > > > >> > On 10 Jun 2015, at 00:19, Gyula Fóra <[email protected]> wrote:
>> > > > >> >
>> > > > >> > > Hey all,
>> > > > >> > >
>> > > > >> > > It is currently impossible to enable state checkpointing for
>> > > > >> > > iterative jobs, because an exception is thrown when creating the
>> > > > >> > > jobgraph. This behaviour is motivated by the lack of precise
>> > > > >> > > guarantees that we can give with the current fault-tolerance
>> > > > >> > > implementations for cyclic graphs.
>> > > > >> > >
>> > > > >> > > This PR <https://github.com/apache/flink/pull/812> adds an optional
>> > > > >> > > flag to force checkpoints even in case of iterations. The algorithm
>> > > > >> > > will take checkpoints periodically as before, but records in transit
>> > > > >> > > inside the loop will be lost.
>> > > > >> > >
>> > > > >> > > However, even this guarantee is enough for most applications
>> > > > >> > > (Machine Learning for instance) and certainly much better than not
>> > > > >> > > having anything at all.
>> > > > >> > >
>> > > > >> > >
>> > > > >> > > I suggest we add this to the 0.9 release, as currently many
>> > > > >> > > applications suffer from this limitation (SAMOA, ML pipelines, graph
>> > > > >> > > streaming etc.).
>> > > > >> > >
>> > > > >> > >
>> > > > >> > > Cheers,
>> > > > >> > >
>> > > > >> > > Gyula
>> > > > >> >
>> > > > >> >
>> > > > >>
>> > > >
>> > >
>> >
>>
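
For illustration, the topology check proposed in the thread (verifying that
no checkpointed operator sits inside an iteration) could be sketched roughly
as below. This is not Flink code: the graph representation, the operator
names, and the helper functions are all hypothetical, and a job graph is
assumed to be a plain mapping from each operator to its downstream operators.
An operator counts as "inside an iteration" here if it can reach itself.

```python
from typing import Dict, Iterable, List, Set


def nodes_inside_loops(graph: Dict[str, List[str]]) -> Set[str]:
    """Return every node that lies on a cycle, i.e. can reach itself."""
    inside: Set[str] = set()
    for start in graph:
        # Depth-first walk over everything reachable from `start`.
        stack = list(graph.get(start, []))
        seen: Set[str] = set()
        while stack:
            node = stack.pop()
            if node == start:
                inside.add(start)
                break
            if node in seen:
                continue
            seen.add(node)
            stack.extend(graph.get(node, []))
    return inside


def checkpointed_inside_loops(
    graph: Dict[str, List[str]], checkpointed: Iterable[str]
) -> List[str]:
    """List the checkpointed operators that sit inside a loop (sorted)."""
    return sorted(nodes_inside_loops(graph) & set(checkpointed))


# Hypothetical job: source -> map -> stateful_update -> sink, with a
# feedback edge from stateful_update back to map.
job = {
    "source": ["map"],
    "map": ["stateful_update"],
    "stateful_update": ["feedback", "sink"],
    "feedback": ["map"],
    "sink": [],
}
offenders = checkpointed_inside_loops(job, {"source", "stateful_update"})
```

With this hypothetical job, `offenders` contains only `stateful_update`:
`source` is checkpointed but lies before the loop, so it would be allowed
under the proposal. The per-node reachability walk is quadratic in the worst
case; a strongly-connected-components pass would do the same in linear time.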
