Max suggested that I add this feature slightly hidden to the execution config instance.
The problem then is that I either make a public field in the config or once again add a method. Any ideas? Aljoscha Krettek <aljos...@apache.org> ezt írta (időpont: 2015. jún. 10., Sze, 14:07): > Thanks :D, now I see. It makes sense because we don't have another way > of keeping the cluster state synced/distributed across parallel > instances of the operators. > > On Wed, Jun 10, 2015 at 12:52 PM, Gyula Fóra <gyula.f...@gmail.com> wrote: > > Here is an example for you: > > > > Parallel streaming kmeans, the state we keep is the current cluster > > centers, and we use iterations to sync the centers across parallel > > instances. > > We can afford lost model updated in the loop but we need the checkpoint > the > > models. > > > > > https://github.com/gyfora/stream-clustering/blob/master/src/main/scala/stream/clustering/StreamClustering.scala > > > > (checkpointing is not turned on but you will get the point) > > > > > > > > Gyula Fóra <gyula.f...@gmail.com> ezt írta (időpont: 2015. jún. 10., > Sze, > > 12:47): > > > >> You are right, to have consistent results we would need to persist the > >> records. > >> > >> But since we cannot do that right now, we can still checkpoint all > >> operator states and understand that inflight records in the loop are > lost > >> on failure. > >> > >> This is acceptable for most the use-cases that we have developed so far > >> for iterations (machine learning, graph updates, etc.) What is not > >> acceptable is to not have checkpointing at all. > >> > >> Aljoscha Krettek <aljos...@apache.org> ezt írta (időpont: 2015. jún. > 10., > >> Sze, 12:43): > >> > >>> The elements that are in-flight in an iteration are also state of the > >>> job. I'm wondering whether the state inside iterations still makes > >>> sense without these in-flight elements. But I also don't know the King > >>> use-case, that's why I though an example could be helpful. > >>> > >>> On Wed, Jun 10, 2015 at 12:37 PM, Gyula Fóra <gyula.f...@gmail.com> > >>> wrote: > >>> > I don't understand the question, I vote for checkpointing all state > in > >>> the > >>> > job, even inside iterations (its more of a loop). > >>> > > >>> > Aljoscha Krettek <aljos...@apache.org> ezt írta (időpont: 2015. jún. > >>> 10., > >>> > Sze, 12:34): > >>> > > >>> >> I don't understand why having the state inside an iteration but not > >>> >> the elements that correspond to this state or created this state is > >>> >> desirable. Maybe an example could help understand this better? > >>> >> > >>> >> On Wed, Jun 10, 2015 at 11:27 AM, Gyula Fóra <gyula.f...@gmail.com> > >>> wrote: > >>> >> > The other tests verify that the checkpointing algorithm runs > >>> properly. > >>> >> That > >>> >> > also ensures that it runs for iterations because a loop is just an > >>> extra > >>> >> > source and sink in the jobgraph (so it is the same for the > >>> algorithm). > >>> >> > > >>> >> > Fabian Hueske <fhue...@gmail.com> ezt írta (időpont: 2015. jún. > 10., > >>> >> Sze, > >>> >> > 11:19): > >>> >> > > >>> >> >> Without going into the details, how well tested is this feature? > >>> The PR > >>> >> >> only extends one test by a few lines. > >>> >> >> > >>> >> >> Is that really enough to ensure that > >>> >> >> 1) the change does not cause trouble > >>> >> >> 2) is working as expected > >>> >> >> > >>> >> >> If this feature should go into the release, it must be thoroughly > >>> >> checked > >>> >> >> and we must take the time for that. > >>> >> >> Including code and hoping for the best because time is scarce is > >>> not an > >>> >> >> option IMO. > >>> >> >> > >>> >> >> Fabian > >>> >> >> > >>> >> >> > >>> >> >> 2015-06-10 11:05 GMT+02:00 Gyula Fóra <gyula.f...@gmail.com>: > >>> >> >> > >>> >> >> > And also I would like to remind everyone that any fault > tolerance > >>> we > >>> >> >> > provide is only as good as the fault tolerance of the master > node. > >>> >> Which > >>> >> >> is > >>> >> >> > non existent at the moment. > >>> >> >> > > >>> >> >> > So I don't see a reason why a user should not be able to choose > >>> >> whether > >>> >> >> he > >>> >> >> > wants state checkpoints for iterations as well. > >>> >> >> > > >>> >> >> > In any case this will be used by King for instance, so making > it > >>> part > >>> >> of > >>> >> >> > the release would save a lot of work for everyone. > >>> >> >> > > >>> >> >> > Paris Carbone <par...@kth.se> ezt írta (időpont: 2015. jún. > 10., > >>> Sze, > >>> >> >> > 10:29): > >>> >> >> > > >>> >> >> > > > >>> >> >> > > To continue Gyula's point, for consistent snapshots we need > to > >>> >> persist > >>> >> >> > the > >>> >> >> > > records in transit within the loop and also slightly change > the > >>> >> >> current > >>> >> >> > > protocol since it works only for DAGs. Before going into that > >>> >> direction > >>> >> >> > > though I would propose we first see whether there is a nice > way > >>> to > >>> >> make > >>> >> >> > > iterations more structured. > >>> >> >> > > > >>> >> >> > > Paris > >>> >> >> > > ________________________________________ > >>> >> >> > > From: Gyula Fóra <gyula.f...@gmail.com> > >>> >> >> > > Sent: Wednesday, June 10, 2015 10:19 AM > >>> >> >> > > To: dev@flink.apache.org > >>> >> >> > > Subject: Re: Force enabling checkpoints for iterative > streaming > >>> jobs > >>> >> >> > > > >>> >> >> > > I disagree. Not having checkpointed operators inside the > >>> iteration > >>> >> >> still > >>> >> >> > > breaks the guarantees. > >>> >> >> > > > >>> >> >> > > It is not about the states it is about the loop itself. > >>> >> >> > > On Wed, Jun 10, 2015 at 10:12 AM Aljoscha Krettek < > >>> >> aljos...@apache.org > >>> >> >> > > >>> >> >> > > wrote: > >>> >> >> > > > >>> >> >> > > > This is the answer I gave on the PR (we should have one > place > >>> for > >>> >> >> > > > discussing this, though): > >>> >> >> > > > > >>> >> >> > > > I would be against merging this in the current form. What I > >>> >> propose > >>> >> >> is > >>> >> >> > > > to analyse the topology to verify that there are no > >>> checkpointed > >>> >> >> > > > operators inside iterations. Operators before and after > >>> iterations > >>> >> >> can > >>> >> >> > > > be checkpointed and we can safely allow the user to enable > >>> >> >> > > > checkpointing. > >>> >> >> > > > > >>> >> >> > > > If we have the code to analyse which operators are inside > >>> >> iterations > >>> >> >> > > > we could also disallow windows inside iterations. I think > >>> windows > >>> >> >> > > > inside iterations don't make sense since elements in > different > >>> >> >> > > > "iterations" would end up in the same window. Maybe I'm > wrong > >>> here > >>> >> >> > > > though, then please correct me. > >>> >> >> > > > > >>> >> >> > > > On Wed, Jun 10, 2015 at 10:08 AM, Márton Balassi > >>> >> >> > > > <balassi.mar...@gmail.com> wrote: > >>> >> >> > > > > I agree that for the sake of the above mentioned use > cases > >>> it is > >>> >> >> > > > reasonable > >>> >> >> > > > > to add this to the release with the right documentation, > for > >>> >> >> machine > >>> >> >> > > > > learning potentially loosing one round of feedback data > >>> should > >>> >> not > >>> >> >> > > > matter. > >>> >> >> > > > > > >>> >> >> > > > > Let us not block prominent users until the next release > on > >>> this. > >>> >> >> > > > > > >>> >> >> > > > > On Wed, Jun 10, 2015 at 8:09 AM, Gyula Fóra < > >>> >> gyula.f...@gmail.com> > >>> >> >> > > > wrote: > >>> >> >> > > > > > >>> >> >> > > > >> As for people currently suffering from it: > >>> >> >> > > > >> > >>> >> >> > > > >> An application King is developing requires iterations, > and > >>> they > >>> >> >> need > >>> >> >> > > > >> checkpoints. Practically all SAMOA programs would need > >>> this. > >>> >> >> > > > >> > >>> >> >> > > > >> It is very likely that the state interfaces will be > changed > >>> >> after > >>> >> >> > the > >>> >> >> > > > >> release, so this is not something that we can just add > >>> later. I > >>> >> >> > don't > >>> >> >> > > > see a > >>> >> >> > > > >> reason why we should not add it, as it is clearly > >>> documented. > >>> >> In > >>> >> >> > this > >>> >> >> > > > >> actual case not having guarantees at all means people > will > >>> >> never > >>> >> >> use > >>> >> >> > > it > >>> >> >> > > > in > >>> >> >> > > > >> any production system. Having limited guarantees means > >>> that it > >>> >> >> will > >>> >> >> > > > depend > >>> >> >> > > > >> on the application. > >>> >> >> > > > >> > >>> >> >> > > > >> On Wed, Jun 10, 2015 at 12:53 AM, Ufuk Celebi < > >>> u...@apache.org> > >>> >> >> > wrote: > >>> >> >> > > > >> > >>> >> >> > > > >> > Hey Gyula, > >>> >> >> > > > >> > > >>> >> >> > > > >> > I understand your reasoning, but I don't think its > worth > >>> to > >>> >> rush > >>> >> >> > > this > >>> >> >> > > > >> into > >>> >> >> > > > >> > the release. > >>> >> >> > > > >> > > >>> >> >> > > > >> > As you've said, we cannot give precise guarantees. But > >>> this > >>> >> is > >>> >> >> > > > arguably > >>> >> >> > > > >> > one of the key requirements for any fault tolerance > >>> >> mechanism. > >>> >> >> > > > Therefore > >>> >> >> > > > >> I > >>> >> >> > > > >> > disagree that this is better than not having anything > at > >>> >> all. I > >>> >> >> > > think > >>> >> >> > > > it > >>> >> >> > > > >> > will already go a long way to have the non-iterative > case > >>> >> >> working > >>> >> >> > > > >> reliably. > >>> >> >> > > > >> > > >>> >> >> > > > >> > And as far as I know there are no users really > suffering > >>> from > >>> >> >> this > >>> >> >> > > at > >>> >> >> > > > the > >>> >> >> > > > >> > moment (in the sense that someone has complained on > the > >>> >> mailing > >>> >> >> > > list). > >>> >> >> > > > >> > > >>> >> >> > > > >> > Hence, I vote to postpone this. > >>> >> >> > > > >> > > >>> >> >> > > > >> > – Ufuk > >>> >> >> > > > >> > > >>> >> >> > > > >> > On 10 Jun 2015, at 00:19, Gyula Fóra < > gyf...@apache.org> > >>> >> wrote: > >>> >> >> > > > >> > > >>> >> >> > > > >> > > Hey all, > >>> >> >> > > > >> > > > >>> >> >> > > > >> > > It is currently impossible to enable state > >>> checkpointing > >>> >> for > >>> >> >> > > > iterative > >>> >> >> > > > >> > > jobs, because en exception is thrown when creating > the > >>> >> >> jobgraph. > >>> >> >> > > > This > >>> >> >> > > > >> > > behaviour is motivated by the lack of precise > >>> guarantees > >>> >> that > >>> >> >> we > >>> >> >> > > can > >>> >> >> > > > >> give > >>> >> >> > > > >> > > with the current fault-tolerance implementations for > >>> cyclic > >>> >> >> > > graphs. > >>> >> >> > > > >> > > > >>> >> >> > > > >> > > This PR <https://github.com/apache/flink/pull/812> > >>> adds an > >>> >> >> > > optional > >>> >> >> > > > >> > flag to > >>> >> >> > > > >> > > force checkpoints even in case of iterations. The > >>> algorithm > >>> >> >> will > >>> >> >> > > > take > >>> >> >> > > > >> > > checkpoints periodically as before, but records in > >>> transit > >>> >> >> > inside > >>> >> >> > > > the > >>> >> >> > > > >> > loop > >>> >> >> > > > >> > > will be lost. > >>> >> >> > > > >> > > > >>> >> >> > > > >> > > However even this guarantee is enough for most > >>> applications > >>> >> >> > > (Machine > >>> >> >> > > > >> > > Learning for instance) and certainly much better > than > >>> not > >>> >> >> having > >>> >> >> > > > >> anything > >>> >> >> > > > >> > > at all. > >>> >> >> > > > >> > > > >>> >> >> > > > >> > > > >>> >> >> > > > >> > > I suggest we add this to the 0.9 release as > currently > >>> many > >>> >> >> > > > applications > >>> >> >> > > > >> > > suffer from this limitation (SAMOA, ML pipelines, > graph > >>> >> >> > streaming > >>> >> >> > > > etc.) > >>> >> >> > > > >> > > > >>> >> >> > > > >> > > > >>> >> >> > > > >> > > Cheers, > >>> >> >> > > > >> > > > >>> >> >> > > > >> > > Gyula > >>> >> >> > > > >> > > >>> >> >> > > > >> > > >>> >> >> > > > >> > >>> >> >> > > > > >>> >> >> > > > >>> >> >> > > >>> >> >> > >>> >> > >>> > >> >