Thanks :D, now I see. It makes sense because we don't have another way of keeping the cluster state synced/distributed across parallel instances of the operators.
On Wed, Jun 10, 2015 at 12:52 PM, Gyula Fóra <gyula.f...@gmail.com> wrote: > Here is an example for you: > > Parallel streaming kmeans, the state we keep is the current cluster > centers, and we use iterations to sync the centers across parallel > instances. > We can afford lost model updated in the loop but we need the checkpoint the > models. > > https://github.com/gyfora/stream-clustering/blob/master/src/main/scala/stream/clustering/StreamClustering.scala > > (checkpointing is not turned on but you will get the point) > > > > Gyula Fóra <gyula.f...@gmail.com> ezt írta (időpont: 2015. jún. 10., Sze, > 12:47): > >> You are right, to have consistent results we would need to persist the >> records. >> >> But since we cannot do that right now, we can still checkpoint all >> operator states and understand that inflight records in the loop are lost >> on failure. >> >> This is acceptable for most the use-cases that we have developed so far >> for iterations (machine learning, graph updates, etc.) What is not >> acceptable is to not have checkpointing at all. >> >> Aljoscha Krettek <aljos...@apache.org> ezt írta (időpont: 2015. jún. 10., >> Sze, 12:43): >> >>> The elements that are in-flight in an iteration are also state of the >>> job. I'm wondering whether the state inside iterations still makes >>> sense without these in-flight elements. But I also don't know the King >>> use-case, that's why I though an example could be helpful. >>> >>> On Wed, Jun 10, 2015 at 12:37 PM, Gyula Fóra <gyula.f...@gmail.com> >>> wrote: >>> > I don't understand the question, I vote for checkpointing all state in >>> the >>> > job, even inside iterations (its more of a loop). >>> > >>> > Aljoscha Krettek <aljos...@apache.org> ezt írta (időpont: 2015. jún. >>> 10., >>> > Sze, 12:34): >>> > >>> >> I don't understand why having the state inside an iteration but not >>> >> the elements that correspond to this state or created this state is >>> >> desirable. Maybe an example could help understand this better? >>> >> >>> >> On Wed, Jun 10, 2015 at 11:27 AM, Gyula Fóra <gyula.f...@gmail.com> >>> wrote: >>> >> > The other tests verify that the checkpointing algorithm runs >>> properly. >>> >> That >>> >> > also ensures that it runs for iterations because a loop is just an >>> extra >>> >> > source and sink in the jobgraph (so it is the same for the >>> algorithm). >>> >> > >>> >> > Fabian Hueske <fhue...@gmail.com> ezt írta (időpont: 2015. jún. 10., >>> >> Sze, >>> >> > 11:19): >>> >> > >>> >> >> Without going into the details, how well tested is this feature? >>> The PR >>> >> >> only extends one test by a few lines. >>> >> >> >>> >> >> Is that really enough to ensure that >>> >> >> 1) the change does not cause trouble >>> >> >> 2) is working as expected >>> >> >> >>> >> >> If this feature should go into the release, it must be thoroughly >>> >> checked >>> >> >> and we must take the time for that. >>> >> >> Including code and hoping for the best because time is scarce is >>> not an >>> >> >> option IMO. >>> >> >> >>> >> >> Fabian >>> >> >> >>> >> >> >>> >> >> 2015-06-10 11:05 GMT+02:00 Gyula Fóra <gyula.f...@gmail.com>: >>> >> >> >>> >> >> > And also I would like to remind everyone that any fault tolerance >>> we >>> >> >> > provide is only as good as the fault tolerance of the master node. >>> >> Which >>> >> >> is >>> >> >> > non existent at the moment. >>> >> >> > >>> >> >> > So I don't see a reason why a user should not be able to choose >>> >> whether >>> >> >> he >>> >> >> > wants state checkpoints for iterations as well. >>> >> >> > >>> >> >> > In any case this will be used by King for instance, so making it >>> part >>> >> of >>> >> >> > the release would save a lot of work for everyone. >>> >> >> > >>> >> >> > Paris Carbone <par...@kth.se> ezt írta (időpont: 2015. jún. 10., >>> Sze, >>> >> >> > 10:29): >>> >> >> > >>> >> >> > > >>> >> >> > > To continue Gyula's point, for consistent snapshots we need to >>> >> persist >>> >> >> > the >>> >> >> > > records in transit within the loop and also slightly change the >>> >> >> current >>> >> >> > > protocol since it works only for DAGs. Before going into that >>> >> direction >>> >> >> > > though I would propose we first see whether there is a nice way >>> to >>> >> make >>> >> >> > > iterations more structured. >>> >> >> > > >>> >> >> > > Paris >>> >> >> > > ________________________________________ >>> >> >> > > From: Gyula Fóra <gyula.f...@gmail.com> >>> >> >> > > Sent: Wednesday, June 10, 2015 10:19 AM >>> >> >> > > To: dev@flink.apache.org >>> >> >> > > Subject: Re: Force enabling checkpoints for iterative streaming >>> jobs >>> >> >> > > >>> >> >> > > I disagree. Not having checkpointed operators inside the >>> iteration >>> >> >> still >>> >> >> > > breaks the guarantees. >>> >> >> > > >>> >> >> > > It is not about the states it is about the loop itself. >>> >> >> > > On Wed, Jun 10, 2015 at 10:12 AM Aljoscha Krettek < >>> >> aljos...@apache.org >>> >> >> > >>> >> >> > > wrote: >>> >> >> > > >>> >> >> > > > This is the answer I gave on the PR (we should have one place >>> for >>> >> >> > > > discussing this, though): >>> >> >> > > > >>> >> >> > > > I would be against merging this in the current form. What I >>> >> propose >>> >> >> is >>> >> >> > > > to analyse the topology to verify that there are no >>> checkpointed >>> >> >> > > > operators inside iterations. Operators before and after >>> iterations >>> >> >> can >>> >> >> > > > be checkpointed and we can safely allow the user to enable >>> >> >> > > > checkpointing. >>> >> >> > > > >>> >> >> > > > If we have the code to analyse which operators are inside >>> >> iterations >>> >> >> > > > we could also disallow windows inside iterations. I think >>> windows >>> >> >> > > > inside iterations don't make sense since elements in different >>> >> >> > > > "iterations" would end up in the same window. Maybe I'm wrong >>> here >>> >> >> > > > though, then please correct me. >>> >> >> > > > >>> >> >> > > > On Wed, Jun 10, 2015 at 10:08 AM, Márton Balassi >>> >> >> > > > <balassi.mar...@gmail.com> wrote: >>> >> >> > > > > I agree that for the sake of the above mentioned use cases >>> it is >>> >> >> > > > reasonable >>> >> >> > > > > to add this to the release with the right documentation, for >>> >> >> machine >>> >> >> > > > > learning potentially loosing one round of feedback data >>> should >>> >> not >>> >> >> > > > matter. >>> >> >> > > > > >>> >> >> > > > > Let us not block prominent users until the next release on >>> this. >>> >> >> > > > > >>> >> >> > > > > On Wed, Jun 10, 2015 at 8:09 AM, Gyula Fóra < >>> >> gyula.f...@gmail.com> >>> >> >> > > > wrote: >>> >> >> > > > > >>> >> >> > > > >> As for people currently suffering from it: >>> >> >> > > > >> >>> >> >> > > > >> An application King is developing requires iterations, and >>> they >>> >> >> need >>> >> >> > > > >> checkpoints. Practically all SAMOA programs would need >>> this. >>> >> >> > > > >> >>> >> >> > > > >> It is very likely that the state interfaces will be changed >>> >> after >>> >> >> > the >>> >> >> > > > >> release, so this is not something that we can just add >>> later. I >>> >> >> > don't >>> >> >> > > > see a >>> >> >> > > > >> reason why we should not add it, as it is clearly >>> documented. >>> >> In >>> >> >> > this >>> >> >> > > > >> actual case not having guarantees at all means people will >>> >> never >>> >> >> use >>> >> >> > > it >>> >> >> > > > in >>> >> >> > > > >> any production system. Having limited guarantees means >>> that it >>> >> >> will >>> >> >> > > > depend >>> >> >> > > > >> on the application. >>> >> >> > > > >> >>> >> >> > > > >> On Wed, Jun 10, 2015 at 12:53 AM, Ufuk Celebi < >>> u...@apache.org> >>> >> >> > wrote: >>> >> >> > > > >> >>> >> >> > > > >> > Hey Gyula, >>> >> >> > > > >> > >>> >> >> > > > >> > I understand your reasoning, but I don't think its worth >>> to >>> >> rush >>> >> >> > > this >>> >> >> > > > >> into >>> >> >> > > > >> > the release. >>> >> >> > > > >> > >>> >> >> > > > >> > As you've said, we cannot give precise guarantees. But >>> this >>> >> is >>> >> >> > > > arguably >>> >> >> > > > >> > one of the key requirements for any fault tolerance >>> >> mechanism. >>> >> >> > > > Therefore >>> >> >> > > > >> I >>> >> >> > > > >> > disagree that this is better than not having anything at >>> >> all. I >>> >> >> > > think >>> >> >> > > > it >>> >> >> > > > >> > will already go a long way to have the non-iterative case >>> >> >> working >>> >> >> > > > >> reliably. >>> >> >> > > > >> > >>> >> >> > > > >> > And as far as I know there are no users really suffering >>> from >>> >> >> this >>> >> >> > > at >>> >> >> > > > the >>> >> >> > > > >> > moment (in the sense that someone has complained on the >>> >> mailing >>> >> >> > > list). >>> >> >> > > > >> > >>> >> >> > > > >> > Hence, I vote to postpone this. >>> >> >> > > > >> > >>> >> >> > > > >> > – Ufuk >>> >> >> > > > >> > >>> >> >> > > > >> > On 10 Jun 2015, at 00:19, Gyula Fóra <gyf...@apache.org> >>> >> wrote: >>> >> >> > > > >> > >>> >> >> > > > >> > > Hey all, >>> >> >> > > > >> > > >>> >> >> > > > >> > > It is currently impossible to enable state >>> checkpointing >>> >> for >>> >> >> > > > iterative >>> >> >> > > > >> > > jobs, because en exception is thrown when creating the >>> >> >> jobgraph. >>> >> >> > > > This >>> >> >> > > > >> > > behaviour is motivated by the lack of precise >>> guarantees >>> >> that >>> >> >> we >>> >> >> > > can >>> >> >> > > > >> give >>> >> >> > > > >> > > with the current fault-tolerance implementations for >>> cyclic >>> >> >> > > graphs. >>> >> >> > > > >> > > >>> >> >> > > > >> > > This PR <https://github.com/apache/flink/pull/812> >>> adds an >>> >> >> > > optional >>> >> >> > > > >> > flag to >>> >> >> > > > >> > > force checkpoints even in case of iterations. The >>> algorithm >>> >> >> will >>> >> >> > > > take >>> >> >> > > > >> > > checkpoints periodically as before, but records in >>> transit >>> >> >> > inside >>> >> >> > > > the >>> >> >> > > > >> > loop >>> >> >> > > > >> > > will be lost. >>> >> >> > > > >> > > >>> >> >> > > > >> > > However even this guarantee is enough for most >>> applications >>> >> >> > > (Machine >>> >> >> > > > >> > > Learning for instance) and certainly much better than >>> not >>> >> >> having >>> >> >> > > > >> anything >>> >> >> > > > >> > > at all. >>> >> >> > > > >> > > >>> >> >> > > > >> > > >>> >> >> > > > >> > > I suggest we add this to the 0.9 release as currently >>> many >>> >> >> > > > applications >>> >> >> > > > >> > > suffer from this limitation (SAMOA, ML pipelines, graph >>> >> >> > streaming >>> >> >> > > > etc.) >>> >> >> > > > >> > > >>> >> >> > > > >> > > >>> >> >> > > > >> > > Cheers, >>> >> >> > > > >> > > >>> >> >> > > > >> > > Gyula >>> >> >> > > > >> > >>> >> >> > > > >> > >>> >> >> > > > >> >>> >> >> > > > >>> >> >> > > >>> >> >> > >>> >> >> >>> >> >>> >>