The elements that are in flight in an iteration are also part of the job's state. I'm wondering whether the state inside iterations still makes sense without these in-flight elements. But I also don't know the King use case, which is why I thought an example could be helpful.

On Wed, Jun 10, 2015 at 12:37 PM, Gyula Fóra <gyula.f...@gmail.com> wrote:
> I don't understand the question. I vote for checkpointing all state in the
> job, even inside iterations (it's more of a loop).
>
> Aljoscha Krettek <aljos...@apache.org> wrote (on Wed, Jun 10, 2015, 12:34):
>
>> I don't understand why having the state inside an iteration but not
>> the elements that correspond to this state or created this state is
>> desirable. Maybe an example could help understand this better?
>>
>> On Wed, Jun 10, 2015 at 11:27 AM, Gyula Fóra <gyula.f...@gmail.com> wrote:
>> > The other tests verify that the checkpointing algorithm runs properly.
>> > That also ensures that it runs for iterations because a loop is just an
>> > extra source and sink in the jobgraph (so it is the same for the
>> > algorithm).
>> >
>> > Fabian Hueske <fhue...@gmail.com> wrote (on Wed, Jun 10, 2015, 11:19):
>> >
>> >> Without going into the details, how well tested is this feature? The PR
>> >> only extends one test by a few lines.
>> >>
>> >> Is that really enough to ensure that
>> >> 1) the change does not cause trouble
>> >> 2) is working as expected
>> >>
>> >> If this feature should go into the release, it must be thoroughly
>> >> checked and we must take the time for that.
>> >> Including code and hoping for the best because time is scarce is not an
>> >> option IMO.
>> >>
>> >> Fabian
>> >>
>> >>
>> >> 2015-06-10 11:05 GMT+02:00 Gyula Fóra <gyula.f...@gmail.com>:
>> >>
>> >> > And also I would like to remind everyone that any fault tolerance we
>> >> > provide is only as good as the fault tolerance of the master node,
>> >> > which is nonexistent at the moment.
>> >> >
>> >> > So I don't see a reason why a user should not be able to choose
>> >> > whether he wants state checkpoints for iterations as well.
>> >> >
>> >> > In any case this will be used by King for instance, so making it part
>> >> > of the release would save a lot of work for everyone.
>> >> >
>> >> > Paris Carbone <par...@kth.se> wrote (on Wed, Jun 10, 2015, 10:29):
>> >> >
>> >> > >
>> >> > > To continue Gyula's point, for consistent snapshots we need to
>> >> > > persist the records in transit within the loop and also slightly
>> >> > > change the current protocol since it works only for DAGs. Before
>> >> > > going into that direction though I would propose we first see
>> >> > > whether there is a nice way to make iterations more structured.
>> >> > >
>> >> > > Paris
>> >> > > ________________________________________
>> >> > > From: Gyula Fóra <gyula.f...@gmail.com>
>> >> > > Sent: Wednesday, June 10, 2015 10:19 AM
>> >> > > To: dev@flink.apache.org
>> >> > > Subject: Re: Force enabling checkpoints for iterative streaming jobs
>> >> > >
>> >> > > I disagree. Not having checkpointed operators inside the iteration
>> >> > > still breaks the guarantees.
>> >> > >
>> >> > > It is not about the states; it is about the loop itself.
>> >> > > On Wed, Jun 10, 2015 at 10:12 AM Aljoscha Krettek
>> >> > > <aljos...@apache.org> wrote:
>> >> > >
>> >> > > > This is the answer I gave on the PR (we should have one place for
>> >> > > > discussing this, though):
>> >> > > >
>> >> > > > I would be against merging this in the current form. What I
>> >> > > > propose is to analyse the topology to verify that there are no
>> >> > > > checkpointed operators inside iterations. Operators before and
>> >> > > > after iterations can be checkpointed and we can safely allow the
>> >> > > > user to enable checkpointing.
>> >> > > >
>> >> > > > If we have the code to analyse which operators are inside
>> >> > > > iterations we could also disallow windows inside iterations. I
>> >> > > > think windows inside iterations don't make sense since elements in
>> >> > > > different "iterations" would end up in the same window. Maybe I'm
>> >> > > > wrong here though, then please correct me.
>> >> > > >
>> >> > > > On Wed, Jun 10, 2015 at 10:08 AM, Márton Balassi
>> >> > > > <balassi.mar...@gmail.com> wrote:
>> >> > > > > I agree that for the sake of the above-mentioned use cases it is
>> >> > > > > reasonable to add this to the release with the right
>> >> > > > > documentation; for machine learning, potentially losing one
>> >> > > > > round of feedback data should not matter.
>> >> > > > >
>> >> > > > > Let us not block prominent users until the next release on this.
>> >> > > > >
>> >> > > > > On Wed, Jun 10, 2015 at 8:09 AM, Gyula Fóra
>> >> > > > > <gyula.f...@gmail.com> wrote:
>> >> > > > >
>> >> > > > >> As for people currently suffering from it:
>> >> > > > >>
>> >> > > > >> An application King is developing requires iterations, and they
>> >> > > > >> need checkpoints. Practically all SAMOA programs would need
>> >> > > > >> this.
>> >> > > > >>
>> >> > > > >> It is very likely that the state interfaces will be changed
>> >> > > > >> after the release, so this is not something that we can just
>> >> > > > >> add later. I don't see a reason why we should not add it, as it
>> >> > > > >> is clearly documented. In this actual case not having
>> >> > > > >> guarantees at all means people will never use it in any
>> >> > > > >> production system. Having limited guarantees means that it will
>> >> > > > >> depend on the application.
>> >> > > > >>
>> >> > > > >> On Wed, Jun 10, 2015 at 12:53 AM, Ufuk Celebi <u...@apache.org>
>> >> > > > >> wrote:
>> >> > > > >>
>> >> > > > >> > Hey Gyula,
>> >> > > > >> >
>> >> > > > >> > I understand your reasoning, but I don't think it's worth
>> >> > > > >> > rushing this into the release.
>> >> > > > >> >
>> >> > > > >> > As you've said, we cannot give precise guarantees. But this
>> >> > > > >> > is arguably one of the key requirements for any fault
>> >> > > > >> > tolerance mechanism. Therefore I disagree that this is better
>> >> > > > >> > than not having anything at all. I think it will already go a
>> >> > > > >> > long way to have the non-iterative case working reliably.
>> >> > > > >> >
>> >> > > > >> > And as far as I know there are no users really suffering from
>> >> > > > >> > this at the moment (in the sense that someone has complained
>> >> > > > >> > on the mailing list).
>> >> > > > >> >
>> >> > > > >> > Hence, I vote to postpone this.
>> >> > > > >> >
>> >> > > > >> > – Ufuk
>> >> > > > >> >
>> >> > > > >> > On 10 Jun 2015, at 00:19, Gyula Fóra <gyf...@apache.org>
>> >> > > > >> > wrote:
>> >> > > > >> >
>> >> > > > >> > > Hey all,
>> >> > > > >> > >
>> >> > > > >> > > It is currently impossible to enable state checkpointing
>> >> > > > >> > > for iterative jobs, because an exception is thrown when
>> >> > > > >> > > creating the jobgraph. This behaviour is motivated by the
>> >> > > > >> > > lack of precise guarantees that we can give with the
>> >> > > > >> > > current fault-tolerance implementations for cyclic graphs.
>> >> > > > >> > >
>> >> > > > >> > > This PR <https://github.com/apache/flink/pull/812> adds an
>> >> > > > >> > > optional flag to force checkpoints even in case of
>> >> > > > >> > > iterations. The algorithm will take checkpoints
>> >> > > > >> > > periodically as before, but records in transit inside the
>> >> > > > >> > > loop will be lost.
>> >> > > > >> > >
>> >> > > > >> > > However even this guarantee is enough for most applications
>> >> > > > >> > > (Machine Learning for instance) and certainly much better
>> >> > > > >> > > than not having anything at all.
>> >> > > > >> > >
>> >> > > > >> > >
>> >> > > > >> > > I suggest we add this to the 0.9 release as currently many
>> >> > > > >> > > applications suffer from this limitation (SAMOA, ML
>> >> > > > >> > > pipelines, graph streaming etc.)
>> >> > > > >> > >
>> >> > > > >> > >
>> >> > > > >> > > Cheers,
>> >> > > > >> > >
>> >> > > > >> > > Gyula