The other tests verify that the checkpointing algorithm runs properly. That also ensures that it runs for iterations, because a loop is just an extra source and sink in the jobgraph (so, for the algorithm, it is the same case).

Fabian Hueske <fhue...@gmail.com> wrote (on Wed, Jun 10, 2015, 11:19):

> Without going into the details, how well tested is this feature? The PR
> only extends one test by a few lines.
>
> Is that really enough to ensure that
> 1) the change does not cause trouble
> 2) is working as expected
>
> If this feature should go into the release, it must be thoroughly checked
> and we must take the time for that.
> Including code and hoping for the best because time is scarce is not an
> option IMO.
>
> Fabian
>
>
> 2015-06-10 11:05 GMT+02:00 Gyula Fóra <gyula.f...@gmail.com>:
>
> > And also I would like to remind everyone that any fault tolerance we
> > provide is only as good as the fault tolerance of the master node. Which
> > is nonexistent at the moment.
> >
> > So I don't see a reason why a user should not be able to choose whether
> > he wants state checkpoints for iterations as well.
> >
> > In any case this will be used by King for instance, so making it part of
> > the release would save a lot of work for everyone.
> >
> > Paris Carbone <par...@kth.se> wrote (on Wed, Jun 10, 2015, 10:29):
> >
> > > To continue Gyula's point, for consistent snapshots we need to persist
> > > the records in transit within the loop and also slightly change the
> > > current protocol since it works only for DAGs. Before going into that
> > > direction though I would propose we first see whether there is a nice
> > > way to make iterations more structured.
> > >
> > > Paris
> > > ________________________________________
> > > From: Gyula Fóra <gyula.f...@gmail.com>
> > > Sent: Wednesday, June 10, 2015 10:19 AM
> > > To: dev@flink.apache.org
> > > Subject: Re: Force enabling checkpoints for iterative streaming jobs
> > >
> > > I disagree. Not having checkpointed operators inside the iteration
> > > still breaks the guarantees.
> > >
> > > It is not about the states; it is about the loop itself.
> > >
> > > On Wed, Jun 10, 2015 at 10:12 AM Aljoscha Krettek <aljos...@apache.org>
> > > wrote:
> > >
> > > > This is the answer I gave on the PR (we should have one place for
> > > > discussing this, though):
> > > >
> > > > I would be against merging this in the current form. What I propose
> > > > is to analyse the topology to verify that there are no checkpointed
> > > > operators inside iterations. Operators before and after iterations
> > > > can be checkpointed and we can safely allow the user to enable
> > > > checkpointing.
> > > >
> > > > If we have the code to analyse which operators are inside iterations
> > > > we could also disallow windows inside iterations. I think windows
> > > > inside iterations don't make sense since elements in different
> > > > "iterations" would end up in the same window. Maybe I'm wrong here
> > > > though, then please correct me.
> > > >
> > > > On Wed, Jun 10, 2015 at 10:08 AM, Márton Balassi
> > > > <balassi.mar...@gmail.com> wrote:
> > > > > I agree that for the sake of the above mentioned use cases it is
> > > > > reasonable to add this to the release with the right documentation;
> > > > > for machine learning, potentially losing one round of feedback data
> > > > > should not matter.
> > > > >
> > > > > Let us not block prominent users until the next release on this.
> > > > >
> > > > > On Wed, Jun 10, 2015 at 8:09 AM, Gyula Fóra <gyula.f...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> As for people currently suffering from it:
> > > > >>
> > > > >> An application King is developing requires iterations, and they
> > > > >> need checkpoints. Practically all SAMOA programs would need this.
> > > > >>
> > > > >> It is very likely that the state interfaces will be changed after
> > > > >> the release, so this is not something that we can just add later.
> > > > >> I don't see a reason why we should not add it, as it is clearly
> > > > >> documented.
> > > > >> In this actual case not having guarantees at all means people
> > > > >> will never use it in any production system. Having limited
> > > > >> guarantees means that it will depend on the application.
> > > > >>
> > > > >> On Wed, Jun 10, 2015 at 12:53 AM, Ufuk Celebi <u...@apache.org>
> > > > >> wrote:
> > > > >>
> > > > >> > Hey Gyula,
> > > > >> >
> > > > >> > I understand your reasoning, but I don't think it's worth
> > > > >> > rushing this into the release.
> > > > >> >
> > > > >> > As you've said, we cannot give precise guarantees. But this is
> > > > >> > arguably one of the key requirements for any fault tolerance
> > > > >> > mechanism. Therefore I disagree that this is better than not
> > > > >> > having anything at all. I think it will already go a long way to
> > > > >> > have the non-iterative case working reliably.
> > > > >> >
> > > > >> > And as far as I know there are no users really suffering from
> > > > >> > this at the moment (in the sense that someone has complained on
> > > > >> > the mailing list).
> > > > >> >
> > > > >> > Hence, I vote to postpone this.
> > > > >> >
> > > > >> > – Ufuk
> > > > >> >
> > > > >> > On 10 Jun 2015, at 00:19, Gyula Fóra <gyf...@apache.org> wrote:
> > > > >> >
> > > > >> > > Hey all,
> > > > >> > >
> > > > >> > > It is currently impossible to enable state checkpointing for
> > > > >> > > iterative jobs, because an exception is thrown when creating
> > > > >> > > the jobgraph. This behaviour is motivated by the lack of
> > > > >> > > precise guarantees that we can give with the current
> > > > >> > > fault-tolerance implementations for cyclic graphs.
> > > > >> > >
> > > > >> > > This PR <https://github.com/apache/flink/pull/812> adds an
> > > > >> > > optional flag to force checkpoints even in case of iterations.
> > > > >> > > The algorithm will take checkpoints periodically as before,
> > > > >> > > but records in transit inside the loop will be lost.
> > > > >> > >
> > > > >> > > However even this guarantee is enough for most applications
> > > > >> > > (Machine Learning for instance) and certainly much better than
> > > > >> > > not having anything at all.
> > > > >> > >
> > > > >> > > I suggest we add this to the 0.9 release as currently many
> > > > >> > > applications suffer from this limitation (SAMOA, ML pipelines,
> > > > >> > > graph streaming etc.)
> > > > >> > >
> > > > >> > > Cheers,
> > > > >> > >
> > > > >> > > Gyula