This is the answer I gave on the PR (we should have one place for discussing this, though):
I would be against merging this in the current form. What I propose is to analyse the topology to verify that there are no checkpointed operators inside iterations. Operators before and after iterations can be checkpointed and we can safely allow the user to enable checkpointing.

If we have the code to analyse which operators are inside iterations we could also disallow windows inside iterations. I think windows inside iterations don't make sense since elements in different "iterations" would end up in the same window. Maybe I'm wrong here though, then please correct me.

On Wed, Jun 10, 2015 at 10:08 AM, Márton Balassi <balassi.mar...@gmail.com> wrote:

> I agree that for the sake of the above mentioned use cases it is reasonable
> to add this to the release with the right documentation, for machine
> learning potentially losing one round of feedback data should not matter.
>
> Let us not block prominent users until the next release on this.
>
> On Wed, Jun 10, 2015 at 8:09 AM, Gyula Fóra <gyula.f...@gmail.com> wrote:
>
>> As for people currently suffering from it:
>>
>> An application King is developing requires iterations, and they need
>> checkpoints. Practically all SAMOA programs would need this.
>>
>> It is very likely that the state interfaces will be changed after the
>> release, so this is not something that we can just add later. I don't see a
>> reason why we should not add it, as it is clearly documented. In this
>> actual case not having guarantees at all means people will never use it in
>> any production system. Having limited guarantees means that it will depend
>> on the application.
>>
>> On Wed, Jun 10, 2015 at 12:53 AM, Ufuk Celebi <u...@apache.org> wrote:
>>
>> > Hey Gyula,
>> >
>> > I understand your reasoning, but I don't think it's worth rushing this
>> > into the release.
>> >
>> > As you've said, we cannot give precise guarantees. But this is arguably
>> > one of the key requirements for any fault tolerance mechanism. Therefore
>> > I disagree that this is better than not having anything at all. I think
>> > it will already go a long way to have the non-iterative case working
>> > reliably.
>> >
>> > And as far as I know there are no users really suffering from this at the
>> > moment (in the sense that someone has complained on the mailing list).
>> >
>> > Hence, I vote to postpone this.
>> >
>> > – Ufuk
>> >
>> > On 10 Jun 2015, at 00:19, Gyula Fóra <gyf...@apache.org> wrote:
>> >
>> > > Hey all,
>> > >
>> > > It is currently impossible to enable state checkpointing for iterative
>> > > jobs, because an exception is thrown when creating the jobgraph. This
>> > > behaviour is motivated by the lack of precise guarantees that we can
>> > > give with the current fault-tolerance implementations for cyclic graphs.
>> > >
>> > > This PR <https://github.com/apache/flink/pull/812> adds an optional
>> > > flag to force checkpoints even in case of iterations. The algorithm will
>> > > take checkpoints periodically as before, but records in transit inside
>> > > the loop will be lost.
>> > >
>> > > However, even this guarantee is enough for most applications (Machine
>> > > Learning for instance) and certainly much better than not having
>> > > anything at all.
>> > >
>> > > I suggest we add this to the 0.9 release as currently many applications
>> > > suffer from this limitation (SAMOA, ML pipelines, graph streaming etc.)
>> > >
>> > > Cheers,
>> > >
>> > > Gyula
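
To make the topology check proposed at the top of this mail a bit more concrete, here is a minimal sketch of what "no checkpointed operators inside iterations" could look like. It is only an illustration under assumptions: the job is modelled as a plain adjacency map of operator names, an operator counts as "inside an iteration" if it can reach itself through a feedback edge, and `IterationCheck`, `isInsideIteration` and `checkpointedInsideIterations` are hypothetical names, not Flink APIs.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for a job topology; none of these names are Flink APIs.
final class IterationCheck {

    // True if 'start' can reach itself over at least one edge,
    // i.e. it lies on a feedback loop.
    static boolean isInsideIteration(String start, Map<String, List<String>> edges) {
        Deque<String> pending = new ArrayDeque<>(edges.getOrDefault(start, List.of()));
        Set<String> seen = new HashSet<>();
        while (!pending.isEmpty()) {
            String node = pending.pop();
            if (node.equals(start)) {
                return true;
            }
            if (seen.add(node)) {
                pending.addAll(edges.getOrDefault(node, List.of()));
            }
        }
        return false;
    }

    // Sorted names of checkpointed operators that sit inside an iteration.
    // An empty result means checkpointing could be enabled safely.
    static Set<String> checkpointedInsideIterations(
            Set<String> checkpointed, Map<String, List<String>> edges) {
        Set<String> offenders = new TreeSet<>();
        for (String operator : checkpointed) {
            if (isInsideIteration(operator, edges)) {
                offenders.add(operator);
            }
        }
        return offenders;
    }

    public static void main(String[] args) {
        // source -> head -> step -> sink, with a feedback edge step -> head.
        Map<String, List<String>> edges = Map.of(
                "source", List.of("head"),
                "head", List.of("step"),
                "step", List.of("head", "sink"));
        Set<String> checkpointed = Set.of("source", "step", "sink");
        System.out.println(checkpointedInsideIterations(checkpointed, edges));
    }
}
```

In this toy topology only `step` is reported, since `source` and `sink` sit before and after the loop. The same membership test could presumably be reused to reject windows inside iterations.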