Thanks for this very first proposal! Both the proposed functionality and the way you explained it are super nice. :-)
I think that this has been long overdue in Flink. :-) Having worked on both the ExecutionGraph and IntermediateResults before, I agree that these are the relevant components for this change.

Version 1:
- Conceptually I agree that this is the way to go. I think it's relatively straightforward to do this as you describe (minus all the surprises during implementation ;-))
- Very nice explanation with the figures!
- Since FLIPs will probably also function as documentation, we might link to the nice figures in [1] for people who are not familiar with the details of the ExecutionGraph.

[1] https://ci.apache.org/projects/flink/flink-docs-master/internals/job_scheduling.html#jobmanager-data-structures

Version 2:
- I think that the changes to the intermediate results and pinning will be straightforward.
- An important follow-up for this (probably another FLIP?) will be how we do memory management though. Right now the buffers for the intermediate results come from the "network buffer pool", which is by default very small (64MB). This is not a blocker for the implementation of Version 2, but probably for a good user experience. ;-)

Public API changes:
- RestartStrategy: I would expect this to be interpreted as the maximum total number of task failures.

– Ufuk

On Wed, Jul 13, 2016 at 8:20 AM, Aljoscha Krettek <aljos...@apache.org> wrote:
> I added a FLIP document in the wiki:
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures
>
> For now, this contains the link to the Google Doc and a link to this
> discussion thread. Once a Jira is created for this it should also be added
> there.
>
> On Tue, 12 Jul 2016 at 20:11 Chesnay Schepler <ches...@apache.org> wrote:
>
>> shouldn't the proposal be contained in the wiki instead of GoogleDocs?
>>
>> On 12.07.2016 19:55, Stephan Ewen wrote:
>> > Hi all!
>> >
>> > Here is the very first FLIP (FLink Improvement Proposal): Fine grained
>> > recovery from task failures
>> >
>> > It describes a proposed enhancement for reducing the work done during
>> > recovery.
>> >
>> >
>> https://docs.google.com/document/d/16S584XFzkfFu3MOfVCE0rHZ_JJgQrQuw9SXpanoMiMo
>> >
>> > Please comment in this mail thread, or in the GoogleDoc.
>> >
>> > Best,
>> > Stephan
>> >
>>