Thanks for summarizing the current state of FLIP-1 and outlining the way to
move forward with it, Chesnay.

I think we should implement the first version of the backtracking logic
using the DataConsumptionException (FLINK-6227) to signal if an
intermediate result partition has been lost.

Moreover, I think it would be best to base the new implementation on the
refined FailoverStrategy interface proposed by the scheduler refactorings
[1]. We could have an adaptor to make it work with the existing code for
testing purposes and until the scheduler interfaces have been introduced.
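
To make the adaptor idea a bit more concrete, here is a rough sketch. Every
name in it (NewFailoverStrategy, LegacyFailoverStrategy, TaskRestarter,
LegacyFailoverStrategyAdaptor) is a hypothetical stand-in, not the actual
interface from [1] or from the existing code, and tasks are identified by
plain strings to keep it self-contained:

```java
import java.util.Set;

// Hypothetical stand-in for the refined strategy interface: it only
// decides which tasks have to be restarted for a given failure.
interface NewFailoverStrategy {
    Set<String> getTasksNeedingRestart(String failedTaskId, Throwable cause);
}

// Hypothetical stand-in for the callback-style contract that the
// existing code calls when a task fails.
interface LegacyFailoverStrategy {
    void onTaskFailure(String failedTaskId, Throwable cause);
}

// Hypothetical hook that performs the actual restart of the given tasks.
interface TaskRestarter {
    void restart(Set<String> taskIds);
}

// Adaptor: exposes the legacy callback, delegates the decision to the new
// strategy and hands the resulting set of tasks to the restarter.
final class LegacyFailoverStrategyAdaptor implements LegacyFailoverStrategy {
    private final NewFailoverStrategy delegate;
    private final TaskRestarter restarter;

    LegacyFailoverStrategyAdaptor(NewFailoverStrategy delegate, TaskRestarter restarter) {
        this.delegate = delegate;
        this.restarter = restarter;
    }

    @Override
    public void onTaskFailure(String failedTaskId, Throwable cause) {
        restarter.restart(delegate.getTasksNeedingRestart(failedTaskId, cause));
    }
}
```

With this shape the new strategy only computes the set of tasks to restart,
while the adaptor keeps the callback contract the existing code expects, so
it can be dropped once the scheduler interfaces are in place.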

Apart from that, +1 for completing Flink's first improvement proposal :-)

[1]
https://docs.google.com/document/d/1fstkML72YBO1tGD_dmG2rwvd9bklhRVauh4FSsDDwXU/edit?usp=sharing

Cheers,
Till

On Sun, Apr 14, 2019 at 8:20 PM Chesnay Schepler <ches...@apache.org> wrote:

> Hello everyone,
>
> Till, Zhu Zhu and I have prepared a Design Document
> <https://docs.google.com/document/d/1YHOpMLdC-dtgjcM-EDn6v-oXgsEQKXSoMjqRcYVbJA8>
> for introducing backtracking for failover regions. This is an
> optimization of the failure handling logic for jobs with blocking result
> partitions (which primarily exist in batch jobs), where only part of the
> job has to be restarted.
> This is a continuation of the FLIP-1
> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures>
> efforts to introduce fine-grained recovery from task failures.
> The associated JIRA can be found here
> <https://issues.apache.org/jira/browse/FLINK-12068>.
>
> Any feedback is highly appreciated.
>
> Regards,
> Chesnay
>
