[GitHub] flink issue #4828: [FLINK-4816] [checkpoints] Executions failed from "DEPLOY...

StephanEwen Tue, 09 Jan 2018 10:01:29 -0800

Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/4828
  
    I think this approach is not yet sufficient. A failure in DEPLOY can 
happen for various reasons, and a failed checkpoint restore is only one of 
them.
    
    This also couples the execution graph state to the checkpoint 
coordinator (via the last restored checkpoint ID), which breaks the design 
and its separation of responsibilities.
    
    A proper solution here is probably a bit more comprehensive and needs a 
bit more thinking, probably a bigger design document. My first thought would be 
to report a proper RestoreException from the TaskManager, keep a history of the 
exceptions that triggered recovery, use that history to evaluate fallback, etc.
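
    A rough sketch of that idea, purely for illustration. None of these 
classes exist in Flink: RestoreException is only the name proposed above, and 
RecoveryHistory, shouldFallBack and the "N consecutive restore failures" rule 
are hypothetical choices made for this example, not a design proposal.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

// Hypothetical exception a TaskManager could report when state restore fails.
// It carries the checkpoint ID so the JobManager side needs no extra coupling
// between the execution graph and the checkpoint coordinator.
class RestoreException extends Exception {
    private final long checkpointId;

    RestoreException(long checkpointId, String message, Throwable cause) {
        super(message, cause);
        this.checkpointId = checkpointId;
    }

    long getCheckpointId() {
        return checkpointId;
    }
}

// Hypothetical history of the exceptions that triggered recovery.
class RecoveryHistory {
    private final Deque<Throwable> failures = new ArrayDeque<>();
    private final int maxConsecutiveRestoreFailures;

    RecoveryHistory(int maxConsecutiveRestoreFailures) {
        this.maxConsecutiveRestoreFailures = maxConsecutiveRestoreFailures;
    }

    // Record the cause of a recovery, whatever it was.
    void record(Throwable cause) {
        failures.addLast(cause);
    }

    // Assumed policy: fall back to an earlier checkpoint only if the most
    // recent recoveries were all restore failures of this same checkpoint.
    // Any other failure cause (network, user code, ...) resets the count.
    boolean shouldFallBack(long checkpointId) {
        int consecutive = 0;
        Iterator<Throwable> newestFirst = failures.descendingIterator();
        while (newestFirst.hasNext()) {
            Throwable t = newestFirst.next();
            if (t instanceof RestoreException
                    && ((RestoreException) t).getCheckpointId() == checkpointId) {
                consecutive++;
            } else {
                break;
            }
        }
        return consecutive >= maxConsecutiveRestoreFailures;
    }
}
```

    The point of the sketch is that the fallback decision is derived from the 
reported failure causes alone, so a DEPLOY failure that has nothing to do with 
restore never counts against a checkpoint.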
