[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17280870#comment-17280870 ]
Piotr Nowojski commented on FLINK-17726:
----------------------------------------

Yes/maybe [~zhuzh]. I think you summarised the gist of the idea correctly. However, there is one potential improvement:
{quote}
For "secondary" failures, given that the related "primary" failure should always be reported sooner or later, JM can simply mark the task as CANCELED and skip the failure handling.
{quote}
Maybe not in the first version, or maybe already in the first version, [~trohrmann] would like to tackle the problem of speeding up failover, so that we do not have to wait for the primary failure to arrive. If the JM already knows that some tasks have started to fail (with secondary failures), it can already fail over the job/region, instead of waiting, for example, 1 minute for the heartbeats to time out.

One thing that is not clear to me is how to detect the primary failure in such a case. Maybe we would need to fail over the job but still keep collecting the failure reasons for the previous attempt, and keep updating the detected root cause lazily? For example, if we have a chain of 4 tasks:

A->B->C->D

Maybe the TaskManager handling A will fail silently, but the first error message the JM receives will be from D, then a second later from C, then a second later from B, and 1 minute later a timeout of A.

Also note that there is no pressure to fix this right now.

> Scheduler should take care of tasks directly canceled by TaskManager
> --------------------------------------------------------------------
>
>                 Key: FLINK-17726
>                 URL: https://issues.apache.org/jira/browse/FLINK-17726
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Runtime / Task
>    Affects Versions: 1.11.0, 1.12.0
>            Reporter: Zhu Zhu
>            Priority: Critical
>
> JobManager will not trigger failure handling when receiving a CANCELED task
> update.
> This is because CANCELED tasks are usually caused by another FAILED task.
> These CANCELED tasks will be restarted by the failover process triggered by
> the FAILED task.
> However, if a task is directly CANCELED by TaskManager due to its own runtime
> issue, the task will not be recovered by JM and thus the job would hang.
> This is a potential issue and we should avoid it.
> A possible solution is to let JobManager treat tasks transitioning to
> CANCELED from all states except CANCELING as failed tasks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
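
The possible solution named at the end of the issue description (treat a transition to CANCELED from any state other than CANCELING as a failure) can be sketched roughly as follows. This is only an illustration in Python, not Flink code: the `ExecutionState` enum here is a simplified stand-in that reuses the state names from this thread, and `needs_failure_handling` is a hypothetical helper, not an existing JobManager method.

```python
from enum import Enum


class ExecutionState(Enum):
    # Simplified, assumed subset of task states; names follow this thread.
    RUNNING = "RUNNING"
    CANCELING = "CANCELING"
    CANCELED = "CANCELED"
    FAILED = "FAILED"


def needs_failure_handling(previous: ExecutionState, reported: ExecutionState) -> bool:
    """Hypothetical JM-side check: should this task state update trigger failover?"""
    if reported is ExecutionState.FAILED:
        return True
    # CANCELED is expected only if the JM requested it (previous state CANCELING).
    # A CANCELED update from any other state was initiated by the TaskManager
    # itself, so it is treated like a failure.
    return (
        reported is ExecutionState.CANCELED
        and previous is not ExecutionState.CANCELING
    )
```

With this check, a task that goes RUNNING -> CANCELED would trigger failure handling, while CANCELING -> CANCELED would not. It does not address the open question from the comment above about identifying the primary failure.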