[ https://issues.apache.org/jira/browse/FLINK-21439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314927#comment-17314927 ]
Matthias edited comment on FLINK-21439 at 4/5/21, 3:27 PM:
-----------------------------------------------------------

Hi John,
thank you for your proposal. You correctly identified {{SchedulerNG.handleGlobalFailure}} and {{SchedulerNG.updateTaskExecutionState}} as the entry points for failure handling. This also applies to the {{AdaptiveScheduler}}.

About proposing to use the {{ExceptionHistoryEntry}}'s static {{from*}} factory methods: There was some work done as part of FLINK-21189 that was recently merged. I should have pinged you on that one. Sorry for that. A new class {{ExceptionHistoryEntryExtractor}} was introduced that collects all relevant information from the {{ExecutionGraph}} to create {{RootExceptionHistoryEntry}} instances. This enables us to handle failures that were caught while another failure was already being handled.

The {{AdaptiveScheduler}} only deals with global failovers for now (see the corresponding [FLIP-160|https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler]), i.e. all failures are global failures ([~rmetzger] please correct me if I'm wrong here). Concurrent failures can still happen, though. These failures are "swallowed" in the [Restarting|https://github.com/apache/flink/blob/ca968d305a99b63162136589e1d9f6ba4c9cdd2b/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/Restarting.java#L78-L86] state. We might want to collect these failures and add them to the corresponding {{RootExceptionHistoryEntry}}.

Having the {{BoundedFIFOQueue}} in the {{AdaptiveScheduler}} class makes sense to me. But there needs to be a way for the {{State}} implementations to populate that collection. Does this make sense to you?

> Add support for exception history
> ---------------------------------
>
>                 Key: FLINK-21439
>                 URL: https://issues.apache.org/jira/browse/FLINK-21439
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.13.0
>            Reporter: Matthias
>            Assignee: John Phelan
>            Priority: Major
>             Fix For: 1.13.0
>
>          Time Spent: 3h
>  Remaining Estimate: 0h
>
> {{SchedulerNG.requestJob}} returns an {{ExecutionGraphInfo}} that was
> introduced in FLINK-21188. This {{ExecutionGraphInfo}} holds the information
> about the {{ArchivedExecutionGraph}} and exception history information.
> Currently, it's a list of {{ErrorInfos}}. This might change due to ongoing
> work in FLINK-21190 where we might introduce a wrapper class with more
> information on the failure.
> The goal of this ticket is to implement the exception history for the
> {{AdaptiveScheduler}}, i.e. collecting the exceptions that caused restarts.
> This collection of failures should be forwarded through
> {{SchedulerNG.requestJob}}.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
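
For illustration only, the idea raised in the comment (a bounded history owned by the scheduler, plus a way for the state implementations to populate it) could be sketched roughly as below. This is a minimal, self-contained sketch and not Flink code: {{BoundedHistory}}, {{FailureRecord}}, {{FailureArchiver}}, {{SketchScheduler}} and {{SketchRestartingState}} are hypothetical stand-ins, and the real {{BoundedFIFOQueue}}, {{RootExceptionHistoryEntry}} and {{State}} types may look quite different.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

/** Hypothetical bounded FIFO collection; a stand-in, not Flink's BoundedFIFOQueue. */
final class BoundedHistory<T> {
    private final int maxSize;
    private final Deque<T> entries = new ArrayDeque<>();

    BoundedHistory(int maxSize) {
        if (maxSize < 0) {
            throw new IllegalArgumentException("maxSize must be >= 0");
        }
        this.maxSize = maxSize;
    }

    /** Adds an entry and evicts the oldest one once the bound is reached. */
    void add(T entry) {
        if (maxSize == 0) {
            return;
        }
        if (entries.size() == maxSize) {
            entries.removeFirst();
        }
        entries.addLast(entry);
    }

    /** Returns a copy of the entries, oldest first. */
    List<T> snapshot() {
        return new ArrayList<>(entries);
    }
}

/** Hypothetical history entry: a root failure plus failures observed concurrently. */
final class FailureRecord {
    final Throwable root;
    final List<Throwable> concurrent;

    FailureRecord(Throwable root, List<Throwable> concurrent) {
        this.root = root;
        this.concurrent = Collections.unmodifiableList(new ArrayList<>(concurrent));
    }
}

/** Hypothetical callback through which a state hands failures to the scheduler. */
interface FailureArchiver {
    void archiveFailure(Throwable root, List<Throwable> concurrent);
}

/** Simplified scheduler (assumption): owns the bounded history and exposes a snapshot. */
final class SketchScheduler implements FailureArchiver {
    private final BoundedHistory<FailureRecord> history;

    SketchScheduler(int maxHistorySize) {
        this.history = new BoundedHistory<>(maxHistorySize);
    }

    @Override
    public void archiveFailure(Throwable root, List<Throwable> concurrent) {
        history.add(new FailureRecord(root, concurrent));
    }

    List<FailureRecord> exceptionHistory() {
        return history.snapshot();
    }
}

/** Simplified restarting state (assumption): collects late failures instead of dropping them. */
final class SketchRestartingState {
    private final FailureArchiver archiver;
    private final Throwable rootCause;
    private final List<Throwable> concurrentFailures = new ArrayList<>();

    SketchRestartingState(FailureArchiver archiver, Throwable rootCause) {
        this.archiver = archiver;
        this.rootCause = rootCause;
    }

    /** Failures arriving while the restart is in progress are remembered. */
    void handleGlobalFailure(Throwable cause) {
        concurrentFailures.add(cause);
    }

    /** On leaving the state, the root cause and its concurrent failures are archived together. */
    void onLeave() {
        archiver.archiveFailure(rootCause, concurrentFailures);
    }
}
```

Passing a narrow callback to the state, rather than the queue itself, keeps the collection private to the scheduler; whether that matches the intended design for the {{AdaptiveScheduler}} is an open question in the discussion above.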