[ 
https://issues.apache.org/jira/browse/FLINK-21439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314927#comment-17314927
 ] 

Matthias edited comment on FLINK-21439 at 4/5/21, 3:27 PM:
-----------------------------------------------------------

Hi John,
thank you for your proposal. You correctly identified 
{{SchedulerNG.handleGlobalFailure}} and 
{{SchedulerNG.updateTaskExecutionState}} as the entry points for failure 
handling. This also applies to the {{AdaptiveScheduler}}.

Regarding the proposal to use the {{ExceptionHistoryEntry}}'s static {{from*}} 
factory methods: some work done as part of FLINK-21189 was merged recently. I 
should have pinged you on that one. Sorry for that. A new class 
{{ExceptionHistoryEntryExtractor}} was introduced that deals with collecting 
all relevant information from the {{ExecutionGraph}} to create 
{{RootExceptionHistoryEntry}} instances. This enables us to handle failures 
that were caught while another failure was already being handled.

The {{AdaptiveScheduler}} only deals with global failovers for now (see the 
corresponding 
[FLIP-160|https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler]),
 i.e. all failures are global failures ([~rmetzger] please correct me if I'm 
wrong here). Concurrent failures can still happen, though. These failures are 
"swallowed" in the 
[Restarting|https://github.com/apache/flink/blob/ca968d305a99b63162136589e1d9f6ba4c9cdd2b/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/Restarting.java#L78-L86]
 state. We might want to collect these failures and add them to the corresponding 
{{RootExceptionHistoryEntry}}.
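
To illustrate the idea of collecting concurrent failures instead of swallowing them, here is a minimal sketch. It does not use Flink's actual classes: {{RootFailureSketch}} and {{RestartingSketch}} are hypothetical stand-ins for {{RootExceptionHistoryEntry}} and the {{Restarting}} state, and their constructors and methods are assumptions made only for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for a root entry; NOT Flink's RootExceptionHistoryEntry. */
final class RootFailureSketch {
    private final Throwable rootCause;
    private final long timestamp;
    private final List<Throwable> concurrentFailures = new ArrayList<>();

    RootFailureSketch(Throwable rootCause, long timestamp) {
        this.rootCause = rootCause;
        this.timestamp = timestamp;
    }

    /** Attaches a failure that was reported while the root failure was being handled. */
    void addConcurrentFailure(Throwable failure) {
        concurrentFailures.add(failure);
    }

    Throwable getRootCause() {
        return rootCause;
    }

    long getTimestamp() {
        return timestamp;
    }

    List<Throwable> getConcurrentFailures() {
        return Collections.unmodifiableList(concurrentFailures);
    }
}

/** Hypothetical stand-in for a restarting state; NOT Flink's Restarting class. */
final class RestartingSketch {
    private final RootFailureSketch currentRootFailure;

    RestartingSketch(RootFailureSketch currentRootFailure) {
        this.currentRootFailure = currentRootFailure;
    }

    /** Records the late failure on the current root entry instead of dropping it. */
    void handleGlobalFailure(Throwable cause) {
        currentRootFailure.addConcurrentFailure(cause);
    }
}

final class ConcurrentFailureDemo {
    public static void main(String[] args) {
        RootFailureSketch root =
                new RootFailureSketch(new RuntimeException("root cause"), System.currentTimeMillis());
        RestartingSketch restarting = new RestartingSketch(root);

        // A second failure arrives while the restart is already in progress.
        restarting.handleGlobalFailure(new IllegalStateException("late failure"));

        System.out.println("Concurrent failures: " + root.getConcurrentFailures().size());
    }
}
```

Whether the state should hold the root entry directly or report through a callback is left open here; the sketch only shows that the late failure ends up attached to the entry of the failure that triggered the restart.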

Having the {{BoundedFIFOQueue}} in the {{AdaptiveScheduler}} class makes sense 
to me. But there needs to be a way for the {{State}} implementation to populate 
that collection.
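
One possible shape for this, as a rough sketch only: the scheduler owns the bounded queue and hands the states a narrow callback for archiving failures. All names below ({{BoundedHistorySketch}}, {{FailureArchiveContext}}, {{SchedulerSketch}}, {{StateSketch}}) are hypothetical and written from scratch for illustration; they are not Flink's {{BoundedFIFOQueue}}, {{AdaptiveScheduler}}, or {{State}} APIs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical bounded FIFO collection; NOT Flink's BoundedFIFOQueue. */
final class BoundedHistorySketch<T> {
    private final int maxSize;
    private final Deque<T> elements = new ArrayDeque<>();

    BoundedHistorySketch(int maxSize) {
        if (maxSize < 0) {
            throw new IllegalArgumentException("maxSize must not be negative");
        }
        this.maxSize = maxSize;
    }

    /** Adds an element and evicts the oldest one once the capacity is reached. */
    void add(T element) {
        if (maxSize == 0) {
            return;
        }
        if (elements.size() == maxSize) {
            elements.removeFirst();
        }
        elements.addLast(element);
    }

    /** Returns a copy of the content, oldest element first. */
    List<T> toList() {
        return new ArrayList<>(elements);
    }
}

/** Hypothetical callback through which a state reports failures to the scheduler. */
interface FailureArchiveContext {
    void archiveFailure(Throwable failure);
}

/** Hypothetical scheduler that owns the history and exposes only the callback. */
final class SchedulerSketch implements FailureArchiveContext {
    private final BoundedHistorySketch<Throwable> exceptionHistory;

    SchedulerSketch(int maxHistorySize) {
        this.exceptionHistory = new BoundedHistorySketch<>(maxHistorySize);
    }

    @Override
    public void archiveFailure(Throwable failure) {
        exceptionHistory.add(failure);
    }

    List<Throwable> getExceptionHistory() {
        return exceptionHistory.toList();
    }
}

/** Hypothetical state that knows the callback but not the queue itself. */
final class StateSketch {
    private final FailureArchiveContext context;

    StateSketch(FailureArchiveContext context) {
        this.context = context;
    }

    void onFailure(Throwable failure) {
        context.archiveFailure(failure);
    }
}

final class ExceptionHistoryDemo {
    public static void main(String[] args) {
        SchedulerSketch scheduler = new SchedulerSketch(2);
        StateSketch state = new StateSketch(scheduler);

        state.onFailure(new RuntimeException("first"));
        state.onFailure(new RuntimeException("second"));
        state.onFailure(new RuntimeException("third"));

        // Only the most recent failures are retained, up to the configured bound.
        for (Throwable t : scheduler.getExceptionHistory()) {
            System.out.println(t.getMessage());
        }
    }
}
```

Keeping the queue private to the scheduler and passing the states only an interface means a state can add entries but cannot read or clear the history.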

Does this make sense to you?


> Add support for exception history
> ---------------------------------
>
>                 Key: FLINK-21439
>                 URL: https://issues.apache.org/jira/browse/FLINK-21439
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.13.0
>            Reporter: Matthias
>            Assignee: John Phelan
>            Priority: Major
>             Fix For: 1.13.0
>
>          Time Spent: 3h
>  Remaining Estimate: 0h
>
> {{SchedulerNG.requestJob}} returns an {{ExecutionGraphInfo}} that was 
> introduced in FLINK-21188. This {{ExecutionGraphInfo}} holds the information 
> about the {{ArchivedExecutionGraph}} and exception history information. 
> Currently, it's a list of {{ErrorInfos}}. This might change due to ongoing 
> work in FLINK-21190 where we might introduce a wrapper class with more 
> information on the failure.
> The goal of this ticket is to implement the exception history for the 
> {{AdaptiveScheduler}}, i.e. collecting the exceptions that caused restarts. 
> This collection of failures should be forwarded through 
> {{SchedulerNG.requestJob}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
