[ https://issues.apache.org/jira/browse/FLINK-21439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312898#comment-17312898 ]
John Phelan commented on FLINK-21439:
-------------------------------------

Hi [~mapohl] [~rmetzger],

It seems that the {{DefaultScheduler}} enqueues most exceptions that reach either of its methods {{handleGlobalFailure}} or {{updateTaskExecutionState}}. Does it make sense for {{AdaptiveScheduler}} to have the same behavior?

What do you think of putting the entries into the {{ExceptionHistory}} using the existing static {{from*}} methods of {{ExceptionHistoryEntry}}? This would mean that the exception history returned by the REST API would not distinguish between different {{ExecutionGraphs}}. Maybe that is adequate for now? I could see us choosing to distinguish new {{ExecutionGraphs}} (rescalings) in multiple ways: perhaps a red "rescaled" partition line on the time axis in the GUI, or alternatively more information in each history entry.

h1. Test Plan

I have pushed an initial failing test case to https://github.com/bytesandwich/flink/commit/1947509da9e8a8ed08805d84668f73a31c6570ad#diff-feebf3ead09172b03a2c89ca60935e48e23ed2c6d3a83f5486517c03790a76c9R677. The commit also includes a few similar test cases (WIP).

h1. Implementation Plan

# Add a {{BoundedFIFOQueue}} to {{AdaptiveScheduler}} along with {{MAX_EXCEPTION_HISTORY_SIZE}}.
# In {{AdaptiveScheduler.handleGlobalFailure}} and {{AdaptiveScheduler.updateTaskExecutionState}}, retrieve and enqueue the appropriate entry based on what is available from the current {{state}}:
* {{ExceptionHistoryEntry.fromGlobalFailure}}
* {{ExceptionHistoryEntry.fromFailedExecution}}
# Implement {{AdaptiveScheduler.getExceptionHistory}} and add its result to {{AdaptiveScheduler.requestJob}}.

Getting the task name:

{code:java}
// taskExecutionTransition: the transition handed to updateTaskExecutionState (variable name assumed)
Optional<String> taskName =
        Optional.ofNullable(
                        executionGraph.getRegisteredExecutions().get(taskExecutionTransition.getID()))
                .map(execution -> execution.getVertex().getTaskNameWithSubtaskIndex());
{code}

> Add support for exception history
> ---------------------------------
>
>                 Key: FLINK-21439
>                 URL: https://issues.apache.org/jira/browse/FLINK-21439
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.13.0
>            Reporter: Matthias
>            Assignee: John Phelan
>            Priority: Major
>             Fix For: 1.13.0
>
>
> {{SchedulerNG.requestJob}} returns an {{ExecutionGraphInfo}} that was
> introduced in FLINK-21188. This {{ExecutionGraphInfo}} holds the information
> about the {{ArchivedExecutionGraph}} and exception history information.
> Currently, it's a list of {{ErrorInfos}}. This might change due to ongoing
> work in FLINK-21190 where we might introduce a wrapper class with more
> information on the failure.
> The goal of this ticket is to implement the exception history for the
> {{AdaptiveScheduler}}, i.e. collecting the exceptions that caused restarts.
> This collection of failures should be forwarded through
> {{SchedulerNG.requestJob}}.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
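
Addendum to the Implementation Plan in the comment above: a minimal, standalone sketch of the bounded-history idea from steps 1 and 3. It is not Flink code. {{BoundedHistorySketch}}, its {{add}} and {{snapshot}} methods, and the generic entry type are hypothetical names chosen for illustration; the real change would presumably reuse the existing {{BoundedFIFOQueue}} and {{ExceptionHistoryEntry}} classes instead.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/**
 * Hypothetical stand-in for a bounded exception history: keeps at most
 * maxSize entries and evicts the oldest entry once the bound is reached.
 */
final class BoundedHistorySketch<T> {

    private final int maxSize;
    private final Deque<T> entries = new ArrayDeque<>();

    BoundedHistorySketch(int maxSize) {
        if (maxSize < 0) {
            throw new IllegalArgumentException("maxSize must not be negative");
        }
        this.maxSize = maxSize;
    }

    /** Appends an entry, dropping the oldest one if the history is full. */
    void add(T entry) {
        if (maxSize == 0) {
            return; // a zero-sized history stores nothing
        }
        if (entries.size() == maxSize) {
            entries.removeFirst();
        }
        entries.addLast(entry);
    }

    /** Returns a copy ordered from oldest to newest, e.g. for a requestJob-style response. */
    List<T> snapshot() {
        return new ArrayList<>(entries);
    }
}
```

In this sketch the failure handlers would call {{add}} with the entry they build, and the getter from step 3 would return {{snapshot()}}. Returning a copy keeps callers from mutating the scheduler's internal state.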