[ https://issues.apache.org/jira/browse/FLINK-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17269092#comment-17269092 ]
Matthias commented on FLINK-6042:
---------------------------------

We have two approaches (which we discussed offline) to implement this feature:

# The {{JobExceptionsHandler}} does most of the work by iterating over the {{ArchivedExecution}}s of the passed {{ArchivedExecutionGraph}}. {{ArchivedExecution}}s provide the time (through {{ArchivedExecution.stateTimestamps}}) and the thrown exception ({{ArchivedExecution.failureCause}}). The {{SchedulerNG}} implementation would need to collect a mapping of {{failureCause}} to {{ExecutionAttemptID}} and pass it over to the {{JobExceptionsHandler}} along with the {{ArchivedExecutionGraph}}. This would enable the handler to group exceptions that happened due to the same failure case.
+Pros:+
- This approach has the advantage of using mostly code that is already there.
- No extra code in the {{SchedulerBase}} implementation.
+Cons:+
- It does not support restarts of the {{ExecutionGraph}}. This restart functionality is planned for the declarative scheduler which we're currently working on (see [FLIP-160|https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Declarative+Scheduler]). Only the most recent {{ExecutionGraph}} (and, therefore, its exceptions) is provided.
- Modifications to the internally used data structures might be necessary to allow random access based on {{ExecutionAttemptID}} instead of iterating over collections.

# The collection of exceptions happens in the scheduler. The mapping of root cause to related exceptions is then passed over to the {{JobExceptionsHandler}}. The exceptions can be collected as they appear.
+Pros:+
- It makes it easier to port this functionality into the declarative scheduler of FLIP-160. We don't need to think of a history of {{ArchivedExecutionGraph}}s for now. Restarts of the {{ExecutionGraph}} are hidden away from the {{JobExceptionsHandler}}.
+Cons:+
- The {{SchedulerBase}} code base grows once more, which increases complexity.

We decided to go with option 2 for now.
This makes it easier for us to implement the functionality into the declarative scheduler of FLIP-160.

> Display last n exceptions/causes for job restarts in Web UI
> -----------------------------------------------------------
>
>                 Key: FLINK-6042
>                 URL: https://issues.apache.org/jira/browse/FLINK-6042
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination, Runtime / Web Frontend
>    Affects Versions: 1.3.0
>            Reporter: Till Rohrmann
>            Assignee: Matthias
>            Priority: Major
>              Labels: pull-request-available
>
> Users requested that it would be nice to see the last {{n}} exceptions
> causing a job restart in the Web UI. This will help to more easily debug and
> operate a job.
> We could store the root causes for failures similar to how prior executions
> are stored in the {{ExecutionVertex}} using the {{EvictingBoundedList}} and
> then serve this information via the Web UI.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
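
To make option 2 from the comment above more concrete, here is a minimal sketch of a scheduler-side history that records root causes as they appear, groups related exceptions under them, and keeps only the last {{n}} entries, as the issue description asks. All names here ({{ExceptionHistory}}, {{FailureEntry}}, {{recordRootCause}}, {{snapshot}}) are hypothetical and are not Flink API. A plain {{String}} stands in for {{ExecutionAttemptID}}, and a simple deque stands in for a bounded evicting list.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

/** Hypothetical entry: one root cause plus the exceptions attributed to the same failure. */
final class FailureEntry {
    final Throwable rootCause;
    final long timestamp;
    final String attemptId; // assumption: stands in for an ExecutionAttemptID
    private final List<Throwable> related = new ArrayList<>();

    FailureEntry(Throwable rootCause, long timestamp, String attemptId) {
        this.rootCause = rootCause;
        this.timestamp = timestamp;
        this.attemptId = attemptId;
    }

    /** Attaches an exception that was caused by the same failure as the root cause. */
    void addRelated(Throwable t) {
        related.add(t);
    }

    List<Throwable> related() {
        return Collections.unmodifiableList(related);
    }
}

/** Hypothetical bounded history kept by the scheduler; the oldest entry is evicted first. */
final class ExceptionHistory {
    private final int maxSize;
    private final Deque<FailureEntry> entries = new ArrayDeque<>();

    ExceptionHistory(int maxSize) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        this.maxSize = maxSize;
    }

    /** Records a new root cause as it appears, evicting the oldest entry when full. */
    FailureEntry recordRootCause(Throwable cause, long timestamp, String attemptId) {
        if (entries.size() == maxSize) {
            entries.removeFirst();
        }
        FailureEntry entry = new FailureEntry(cause, timestamp, attemptId);
        entries.addLast(entry);
        return entry;
    }

    /** Returns a copy, oldest first, that could be handed over to a REST handler. */
    List<FailureEntry> snapshot() {
        return new ArrayList<>(entries);
    }
}
```

Because the history lives outside any single {{ExecutionGraph}} in this sketch, entries would survive a restart of the graph, which is the property the comment cites in favor of option 2.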