[jira] [Commented] (FLINK-6042) Display last n exceptions/causes for job restarts in Web UI

Matthias (Jira) Wed, 20 Jan 2021 23:32:07 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17269092#comment-17269092
 ]


Matthias commented on FLINK-6042:
---------------------------------

We have two approaches (which we discussed offline) to implement this feature:
 # The {{JobExceptionsHandler}} does most of the work by iterating over the 
{{ArchivedExecution}}s of the passed {{ArchivedExecutionGraph}}. 
{{ArchivedExecutions}} provide the time (through 
{{ArchivedExecution.stateTimestamps}}) and the thrown exception 
({{ArchivedExecution.failureCause}}). The {{SchedulerNG}} implementation would 
need to collect a mapping of {{failureCause}} to {{ExecutionAttemptID}} and 
pass it over to the {{JobExceptionsHandler}} along with the 
{{ArchivedExecutionGraph}}. This would enable the handler to group exceptions 
that happened due to the same failure cause.
+Pros:+ 
- This approach has the advantage of using mostly code that is already there.
- No extra code in the {{SchedulerBase}} implementation.
+Cons:+ 
- It does not support restarts of the {{ExecutionGraph}}. This restart 
functionality is planned for the declarative scheduler which we're currently 
working on (see 
[FLIP-160|https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Declarative+Scheduler]).
 Only the most recent {{ExecutionGraph}} (and, therefore, its exceptions) is 
provided.
- There might be modifications necessary to the internally used data structures 
allowing random access based on {{ExecutionAttemptID}} instead of iterating 
over collections.
 # The collection of exceptions happens in the scheduler. The mapping of root 
cause to related exceptions is then passed over to the 
{{JobExceptionsHandler}}. The exceptions can be collected as they appear.
+Pros:+ 
- It makes it easier to port this functionality into the declarative 
scheduler of FLIP-160. We don't need to think of a history of 
{{ArchivedExecutionGraphs}} for now. Restarts of the {{ExecutionGraph}} are 
hidden away from the {{JobExceptionsHandler}}.
+Cons:+
- The {{SchedulerBase}} code base grows once more, which increases complexity.

We decided to go with option 2 for now. This makes it easier for us to 
implement the functionality in the declarative scheduler of FLIP-160.
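
To make option 2 a bit more concrete, here is a rough sketch of what 
scheduler-side collection could look like. It is not the actual Flink code: 
{{FailureEntry}} and {{ExceptionHistory}} are hypothetical names made up for 
this illustration, and the fixed-size eviction is only an assumption borrowed 
from the "last n exceptions" wording of the issue. The scheduler would record a 
root cause as it appears, attach related exceptions to it, and hand a snapshot 
to the handler.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical: one failure (root cause) plus the exceptions attributed to it. */
final class FailureEntry {
    final long timestamp;
    final Throwable rootCause;
    final List<Throwable> relatedExceptions = new ArrayList<>();

    FailureEntry(long timestamp, Throwable rootCause) {
        this.timestamp = timestamp;
        this.rootCause = rootCause;
    }
}

/** Hypothetical scheduler-side collector that keeps only the last n failures. */
final class ExceptionHistory {
    private final int maxSize;
    private final Deque<FailureEntry> entries = new ArrayDeque<>();

    ExceptionHistory(int maxSize) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        this.maxSize = maxSize;
    }

    /** Records a new root cause; the oldest entry is evicted once the limit is reached. */
    FailureEntry recordFailure(long timestamp, Throwable rootCause) {
        if (entries.size() == maxSize) {
            entries.removeFirst();
        }
        FailureEntry entry = new FailureEntry(timestamp, rootCause);
        entries.addLast(entry);
        return entry;
    }

    /** Returns a copy (oldest first) that could be passed over to the handler. */
    List<FailureEntry> snapshot() {
        return new ArrayList<>(entries);
    }
}
```

Because the history lives outside of any single {{ExecutionGraph}} in this 
sketch, a restart of the graph would not discard previously collected entries, 
which is the property we are after for FLIP-160.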

> Display last n exceptions/causes for job restarts in Web UI
> -----------------------------------------------------------
>
>                 Key: FLINK-6042
>                 URL: https://issues.apache.org/jira/browse/FLINK-6042
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination, Runtime / Web Frontend
>    Affects Versions: 1.3.0
>            Reporter: Till Rohrmann
>            Assignee: Matthias
>            Priority: Major
>              Labels: pull-request-available
>
> Users requested that it would be nice to see the last {{n}} exceptions 
> causing a job restart in the Web UI. This will help to more easily debug and 
> operate a job.
> We could store the root causes for failures similar to how prior executions 
> are stored in the {{ExecutionVertex}} using the {{EvictingBoundedList}} and 
> then serve this information via the Web UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
