[ 
https://issues.apache.org/jira/browse/FLINK-20833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265814#comment-17265814
 ] 

Robert Metzger commented on FLINK-20833:
----------------------------------------

Thanks a lot for providing a PoC! This makes the discussion a lot easier!

Do you know whether there is already a metric for the number of exceptions 
and for the time since the last exception?
If not, it might make sense to add these as a default listener implementation.
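
To make the idea of a default listener concrete, here is a minimal sketch in 
plain Java. The {{ExceptionListener}} interface below is a hypothetical 
stand-in (the PoC's actual signature may differ), and 
{{CountingExceptionListener}} and its getters are names invented for this 
example. In Flink the two getters would presumably be exposed as gauges on a 
metric group; that wiring is left out here.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the listener interface from the PoC.
interface ExceptionListener {
    void onException(Throwable cause);
}

// Sketch of a default listener tracking the exception count and the
// time elapsed since the most recent exception.
final class CountingExceptionListener implements ExceptionListener {
    private final Clock clock;
    private final AtomicLong numExceptions = new AtomicLong();
    private final AtomicReference<Instant> lastException = new AtomicReference<>();

    CountingExceptionListener(Clock clock) {
        this.clock = clock;
    }

    @Override
    public void onException(Throwable cause) {
        numExceptions.incrementAndGet();
        lastException.set(clock.instant());
    }

    long getNumExceptions() {
        return numExceptions.get();
    }

    // Returns -1 while no exception has been observed.
    long getMillisSinceLastException() {
        Instant last = lastException.get();
        return last == null ? -1L : Duration.between(last, clock.instant()).toMillis();
    }
}
```

Passing the {{Clock}} in through the constructor keeps the elapsed-time value 
testable with a fixed clock.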

Secondly, we are currently working on adding another scheduler. Once that is 
implemented, not all schedulers will support the ExceptionListener. I'm 
wondering whether we should move the initialization to another location (into 
the JobMaster, which would then pass the listener into the scheduler factory).
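
A rough sketch of that wiring, in plain Java rather than against Flink's 
classes: every type below ({{Scheduler}}, {{SchedulerFactory}}, 
{{JobMasterSketch}} and the two factories) is a simplified, hypothetical 
stand-in and does not mirror the real runtime interfaces. It only illustrates 
that the listeners are created once outside the scheduler and that each 
factory decides whether its scheduler uses them.

```java
import java.util.List;

// Hypothetical stand-in for the listener interface from the PoC.
interface ExceptionListener {
    void onException(Throwable cause);
}

// Simplified, hypothetical scheduler abstraction.
interface Scheduler {
    void handleFailure(Throwable cause);
}

// The factory receives the listeners; the scheduler no longer loads them itself.
interface SchedulerFactory {
    Scheduler create(List<ExceptionListener> listeners);
}

// A scheduler that supports listeners forwards every failure to them.
final class NotifyingSchedulerFactory implements SchedulerFactory {
    @Override
    public Scheduler create(List<ExceptionListener> listeners) {
        return cause -> listeners.forEach(l -> l.onException(cause));
    }
}

// A scheduler without listener support simply ignores the argument.
final class IgnoringSchedulerFactory implements SchedulerFactory {
    @Override
    public Scheduler create(List<ExceptionListener> listeners) {
        return cause -> { };
    }
}

// Stand-in for the component that owns listener initialization.
final class JobMasterSketch {
    private final Scheduler scheduler;

    JobMasterSketch(SchedulerFactory factory, List<ExceptionListener> listeners) {
        this.scheduler = factory.create(listeners);
    }

    void onTaskFailure(Throwable cause) {
        scheduler.handleFailure(cause);
    }
}
```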

Because the listeners are loaded through the ServiceLoader, this feature will 
be very difficult to discover. Let's make sure we add this to the 
documentation.
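
For the documentation, a small sketch of how ServiceLoader discovery works may 
help. The {{ExceptionListener}} interface and the {{ExceptionListenerLoader}} 
helper are hypothetical names for this example; only 
{{java.util.ServiceLoader}} is the real JDK API. An implementation becomes 
visible only if its jar contains a file under {{META-INF/services/}} named 
after the fully qualified interface name and listing the implementation class, 
which is exactly the step users tend to miss.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical stand-in for the listener interface from the PoC.
interface ExceptionListener {
    void onException(Throwable cause);
}

final class ExceptionListenerLoader {
    private ExceptionListenerLoader() { }

    // Collects every implementation registered via META-INF/services
    // that is visible to the given class loader.
    static List<ExceptionListener> loadAll(ClassLoader classLoader) {
        List<ExceptionListener> listeners = new ArrayList<>();
        for (ExceptionListener listener
                : ServiceLoader.load(ExceptionListener.class, classLoader)) {
            listeners.add(listener);
        }
        return listeners;
    }
}
```

If no provider file is on the classpath, the returned list is empty and no 
error is raised, so a missing registration fails silently.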

Lastly, I guess we can use Flink's plugin mechanism via 
{{PluginUtils.createPluginManagerFromRootFolder(flinkConfig)}}. This will 
create a separate classloader per {{ExceptionListener}}, avoiding dependency 
conflicts with Flink's classpath (I haven't used this myself, but from a quick 
look, it seems easy to use).

> Expose pluggable interface for exception analysis and metrics reporting in 
> Execution Graph
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-20833
>                 URL: https://issues.apache.org/jira/browse/FLINK-20833
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.12.0
>            Reporter: Zhenqiu Huang
>            Priority: Minor
>
> Platform users of Apache Flink usually want to classify the failure reason 
> (for example user code, networking, or dependencies) of Flink jobs and emit 
> metrics for the analyzed results, so that the platform can provide an 
> accurate value for system reliability by distinguishing failures due to 
> user logic from system issues. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
