Thanks for the engagement on the thread! Sorry for the late reply; I was away
on holiday for a bit.

@Paul

Thanks for linking the historical discussion. Yes, I agree that checking
whether an exception type was loaded by a user classloader rather than the
system classloader would be helpful.
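
To make the classloader idea concrete, here is a rough sketch of what such a
check could look like. The class and method names (ClassLoaderClassifier,
originatesFromUserCode, MyJobException) are made up for illustration and are
not existing Flink APIs; it also assumes the caller can supply the user code
classloader of the job.

```java
/** Hypothetical helper, not part of Flink: classifies a failure by classloader. */
final class ClassLoaderClassifier {

    /** Stand-in for an exception type that would be defined in a user jar. */
    static final class MyJobException extends RuntimeException {
        MyJobException(String message) {
            super(message);
        }
    }

    /**
     * Returns true if any exception in the cause chain is an instance of a
     * class loaded by userLoader or by one of its descendant classloaders.
     */
    static boolean originatesFromUserCode(Throwable failure, ClassLoader userLoader) {
        int depth = 0;
        // The depth bound guards against cyclic cause chains.
        for (Throwable t = failure; t != null && depth < 64; t = t.getCause(), depth++) {
            if (isSameOrDescendant(t.getClass().getClassLoader(), userLoader)) {
                return true;
            }
        }
        return false;
    }

    private static boolean isSameOrDescendant(ClassLoader candidate, ClassLoader userLoader) {
        // candidate is null for classes loaded by the bootstrap classloader (e.g. JDK types).
        for (ClassLoader cl = candidate; cl != null; cl = cl.getParent()) {
            if (cl == userLoader) {
                return true;
            }
        }
        return false;
    }
}
```

Note that this is only a heuristic: user code that throws a JDK exception
(say, a NullPointerException) would not be recognised by the class of the
exception alone, which is one reason an explicit hierarchy would still help.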

In my opinion, we should go further and also introduce a clear exception
hierarchy based on where the USER code was called. However, this could be a
breaking change for users who rely on the current exception types for job
management. We could address this by wrapping the existing exception rather
than replacing it.
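
As a rough illustration of the wrapping approach, the sketch below keeps the
original exception as the cause so that existing handling can still find it.
UserCodeException, ExceptionWrapping, runUserCode and findCause are
hypothetical names chosen for this example, not existing Flink classes or
methods.

```java
import java.util.Optional;

/** Hypothetical wrapper type; the name is illustrative, not an existing Flink class. */
final class UserCodeException extends RuntimeException {
    UserCodeException(String message, Throwable cause) {
        super(message, cause);
    }
}

final class ExceptionWrapping {

    /** Runs a user-provided callback and wraps anything it throws, keeping the original as cause. */
    static void runUserCode(Runnable userCallback) {
        try {
            userCallback.run();
        } catch (RuntimeException e) {
            throw new UserCodeException("Failure in user-provided code", e);
        }
    }

    /** Returns the first throwable of the given type in the cause chain, if any. */
    static <T extends Throwable> Optional<T> findCause(Throwable failure, Class<T> type) {
        int depth = 0;
        // The depth bound guards against cyclic cause chains.
        for (Throwable t = failure; t != null && depth < 64; t = t.getCause(), depth++) {
            if (type.isInstance(t)) {
                return Optional.of(type.cast(t));
            }
        }
        return Optional.empty();
    }
}
```

Anyone matching on the top-level exception type would still need to switch
to inspecting the cause chain, so wrapping softens the change rather than
removing it entirely.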

@Panagiotis
I agree with all your points. This proposal is in synergy with Pluggable 
Failure Enrichers.

Regards,
Hong

> On 6 Jun 2023, at 06:50, Panagiotis Garefalakis <pga...@apache.org> wrote:
> 
> Thanks for bringing this up Hong!
> 
> Classifying exceptions was also the main driving factor behind pluggable
> failure enrichers <https://issues.apache.org/jira/browse/FLINK-31508>.
> However, we could do a much better job maintaining a hierarchy of System
> and User exceptions thus making the classification logic more
> straightforward.
> 
>   - Defining better system/user exceptions with some kind of hierarchy is
>   definitely a step forward (and refactoring the existing ones)
>   - Classloader filtering could definitely be used for discovering errors
>   originating from user-defined code, see doc
>   <https://docs.google.com/document/d/1pcHg9F3GoDDeVD5GIIo2wO67Hmjgy0-hRDeuFnrMgT4/edit#heading=h.ato31xdnm7nk>
>   - Eventually we could also release a simple failure enricher using the
>   above improvements to automatically classify errors on the JM's
>   exceptions endpoint
> 
> Cheers,
> Panagiotis
> 
> On Wed, May 31, 2023 at 9:12 PM Paul Lam <paullin3...@gmail.com> wrote:
> 
>> Hi Hong,
>> 
>> Thanks for starting the discussion! I believe the exception classification
>> between user exceptions and system exceptions has been long-awaited.
>> 
>> It's worth mentioning that years ago there was a related discussion [1],
>> FYI.
>> 
>> I’m in favor of the heuristic approach of classifying exceptions by the
>> classloader they come from. In addition, we could introduce extra
>> configuration options to allow manual exception classification based on
>> the package name of exceptions.
>> 
>> [1] https://lists.apache.org/thread/gms4nysnb3o4v2k6421m5hsq0g7gtr81
>> 
>> Best,
>> Paul Lam
>> 
>>> On 25 May 2023, at 23:07, Teoh, Hong <lian...@amazon.co.uk.INVALID> wrote:
>>> 
>>> Hi all,
>>> 
>>> This discussion thread is to gauge community opinion and gather feedback
>>> on implementing a better exception hierarchy in Flink to identify
>>> exceptions that come from running “User job code” and exceptions coming
>>> from “Flink engine code”.
>>> 
>>> Problem:
>>> Flink provides a distributed processing engine (SYSTEM) to run a data
>>> streaming job (USER). There are many places in the code where the engine
>>> runs “user job provided Java classes”, such as
>>> serialization/deserialization, configuration objects, credential loading,
>>> and running the setup() method on certain Operators.
>>> Sometimes when evaluating a stack trace, it might be hard to
>>> automatically determine whether an exception arises from a Flink engine
>>> problem or from a problem associated with a particular job.
>>> 
>>> Proposed way forward:
>>> - It would be good to have an exception hierarchy maintained by Flink
>>> that separates out the exceptions arising from running “USER provided
>>> classes”. That way, we can improve our ability to automatically classify
>>> and mitigate these exceptions.
>>> - We could also separate exceptions by the function in which they
>>> originate, e.g. FlinkSerializationException, FlinkConfigurationException,
>>> etc. (we already have a similar concept with IncompatibleKeysException).
>>> - This has synergy with FLIP-304: Pluggable Failure Enrichers (since it
>>> would simplify the logic in the USER/SYSTEM classifier there) [1].
>>> - In addition, this has been discussed before in the context of updating
>>> the exception thrown by serialisers to be a Flink-specific serialisation
>>> exception instead of IllegalStateException [2].
>>> 
>>> 
>>> Any thoughts on the above?
>>> 
>>> Regards,
>>> Hong
>>> 
>>> 
>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-304%3A+Pluggable+Failure+Enrichers
>>> [2] https://lists.apache.org/thread/0o859h1vdx6mwv0fqvmybpn574692jtg
>>> 
>>> 
>> 
>> 
