Thanks for the engagement on the thread! Sorry for the late reply, I was away on holiday for a bit.

@Paul Thanks for linking the historical discussion. Yes, I agree that using the classloader to determine whether an exception type comes from a user classloader rather than the system classloader would be helpful. In my opinion, we should go further and also introduce a clear exception hierarchy based on where the USER code was called. However, this might be a breaking change for some users, because they may rely on the current exception types for job management. We could address this by wrapping the existing exception rather than replacing it.

@Panagiotis I agree with all your points. This proposal has synergy with Pluggable Failure Enrichers.

Regards,
Hong

> On 6 Jun 2023, at 06:50, Panagiotis Garefalakis <pga...@apache.org> wrote:
>
> Thanks for bringing this up Hong!
>
> Classifying exceptions was also the main driving factor behind pluggable
> failure enrichers <https://issues.apache.org/jira/browse/FLINK-31508>.
> However, we could do a much better job maintaining a hierarchy of System
> and User exceptions thus making the classification logic more
> straightforward.
>
> - Defining better system/user exceptions with some kind of hierarchy is
>   definitely a step forward (and refactoring the existing ones)
> - Classloader filtering could definitely be used for discovering errors
>   originating from user defined code, see doc
>   <https://docs.google.com/document/d/1pcHg9F3GoDDeVD5GIIo2wO67Hmjgy0-hRDeuFnrMgT4/edit#heading=h.ato31xdnm7nk>
> - Eventually we could also release a simple failure enricher using the
>   above improvements to automatically classify errors on the JM's
>   exceptions endpoint
>
> Cheers,
> Panagiotis
>
> On Wed, May 31, 2023 at 9:12 PM Paul Lam <paullin3...@gmail.com> wrote:
>
>> Hi Hong,
>>
>> Thanks for starting the discussion! I believe the exception
>> classification between user exceptions and system exceptions has been
>> long-awaited.
>>
>> It's worth mentioning that years ago there was a related discussion [1],
>> FYI.
>>
>> I’m in favor of the heuristic approach to classify the exceptions by
>> which classloader it comes from. In addition, we could introduce extra
>> configurations to allow manual exception classification based on the
>> package name of exceptions.
>>
>> [1] https://lists.apache.org/thread/gms4nysnb3o4v2k6421m5hsq0g7gtr81
>>
>> Best,
>> Paul Lam
>>
>>> On 25 May 2023, at 23:07, Teoh, Hong <lian...@amazon.co.uk.INVALID> wrote:
>>>
>>> Hi all,
>>>
>>> This discussion thread is to gauge community opinion and gather
>>> feedback on implementing a better exception hierarchy in Flink to
>>> identify exceptions that come from running “User job code” and
>>> exceptions coming from “Flink engine code”.
>>>
>>> Problem:
>>> Flink provides a distributed processing engine (SYSTEM) to run a data
>>> streaming job (USER). There are many places in code where the engine
>>> runs “user job provided java classes”, such as
>>> serialization/deserialization, configuration objects, credential
>>> loading, running setup() method on certain Operators.
>>> Sometimes when evaluating a stack trace, it might be hard to
>>> automatically determine if an exception is arising out of a Flink
>>> engine problem, or a problem associated to a particular job.
>>>
>>> Proposed way forward:
>>> - It would be good to have an exception hierarchy maintained by Flink
>>>   that separates out the exceptions arising from running “USER provided
>>>   classes”. That way, we can improve our ability to automatically
>>>   classify and mitigate these exceptions.
>>> - We could also include separating out the places where exception
>>>   originates based on function - FlinkSerializationException,
>>>   FlinkConfigurationException, etc. (we already have a similar concept
>>>   with IncompatibleKeysException)
>>> - This has synergy with FLIP-304: Pluggable Failure Enrichers (since it
>>>   would simplify the logic in the USER/SYSTEM classifier there) [1].
>>> - In addition, this has been discussed before in the context of
>>>   updating the exception thrown by serialisers to be a Flink-specific
>>>   serialisation exception instead of IllegalStateException [2]
>>>
>>> Any thoughts on the above?
>>>
>>> Regards,
>>> Hong
>>>
>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-304%3A+Pluggable+Failure+Enrichers
>>> [2] https://lists.apache.org/thread/0o859h1vdx6mwv0fqvmybpn574692jtg
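
To make the two ideas in this thread concrete (classifying by classloader, and wrapping rather than replacing the existing exception), here is a minimal sketch in plain Java. It is only an illustration under stated assumptions: `ExceptionOriginSketch`, `Origin`, `UserCodeException`, `classify` and `wrapIfUser` are hypothetical names and not existing Flink APIs, and the sketch assumes the caller can supply the classloader that loaded the job's classes.

```java
// Hypothetical sketch: none of these types or methods exist in Flink.
final class ExceptionOriginSketch {

    enum Origin { USER, SYSTEM }

    /** Hypothetical wrapper that keeps the original exception as its cause. */
    static final class UserCodeException extends RuntimeException {
        UserCodeException(Throwable cause) {
            super("Exception originated from user-provided code", cause);
        }
    }

    /**
     * Walks the cause chain and reports USER if any exception class in it
     * was defined by the given user-code classloader.
     */
    static Origin classify(Throwable failure, ClassLoader userCodeLoader) {
        int depth = 0;
        // The depth bound guards against (rare) cyclic cause chains.
        for (Throwable t = failure; t != null && depth < 64; t = t.getCause(), depth++) {
            // getClassLoader() returns null for classes loaded by the
            // bootstrap loader, e.g. java.lang.IllegalStateException.
            ClassLoader definingLoader = t.getClass().getClassLoader();
            if (definingLoader != null && definingLoader == userCodeLoader) {
                return Origin.USER;
            }
        }
        return Origin.SYSTEM;
    }

    /**
     * Wraps instead of replacing: callers that inspect getCause() still see
     * the original exception type.
     */
    static Throwable wrapIfUser(Throwable failure, ClassLoader userCodeLoader) {
        if (failure instanceof UserCodeException) {
            return failure; // already wrapped
        }
        if (classify(failure, userCodeLoader) == Origin.USER) {
            return new UserCodeException(failure);
        }
        return failure;
    }

    public static void main(String[] args) {
        // For the demo only, treat the loader of this class as the "user" loader.
        ClassLoader loader = ExceptionOriginSketch.class.getClassLoader();
        System.out.println(classify(new IllegalStateException("engine"), loader));
        System.out.println(classify(
                new RuntimeException(new UserCodeException(new IllegalStateException())), loader));
    }
}
```

One limitation worth noting: user code that throws a JDK exception such as `IllegalStateException` is still reported as SYSTEM by this heuristic, since the exception class itself is not loaded by the user classloader. That gap is part of why an explicit hierarchy at the call sites of user code seems useful in addition to classloader filtering.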