Thanks for bringing this up, Hong! Classifying exceptions was also the main driving factor behind pluggable failure enrichers <https://issues.apache.org/jira/browse/FLINK-31508>. However, we could do a much better job of maintaining a hierarchy of System and User exceptions, which would make the classification logic more straightforward.
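To make the hierarchy idea a bit more concrete, here is a minimal sketch of how a classifier could get simpler once such base types exist. All class names below (FlinkUserException, FlinkSystemException, the Origin enum) are hypothetical and are not existing Flink classes; FlinkSerializationException is only the name floated in the original proposal, and placing it under the user branch is my assumption.

```java
final class ExceptionOriginSketch {

    /** Where a failure is believed to originate. */
    enum Origin { USER, SYSTEM, UNKNOWN }

    /** Hypothetical base type for failures raised while running user-provided classes. */
    static class FlinkUserException extends RuntimeException {
        FlinkUserException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical base type for failures raised by engine code. */
    static class FlinkSystemException extends RuntimeException {
        FlinkSystemException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical function-specific subtype; treating it as a user failure is an assumption. */
    static class FlinkSerializationException extends FlinkUserException {
        FlinkSerializationException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    private static final int MAX_DEPTH = 64;

    /** Walks the cause chain and returns the origin of the first marked exception found. */
    static Origin classify(Throwable failure) {
        Throwable current = failure;
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (int depth = 0; current != null && depth < MAX_DEPTH; depth++) {
            if (current instanceof FlinkUserException) {
                return Origin.USER;
            }
            if (current instanceof FlinkSystemException) {
                return Origin.SYSTEM;
            }
            current = current.getCause();
        }
        return Origin.UNKNOWN;
    }
}
```

With types like these, a failure enricher would only need instanceof checks along the cause chain instead of inspecting stack traces or messages.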

- Defining better system/user exceptions with some kind of hierarchy (and refactoring the existing ones) is definitely a step forward
- Classloader filtering could definitely be used for discovering errors originating from user-defined code, see doc <https://docs.google.com/document/d/1pcHg9F3GoDDeVD5GIIo2wO67Hmjgy0-hRDeuFnrMgT4/edit#heading=h.ato31xdnm7nk>
- Eventually we could also release a simple failure enricher using the above improvements to automatically classify errors on the JM's exceptions endpoint

Cheers,
Panagiotis

On Wed, May 31, 2023 at 9:12 PM Paul Lam <paullin3...@gmail.com> wrote:

> Hi Hong,
>
> Thanks for starting the discussion! I believe the exception classification
> between user exceptions and system exceptions has been long-awaited.
>
> It's worth mentioning that years ago there was a related discussion [1], FYI.
>
> I’m in favor of the heuristic approach of classifying exceptions by the
> classloader they come from. In addition, we could introduce extra
> configurations to allow manual exception classification based on the
> package name of exceptions.
>
> [1] https://lists.apache.org/thread/gms4nysnb3o4v2k6421m5hsq0g7gtr81
>
> Best,
> Paul Lam
>
> > On May 25, 2023, at 23:07, Teoh, Hong <lian...@amazon.co.uk.INVALID> wrote:
> >
> > Hi all,
> >
> > This discussion thread is to gauge community opinion and gather feedback
> > on implementing a better exception hierarchy in Flink to identify
> > exceptions that come from running “User job code” and exceptions coming
> > from “Flink engine code”.
> >
> > Problem:
> > Flink provides a distributed processing engine (SYSTEM) to run a data
> > streaming job (USER). There are many places in code where the engine runs
> > “user job provided java classes”, such as serialization/deserialization,
> > configuration objects, credential loading, running setup() method on
> > certain Operators.
> > Sometimes when evaluating a stack trace, it might be hard to
> > automatically determine if an exception is arising out of a Flink engine
> > problem, or a problem associated with a particular job.
> >
> > Proposed way forward:
> > - It would be good to have an exception hierarchy maintained by Flink
> >   that separates out the exceptions arising from running “USER provided
> >   classes”. That way, we can improve our ability to automatically classify
> >   and mitigate these exceptions.
> > - We could also separate out the places where an exception originates
> >   based on function - FlinkSerializationException,
> >   FlinkConfigurationException, etc. (we already have a similar concept
> >   with IncompatibleKeysException)
> > - This has synergy with FLIP-304: Pluggable Failure Enrichers (since it
> >   would simplify the logic in the USER/SYSTEM classifier there) [1].
> > - In addition, this has been discussed before in the context of updating
> >   the exception thrown by serialisers to be a Flink-specific serialisation
> >   exception instead of IllegalStateException [2]
> >
> > Any thoughts on the above?
> >
> > Regards,
> > Hong
> >
> > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-304%3A+Pluggable+Failure+Enrichers
> > [2] https://lists.apache.org/thread/0o859h1vdx6mwv0fqvmybpn574692jtg
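
As a rough illustration of the classloader heuristic and the package-name configuration mentioned in this thread, a classifier could look something like the sketch below. This is only a sketch under assumptions: the class name, the Origin enum, and the userPackagePrefixes parameter are made up for illustration and are not part of Flink, and a real implementation would probably also need to consider parent/child classloader relationships.

```java
import java.util.List;

final class ClassLoaderHeuristicSketch {

    /** Where a failure is believed to originate. */
    enum Origin { USER, SYSTEM }

    private static final int MAX_DEPTH = 64;

    /**
     * Returns USER if any exception in the cause chain was loaded by the given
     * user-code classloader, or has a class name starting with one of the
     * configured prefixes. Otherwise returns SYSTEM.
     */
    static Origin classify(
            Throwable failure, ClassLoader userClassLoader, List<String> userPackagePrefixes) {
        Throwable current = failure;
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (int depth = 0; current != null && depth < MAX_DEPTH; depth++) {
            Class<?> type = current.getClass();
            // getClassLoader() returns null for bootstrap classes, so guard against
            // a null userClassLoader accidentally matching them.
            if (userClassLoader != null && type.getClassLoader() == userClassLoader) {
                return Origin.USER;
            }
            String name = type.getName();
            for (String prefix : userPackagePrefixes) {
                if (name.startsWith(prefix)) {
                    return Origin.USER;
                }
            }
            current = current.getCause();
        }
        return Origin.SYSTEM;
    }
}
```

The prefix list acts as the manual override: exceptions from packages that the classloader check misses can still be marked as user failures through configuration.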