Thanks for bringing this up, Hong!

Classifying exceptions was also the main driving factor behind pluggable
failure enrichers <https://issues.apache.org/jira/browse/FLINK-31508>.
However, we could do a much better job of maintaining a hierarchy of System
and User exceptions, which would make the classification logic more
straightforward.
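
To make the hierarchy idea a bit more concrete, here is a rough sketch of
what I have in mind. All type names in it (UserCodeException,
UserSerializationException, SystemCodeException, FailureType) are made up
for illustration and do not exist in Flink today. The point is only that a
classifier could reduce to an instanceof check over the cause chain:

```java
/** Rough sketch only: every type name below is hypothetical, not an existing Flink class. */
final class ExceptionHierarchySketch {

    enum FailureType { USER, SYSTEM, UNKNOWN }

    /** Hypothetical root for failures raised while running user-provided classes. */
    static class UserCodeException extends RuntimeException {
        UserCodeException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical function-specific subtype, e.g. a user serializer that fails. */
    static class UserSerializationException extends UserCodeException {
        UserSerializationException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical root for failures raised by the engine itself. */
    static class SystemCodeException extends RuntimeException {
        SystemCodeException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Returns the type of the first marker exception found in the cause chain. */
    static FailureType classify(Throwable failure) {
        Throwable current = failure;
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (int depth = 0; current != null && depth < 64; depth++) {
            if (current instanceof UserCodeException) {
                return FailureType.USER;
            }
            if (current instanceof SystemCodeException) {
                return FailureType.SYSTEM;
            }
            current = current.getCause();
        }
        return FailureType.UNKNOWN;
    }
}
```

Anything that carries neither marker stays UNKNOWN, which is where the
existing exceptions would land until they are refactored.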

   - Defining better system/user exceptions with some kind of hierarchy
   (and refactoring the existing ones) is definitely a step forward
   - Classloader filtering could definitely be used for discovering errors
   originating from user-defined code, see the doc
   <https://docs.google.com/document/d/1pcHg9F3GoDDeVD5GIIo2wO67Hmjgy0-hRDeuFnrMgT4/edit#heading=h.ato31xdnm7nk>
   - Eventually we could also release a simple failure enricher that uses the
   above improvements to automatically classify errors on the JM's exceptions
   endpoint
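
For the classloader filtering point, combined with the package-name
configuration Paul suggests below, a minimal sketch could look like the
following. The class name, the method name and the idea of passing the
package prefixes as a plain list are my assumptions, not an existing Flink
API. It also only compares against one classloader by identity, so treat it
as a heuristic rather than a complete classifier:

```java
import java.util.List;
import java.util.Objects;

/** Rough sketch only: hypothetical helper, not part of Flink. */
final class UserCodeHeuristicSketch {

    /**
     * Returns true if any exception in the cause chain was loaded by the given
     * user-code classloader, or if its class name starts with one of the
     * manually configured package prefixes.
     */
    static boolean isLikelyUserException(
            Throwable failure,
            ClassLoader userCodeClassLoader,
            List<String> userPackagePrefixes) {
        // A null loader would wrongly match bootstrap classes such as java.lang.*.
        Objects.requireNonNull(userCodeClassLoader, "userCodeClassLoader");
        Throwable current = failure;
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (int depth = 0; current != null && depth < 64; depth++) {
            Class<?> type = current.getClass();
            if (type.getClassLoader() == userCodeClassLoader) {
                return true;
            }
            String name = type.getName();
            for (String prefix : userPackagePrefixes) {
                if (name.startsWith(prefix)) {
                    return true;
                }
            }
            current = current.getCause();
        }
        return false;
    }
}
```

One known gap: user code that throws a JDK exception such as
IllegalStateException is not caught by the classloader check, because that
class is loaded by the bootstrap loader. That is part of why a proper
hierarchy would still help.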

Cheers,
Panagiotis

On Wed, May 31, 2023 at 9:12 PM Paul Lam <paullin3...@gmail.com> wrote:

> Hi Hong,
>
> Thanks for starting the discussion! I believe the exception classification
> between user exceptions and system exceptions has been long-awaited.
>
> It's worth mentioning that years ago there was a related discussion [1],
> FYI.
>
> I’m in favor of the heuristic approach of classifying exceptions by the
> classloader they come from. In addition, we could introduce extra
> configuration options to allow manual exception classification based on
> the package name of exceptions.
>
> [1] https://lists.apache.org/thread/gms4nysnb3o4v2k6421m5hsq0g7gtr81
>
> Best,
> Paul Lam
>
> > On May 25, 2023, at 23:07, Teoh, Hong <lian...@amazon.co.uk.INVALID> wrote:
> >
> > Hi all,
> >
> > This discussion thread is to gauge community opinion and gather feedback
> > on implementing a better exception hierarchy in Flink to identify
> > exceptions that come from running “User job code” and exceptions coming
> > from “Flink engine code”.
> >
> > Problem:
> > Flink provides a distributed processing engine (SYSTEM) to run a data
> > streaming job (USER). There are many places in the code where the engine
> > runs “user job provided java classes”, such as
> > serialization/deserialization, configuration objects, credential loading,
> > and running the setup() method on certain Operators.
> > Sometimes when evaluating a stack trace, it might be hard to
> > automatically determine whether an exception arises from a Flink engine
> > problem or from a problem associated with a particular job.
> >
> > Proposed way forward:
> > - It would be good to have an exception hierarchy maintained by Flink
> > that separates out the exceptions arising from running “USER provided
> > classes”. That way, we can improve our ability to automatically classify
> > and mitigate these exceptions.
> > - We could also separate exceptions by the function in which they
> > originate: FlinkSerializationException, FlinkConfigurationException,
> > etc. (we already have a similar concept with IncompatibleKeysException)
> > - This has synergy with FLIP-304: Pluggable Failure Enrichers (since it
> > would simplify the logic in the USER/SYSTEM classifier there) [1].
> > - In addition, this has been discussed before in the context of updating
> > the exception thrown by serialisers to be a Flink-specific serialisation
> > exception instead of IllegalStateException [2]
> >
> >
> > Any thoughts on the above?
> >
> > Regards,
> > Hong
> >
> >
> > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-304%3A+Pluggable+Failure+Enrichers
> > [2] https://lists.apache.org/thread/0o859h1vdx6mwv0fqvmybpn574692jtg
> >
> >
>
>
