Hi Kevin,

I haven't worked with Bugsnag, so I cannot give more input on that one.

For Flink, exceptions are handled by the job's scheduler. Flink collects these exceptions in a bounded queue called the exception history [1]. It records task failures as well as global failures that cause the job to fail or restart. The size of the queue can be set through web.exception-history-size [2].

Additionally, there's the FatalErrorHandler interface [3], which is used for fatal (i.e. unrecoverable) errors of the ecosystem. You might want to have a look at the implementations of this interface.
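As a rough sketch of how the exception history could be pulled from the REST API [1] and handed to an external tracker: the class below only fetches the raw JSON for one job and prints it. The class name, the default address http://localhost:8081 and the command-line handling are my assumptions for illustration, not something Flink ships, and forwarding the result to Bugsnag is left out because I don't know that API.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper, not part of Flink: polls the exception history of one job.
class ExceptionHistoryPoller {

    // Builds the URI of the /jobs/<jobid>/exceptions endpoint described in [1].
    static URI exceptionsUri(String restBaseUrl, String jobId) {
        String base = restBaseUrl.endsWith("/")
                ? restBaseUrl.substring(0, restBaseUrl.length() - 1)
                : restBaseUrl;
        return URI.create(base + "/jobs/" + jobId + "/exceptions");
    }

    // Returns the raw JSON body; parsing and forwarding are left to the caller.
    static String fetchExceptions(HttpClient client, URI uri) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(uri)
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException(
                    "Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: ExceptionHistoryPoller <jobId> [restBaseUrl]");
            return;
        }
        // Assumption: the JobManager's REST endpoint is reachable at this address.
        String restBaseUrl = args.length > 1 ? args[1] : "http://localhost:8081";
        HttpClient client = HttpClient.newHttpClient();
        System.out.println(fetchExceptions(client, exceptionsUri(restBaseUrl, args[0])));
    }
}
```

Keep in mind that the history is bounded by web.exception-history-size [2], so a poller like this would have to run often enough not to miss entries that get evicted.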
I hope that helps a bit.
Matthias

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/rest_api/#jobs-jobid-exceptions
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/#web-exception-history-size
[3] https://github.com/apache/flink/blob/239067c8d6393a273acaa2a3c3f57ad1c5486e3a/flink-rpc/flink-rpc-core/src/main/java/org/apache/flink/runtime/rpc/FatalErrorHandler.java#L22

On Mon, Jun 21, 2021 at 3:12 PM Kevin Lam <kevin....@shopify.com> wrote:

> Hi all,
>
> I'm interested in instrumenting an Apache Flink application so that we can
> monitor exceptions. I was wondering what the best practices are here? Is
> there a good way to observe all the exceptions inside of a Flink
> application, including Flink internals?
>
> We are currently thinking of using Bugsnag, which has some steps to
> integrate with java applications:
> https://docs.bugsnag.com/platforms/java/other/, which works fine for
> uncaught exceptions in the job manager / pipeline driver context, but
> doesn't catch anything outside of that.
>
> We're also interested in reporting on exceptions that occur in the job
> execution context, eg. in task managers.
>
> Any tips/suggestions? I'd love to learn more about exception tracking and
> handling in Flink :)
>
> (reposting because it looks like my other thread got deleted?)