Hi Panagiotis, This is an excellent proposal and something everyone trying to provide "Flink as a service" needs to solve at some point. I have a couple of questions:
If I understand the proposal correctly, this is just about adding tags to the Throwable by running a tuple of (Throwable, FailureContext) through a user-defined model. 1) Should we also consider adding labels? The combination of tags and labels seems to be what most systems offer; sometimes, they offer labels only (key=value pairs) because tags can be implemented using those, but not the other way around. 2) Since we can not predict how heavy user-defined models ("listeners") are going to be, it would be great to keep the interfaces/data structures immutable so we can push things over to the I/O threads. Also, it sounds off to call the main interface a Listener since it's supposed to enhance the original throwable with additional metadata. I'd propose something along the lines of (we should have better names, this is just to outline the idea): interface FailureEnricher { ThrowableWithTagsAndLabels enrichFailure(Throwable cause, ImmutableContextualMetadataAboutTheThrowable context); } The names should change; this is just to outline the idea. 3) You're proposing to support a set of listeners. Since you're passing the mutable context around, which includes tags set by the previous listener, do you expect users to make any assumptions about the order in which listeners are executed? *@Shammon* Users may want to save results in listener, and then they can get the > historical results even jabmanager failover. Can we provide a unified > implementation for data storage requirements? I think we should explicitly state that all "listeners" are treated as stateless. I don't see any strong reason for snapshotting them. Best, D. On Sat, Mar 18, 2023 at 1:00 AM Shammon FY <zjur...@gmail.com> wrote: > Hi Panagiotis > > Thank you for starting this discussion. I think this FLIP is valuable and > can help user to analyze the causes of job failover better! > > I have two comments as follows > > 1. How about adding more job information in FailureListenerContext? For > example, job vertext, subtask, taskmanager location. And then user can do > more statistics according to different dimensions. > > 2. Users may want to save results in listener, and then they can get the > historical results even jabmanager failover. Can we provide a unified > implementation for data storage requirements? > > > Best, > shammon FY > > > On Saturday, March 18, 2023, Panagiotis Garefalakis <pga...@apache.org> > wrote: > > > Hi everyone, > > > > This FLIP [1] proposes a pluggable interface for failure handling > allowing > > users to implement custom failure logic using the plugin framework. > > Motivated by existing proposals [2] and tickets [3], this enables > use-cases > > like: assigning particular types to failures (e.g., User or System), > > emitting custom metrics per type (e.g., application or platform), even > > exposing errors to downstream consumers (e.g., notification systems). > > > > Thanks to Piotr and Anton for the initial reviews and discussions! > > > > For anyone interested, the starting point would be the FLIP [1] that I > > created, > > describing the motivation and the proposed changes (part of the core, > > runtime and web). > > > > The intuition behind this FLIP is being able to execute custom logic on > > failures by exposing a FailureListener interface. Implementation by users > > can be simply loaded to the system as Jar files. FailureListeners may > also > > decide to assign failure tags to errors (expressed as strings), > > that will then be exposed as metadata by the UI/Rest interfaces. > > > > Feedback is always appreciated! Looking forward to your thoughts! > > > > [1] > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-304% > > 3A+Pluggable+failure+handling+for+Apache+Flink > > [2] > > https://docs.google.com/document/d/1pcHg9F3GoDDeVD5GIIo2wO67 > > Hmjgy0-hRDeuFnrMgT4 > > [3] https://issues.apache.org/jira/browse/FLINK-20833 > > > > Cheers, > > Panagiotis > > >