Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

David Morávek Thu, 23 Mar 2023 00:50:27 -0700

>
> One additional remark on introducing it as an async operation: We would
> need a new configuration parameter to define the timeout for such a
> listener call, wouldn't we?
>


This could be left up to the implementor to handle.

What about adding an extra method getNamespace() to the Listener interface
> which returns an Optional<String>.
>

I'd avoid mixing an additional concept into this. We can simply have a new
method that returns a set of keys the listener can output. We can validate
this at the JM startup time and fail fast (since it's a configuration
error) if there is an overlap. If the listener outputs the key that is not
allowed to, I wouldn't be afraid to call into a fatal error handler since
it's an invalid implementation.

Best,
D.

On Thu, Mar 23, 2023 at 8:34 AM Matthias Pohl
<[email protected]> wrote:

> Sounds good. Two points I want to add:
>
>    - Listener execution should be independent — however we need a way to
> > enforce a Label key/key-prefix is only assigned to a single Listener,
> > thinking of a validation step both at Listener init and runtime stages
> >
> What about adding an extra method getNamespace() to the Listener interface
> which returns an Optional<String>. Therefore, the implementation/the user
> can decide depending on the use case whether it's necessary to have
> separate namespaces for the key/value pairs or not. On the Flink side, we
> would just merge the different maps considering their namespaces.
>
> A flaw of this approach is that if a user decides to use the same namespace
> for multiple listeners, how is an error in one of the listeners represented
> in the outcome? We would have to overwrite either the successful listener's
> result or the failed ones. I wanted to share it, anyway.
>
> One additional remark on introducing it as an async operation: We would
> need a new configuration parameter to define the timeout for such a
> listener call, wouldn't we?
>
> Matthias
>
> On Wed, Mar 22, 2023 at 4:56 PM Panagiotis Garefalakis <[email protected]>
> wrote:
>
> > Hi everyone,
> >
> >
> > Thanks for the valuable comments!
> > Excited to see this is an area of interest for the community!
> >
> > Summarizing some of the main points raised along with my thoughts:
> >
> >    - Labels (Key/Value) pairs are more expressive than Tags (Strings) so
> >    using the former is a good idea — I am also debating if we want to
> > return
> >    multiple KV pairs per Listener (one could argue that we could split
> the
> >    logic in multiple Listeners to support that)
> >    - An immutable context along with data returned using the interface
> >    method implementations is a better approach than a mutable Collection
> >    - Listener execution should be independent — however we need a way to
> >    enforce a Label key/key-prefix is only assigned to a single Listener,
> >    thinking of a validation step both at Listener init and runtime stages
> >    - We want to perform async Listener operations as sync could block the
> >    main thread — exposing an ioExecutor pool through the context could be
> > an
> >    elegant solution here
> >    - Make sure Listener errors are not failing jobs — make sure to log
> and
> >    keep the job alive
> >    - We need better naming / public interface separation/description
> >
> >         -  Even though custom restart strategies share some properties
> with
> > Listeners, they would probably need a separate interface with a different
> > return type anyway (restart strategy not labels) and in general they are
> > different and complex enough to justify their own FLIP (that can also be
> a
> > follow-up).
> >
> >
> > What do people think? I am planning to modify the FLIP to reflect these
> > changes if they make sense to everyone.
> >
> > Cheers,
> > Panagiotis
> >
> > On Wed, Mar 22, 2023 at 6:28 AM Hong Teoh <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > Thank you Panagiotis for proposing this. From the size of the thread,
> > this
> > > is a much needed feature in Flink!
> > > Some thoughts, to extend those already adeptly summarised by Piotr,
> > > Matthias and Jing.
> > >
> > > - scope of FLIP: +1 to scoping this FLIP to observability around a
> > > restart. That would include adding metadata + exposing metadata to
> > external
> > > systems. IMO, introducing a new restart strategy solves different
> > problems,
> > > is much larger scope and should be covered in a separate FLIP.
> > >
> > > - failure handling: At the moment, we propose transitioning the Flink
> job
> > > to a terminal FAILED state when JobListener fails, when the job could
> > have
> > > transitioned to RESTARTING->RUNNING. If we are keeping in line with the
> > > scope to add metadata/observability around job restarts, we should not
> be
> > > affecting the running of the Flink job itself. Could I propose we
> instead
> > > log WARN/ERROR.
> > >
> > > - immutable context: +1 to keeping the contract clear via return types.
> > > - async operation: +1 to adding ioexecutor to context, however, given
> we
> > > don’t want to block the actual job restart on adding metadata / calling
> > > external services, should we consider returning and letting futures
> > > complete independently?
> > >
> > > - independent vs ordered execution: Should we consider making the order
> > of
> > > execution deterministic (use a List instead of Set)?
> > >
> > >
> > > Once again, thank you for working on this.
> > >
> > > Regards,
> > > Hong
> > >
> > >
> > > > On 21 Mar 2023, at 21:07, Jing Ge <[email protected]>
> wrote:
> > > >
> > > > Hi,
> > > >
> > > > Thanks Panagiotis for this FLIP and thanks for all valuable
> > discussions.
> > > > I'd like to share my two cents:
> > > >
> > > > - FailureListenerContext#addTag and FailureListenerContext#getTags.
> It
> > > > seems that we have to call getTags() and then do remove activities if
> > we
> > > > want to delete any tags (according to the javadoc in the FLIP).  It
> is
> > > > inconsistent for me too. Either offer addTag(), deleteTag(), and let
> > > > getTags() return immutable collection, or offer getTags() only to
> > return
> > > > mutable collection.
> > > >
> > > > - label vs tag. Label is a great idea +1. AFAIC, tag could be a
> special
> > > > case of label, i.e. key="tag". It is convenient to offer the xxxTag()
> > > > method if the user only needs one label. I would love to have both of
> > > them.
> > > > Another thought is that tag implicitly contains the meaning of
> > > "immutable".
> > > >
> > > > - +1 for a separate FLIP of customized restart strategy. Attention
> > should
> > > > be taken to make sure it works well with Flink built-in
> restartStrategy
> > > in
> > > > order to have the single source of truth.
> > > >
> > > > - execution order. The default independent execution should be fine.
> > > > According to the FailureListener interface definition in the FLIP,
> > users
> > > > should be able to easily build a listener chain[1] to offer
> sequential
> > > > execution, e.g. public FailureListener(FailureListener nextListener).
> > > > Another option is to modify the interface or provide another
> interface
> > > > alongside the current one to extend the method to support
> > ListenerChain,
> > > > i.e. void onFailure(Throwable cause, FailureListenerContext context,
> > > > ListenerChain listenerChain). Users can also mix them up.
> > > >
> > > > - naming. Afaiu, the pluggable extension is not limited to failure
> > > > enrichment. Conceptually it can do everything for the given failure,
> > e.g.
> > > > start counting metric as the FLIP described, calling an external
> > system,
> > > > sending notification to slack channel, etc. you name it. It sounds to
> > me
> > > > more like a FailureActionListener - it can trigger actions based on
> > > > failure. Failure enrichment is one type of action.
> > > >
> > > > Best regards,
> > > > Jing
> > > >
> > > > [1] https://en.wikipedia.org/wiki/Chain-of-responsibility_pattern
> > > >
> > > > On Tue, Mar 21, 2023 at 3:39 PM Matthias Pohl
> > > > <[email protected]> wrote:
> > > >
> > > >> Thanks for the proposal, Panagiotis. A lot of good points have been
> > > already
> > > >> shared. I just want to add my view on some of the items:
> > > >>
> > > >> - independent execution vs ordered execution: I prefer the listeners
> > > being
> > > >> processed independently from each other because it adds less
> > complexity
> > > >> code-wise. The use case Piotr described (where you want to reuse
> some
> > > other
> > > >> classifier) is the only one I can think of where we actually need
> > > >> classifiers depending on each other. Supporting such a use case
> right
> > > from
> > > >> the start feels a bit over-engineered and could be covered in a
> > > follow-up
> > > >> FLIP if we really come to that point where such a feature is
> requested
> > > by
> > > >> users.
> > > >>
> > > >> - key/value pairs instead of plain labels: I think that's a good
> idea.
> > > >> key/value pairs are more expressive. +1
> > > >>
> > > >> - extending the FLIP to cover restart strategy: I understand Gen's
> > > concern
> > > >> about introducing too many different types of plugins. But I would
> > still
> > > >> favor not extending the FLIP in this regard. A pluggable restart
> > > strategy
> > > >> sounds reasonable. But an error classifier and a restart strategy
> are
> > > still
> > > >> different enough to justify separate plugins, IMHO. And therefore, I
> > > would
> > > >> think that covering the restart strategy in a separate FLIP is the
> > > better
> > > >> option for the sake of simplicity.
> > > >>
> > > >> - immutable context: Passing in an immutable context and returning
> > data
> > > >> through the interface method's return value sounds like a better
> > > approach
> > > >> to harden the contract of the interface. +1 for that proposal
> > > >>
> > > >> - async operation: I think David is right. An async interface makes
> > the
> > > >> listener implementations more robust when it comes to heavy IO
> > > operations.
> > > >> The ioExecutor can be passed through the context object. +1
> > > >>
> > > >> Matthias
> > > >>
> > > >> On Tue, Mar 21, 2023 at 2:09 PM David Morávek <
> > [email protected]>
> > > >> wrote:
> > > >>
> > > >>> *@Piotr*
> > > >>>
> > > >>>
> > > >>>> I was thinking about actually defining the order of the
> > > >>>> classifiers/handlers and not allowing them to be asynchronous.
> > > >>>> Asynchronousity would create some problems: when to actually
> return
> > > the
> > > >>>> error to the user? After all async responses will get back?
> Before,
> > > but
> > > >>>> without classified exception? It would also add implementation
> > > >> complexity
> > > >>>> and I think we can always expand the API with async version in the
> > > >> future
> > > >>>> if needed.
> > > >>>
> > > >>>
> > > >>> As long as the classifiers need to talk to an external system, we
> by
> > > >>> definition need to allow them to be asynchronous to unblock the
> main
> > > >> thread
> > > >>> for handling other RPCs. Exposing ioExecutor via the context
> proposed
> > > >> above
> > > >>> would be great.
> > > >>>
> > > >>> After all async responses will get back
> > > >>>
> > > >>>
> > > >>> This would be the same if we trigger them synchronously one by one,
> > > with
> > > >> a
> > > >>> caveat that synchronous execution might take significantly longer
> and
> > > >>> introduce unnecessary downtime to a job.
> > > >>>
> > > >>> D.
> > > >>>
> > > >>> On Tue, Mar 21, 2023 at 1:12 PM Zhu Zhu <[email protected]> wrote:
> > > >>>
> > > >>>> Hi Piotr,
> > > >>>>
> > > >>>> It's fine to me to have a separate FLIP to extend this
> > > >> `FailureListener`
> > > >>>> to support custom restart strategy.
> > > >>>>
> > > >>>> What I was a bit concerned is that if we just treat the
> > > >> `FailureListener`
> > > >>>> as an error classifier which is not crucial to Flink framework
> > > process,
> > > >>>> we may design it to run asynchronously and not trigger Flink
> > failures.
> > > >>>> This may be a blocker if later we want to enable it to support
> > custom
> > > >>>> restart strategy.
> > > >>>>
> > > >>>> Thanks,
> > > >>>> Zhu
> > > >>>>
> > > >>>> Dian Fu <[email protected]> 于2023年3月21日周二 19:53写道：
> > > >>>>>
> > > >>>>> Hi Panagiotis,
> > > >>>>>
> > > >>>>> Thanks for the proposal. This is a very valuable feature and will
> > be
> > > >> a
> > > >>>> good
> > > >>>>> add-on for Flink.
> > > >>>>>
> > > >>>>> I also think that it will be great if we can consider how to make
> > it
> > > >>>>> possible for users to customize the failure handling in this
> FLIP.
> > > >> It's
> > > >>>>> highly related to the problem we want to address in this FLIP and
> > > >> could
> > > >>>>> avoid refactoring the interfaces proposed in this FLIP too
> quickly.
> > > >>>>>
> > > >>>>> Currently it treats all kinds of exceptions the same. However,
> some
> > > >>> kinds
> > > >>>>> of exceptions are actually not recoverable at all. It could let
> > users
> > > >>> to
> > > >>>>> customize the failure handling logic to fail fast for certain
> known
> > > >>>>> unrecoverable exceptions and finally make these kinds of jobs get
> > > >>> noticed
> > > >>>>> and recoveried more quickly.
> > > >>>>>
> > > >>>>> Regards,
> > > >>>>> Dian
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> On Tue, Mar 21, 2023 at 4:36 PM Gen Luo <[email protected]>
> > wrote:
> > > >>>>>
> > > >>>>>> Hi Panagiotis,
> > > >>>>>>
> > > >>>>>> Thanks for the proposal.
> > > >>>>>>
> > > >>>>>> It's useful to enrich the information so that users can be more
> > > >>>>>> clear why the job is failing, especially platform developers who
> > > >>>>>> need to provide the information to their end users.
> > > >>>>>> And for the very FLIP, I'd prefer the naming `FailureEnricher`
> > > >>>>>> proposed by David, as the plugin doesn't really handle the
> > failure.
> > > >>>>>>
> > > >>>>>> However, like Zhu and Lijie said, I also joined a discussion
> > > >>>>>> recently about customized failure handling, e.g. counting the
> > > >>>>>> failure rate of pipeline regions separately, and failing the job
> > > >>>>>> when a specific error occurs, and so on.
> > > >>>>>> I suppose a custom restart strategy, or I'd call it a custom
> > > >>>>>> failure "handler", is indeed necessary. It can also enrich the
> > > >>>>>> information as the current proposed handler does.
> > > >>>>>>
> > > >>>>>> To avoid adding too many plugin interfaces which may confuse
> users
> > > >>>>>> and make the ExecutionFailureHandler more complex,
> > > >>>>>> I think it'd be better to consider the requirements at the same
> > > >> time.
> > > >>>>>>
> > > >>>>>> IMO, we can add a handler interface, then make the current
> restart
> > > >>>>>> strategy and the enricher both types of the handler. The
> handlers
> > > >>>>>> execute in sequence, and the failure is considered unrecoverable
> > if
> > > >>>>>> any of the handlers decides.
> > > >>>>>> In this way, users can also implement a handler using the
> enriched
> > > >>>>>> information provided by the previous handlers, e.g. fail the job
> > > >> and
> > > >>>>>> send a notification if too many failures are caused by the end
> > > >> users.
> > > >>>>>>
> > > >>>>>> Best,
> > > >>>>>> Gen
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> On Tue, Mar 21, 2023 at 11:38 AM Weihua Hu <
> > [email protected]
> > > >>>
> > > >>>> wrote:
> > > >>>>>>
> > > >>>>>>> Hi Panagiotis,
> > > >>>>>>>
> > > >>>>>>> Thanks for your proposal. It is valuable to analyze the reason
> > > >> for
> > > >>>>>>> failure with the user plug-in.
> > > >>>>>>>
> > > >>>>>>> Making the context immutable could make the contract stronger.
> > > >>>>>>> Letting the listener return an enriching result may be a better
> > > >>> way.
> > > >>>>>>>
> > > >>>>>>> IIUC, listeners could do two things, enrich more information
> > > >>>>>> (tags/labels)
> > > >>>>>>> to FailureHandlingResult, and push data out of Flink (metrics
> or
> > > >>>>>>> something).
> > > >>>>>>> IMO, we could split these two types into Listener and Advisor
> > > >>> (maybe
> > > >>>>>>> other names). The Listener just pushes the data out and returns
> > > >>>> nothing
> > > >>>>>> to
> > > >>>>>>> Flink, so we can run these async and don't have to wait for
> > > >>>> Listener's
> > > >>>>>>> result.
> > > >>>>>>> The Advisor returns rich information to the
> FailureHadingResult,
> > > >>>> and it
> > > >>>>>>> should
> > > >>>>>>> have a lighter logic.
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> Supporting a custom restart strategy is also valuable. In this
> > > >>>> design, we
> > > >>>>>>> use
> > > >>>>>>> RestartStrategy to construct a FailureHandingResult, and then
> > > >> pass
> > > >>>> it to
> > > >>>>>>> Listener.
> > > >>>>>>> My question is, should we change the restart strategy interface
> > > >> to
> > > >>>>>> support
> > > >>>>>>> the
> > > >>>>>>> custom restart strategy, or keep the current restart strategy
> and
> > > >>>> let the
> > > >>>>>>> later
> > > >>>>>>> Listener enrich the restartable information to
> > > >>> FailureHandingResult?
> > > >>>> The
> > > >>>>>>> latter
> > > >>>>>>> may cause some confusion when we use a custom restart strategy.
> > > >>>>>>> The default flink restart strategy also runs but does not take
> > > >>>> effect.
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> Best,
> > > >>>>>>> Weihua
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> On Mon, Mar 20, 2023 at 11:42 PM Lijie Wang <
> > > >>>> [email protected]>
> > > >>>>>>> wrote:
> > > >>>>>>>
> > > >>>>>>>> Hi Panagiotis,
> > > >>>>>>>>
> > > >>>>>>>> Thanks for driving this.
> > > >>>>>>>>
> > > >>>>>>>> +1 for supporting custom restart strategy, we did receive such
> > > >>>> requests
> > > >>>>>>>> from the user mailing list [1][2].
> > > >>>>>>>>
> > > >>>>>>>> Besides, in current design, the plugin will only do some
> > > >>>> statistical
> > > >>>>>> and
> > > >>>>>>>> classification work, and will not affect the
> > > >>>> *FailureHandlingResult*.
> > > >>>>>>> Just
> > > >>>>>>>> listening, no handling, it doesn't quite match the title.
> > > >>>>>>>>
> > > >>>>>>>> [1]
> > > >>>> https://lists.apache.org/thread/ch3s4jhh09wnff3tscqnb6btp2zlp2r1
> > > >>>>>>>> [2]
> > > >>>> https://lists.apache.org/thread/lwjfdr7c1ypo77r4rwojdk7kxx2sw4sx
> > > >>>>>>>>
> > > >>>>>>>> Best,
> > > >>>>>>>> Lijie
> > > >>>>>>>>
> > > >>>>>>>> Zhu Zhu <[email protected]> 于2023年3月20日周一 21:39写道：
> > > >>>>>>>>
> > > >>>>>>>>> Hi Panagiotis,
> > > >>>>>>>>>
> > > >>>>>>>>> Thanks for creating this proposal! It's good to enable Flink
> > > >> to
> > > >>>>>> handle
> > > >>>>>>>>> different errors in different ways, through a pluggable way.
> > > >>>>>>>>>
> > > >>>>>>>>> There are requests for flexible restart strategies from time
> > > >> to
> > > >>>> time,
> > > >>>>>>> for
> > > >>>>>>>>> different strategies of restart backoff time, or to suppress
> > > >>>>>> restarting
> > > >>>>>>>>> on certain errors. Therefore, I think it's better that the
> > > >>>> proposed
> > > >>>>>>>>> failure handling plugin can also support custom restart
> > > >>>> strategies.
> > > >>>>>>>>>
> > > >>>>>>>>> Maybe we can call it FailureHandlingAdvisor which provides
> > > >> more
> > > >>>>>>>>> information (labels) and gives advice (restart backoff time,
> > > >>>> whether
> > > >>>>>>>>> to restart)? I do not have a strong opinion though, any
> > > >>>> explanatory
> > > >>>>>>>>> name would be good.
> > > >>>>>>>>>
> > > >>>>>>>>> To avoid unexpected mutation, how about to make the context
> > > >>>> immutable
> > > >>>>>>>>> and let the plugin return an immutable result? i.e. remove
> > > >> the
> > > >>>>>> setters
> > > >>>>>>>>> from the context, and let the plugin method return a result
> > > >>> which
> > > >>>>>>>>> contains `labels`, `canRestart` and `restartBackoffTime`.
> > > >> Flink
> > > >>>>>> should
> > > >>>>>>>>> apply the result to the context before invoking the next
> > > >>> plugin,
> > > >>>> so
> > > >>>>>>>>> that the next plugin will see the updated context.
> > > >>>>>>>>>
> > > >>>>>>>>> The plugin should avoid taking too much time to return the
> > > >>>> result,
> > > >>>>>>>> because
> > > >>>>>>>>> it will block the RPC and result in instability. However, it
> > > >>> can
> > > >>>>>> still
> > > >>>>>>>>> perform heavy actions in a different thread. The context can
> > > >>>> provide
> > > >>>>>> an
> > > >>>>>>>>> `ioExecutor` to the plugins for reuse.
> > > >>>>>>>>>
> > > >>>>>>>>> Thanks,
> > > >>>>>>>>> Zhu
> > > >>>>>>>>>
> > > >>>>>>>>> Shammon FY <[email protected]> 于2023年3月20日周一 20:21写道：
> > > >>>>>>>>>>
> > > >>>>>>>>>> Hi Panagiotis
> > > >>>>>>>>>>
> > > >>>>>>>>>> Thank you for your answer. I agree that `FailureListener`
> > > >>>> could be
> > > >>>>>>>>>> stateless, then I have some thoughts as follows
> > > >>>>>>>>>>
> > > >>>>>>>>>> 1. I see that listeners and tag collections are associated.
> > > >>>> When
> > > >>>>>>>>> JobManager
> > > >>>>>>>>>> fails and restarts, how can the new listener be associated
> > > >>>> with the
> > > >>>>>>> tag
> > > >>>>>>>>>> collection before failover? Is the listener loading order?
> > > >>>>>>>>>>
> > > >>>>>>>>>> 2. The tag collection may be too large, resulting in the
> > > >>>> JobManager
> > > >>>>>>>> OOM,
> > > >>>>>>>>> do
> > > >>>>>>>>>> we need to provide a management class that supports some
> > > >>>>>> obsolescence
> > > >>>>>>>>>> strategies instead of a direct Collection?
> > > >>>>>>>>>>
> > > >>>>>>>>>> 3. Is it possible to provide a more complex data structure
> > > >>>> than a
> > > >>>>>>>> simple
> > > >>>>>>>>>> string collection for tags in listeners, such as key-value?
> > > >>>>>>>>>>
> > > >>>>>>>>>> Best,
> > > >>>>>>>>>> Shammon FY
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Mon, Mar 20, 2023 at 7:48 PM Leonard Xu <
> > > >>> [email protected]>
> > > >>>>>>> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>>> Hi，Panagiotis
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Thank you for kicking off this discussion. Overall, the
> > > >>>> proposed
> > > >>>>>>>>> feature of
> > > >>>>>>>>>>> this FLIP makes sense to me. We have also discussed
> > > >> similar
> > > >>>>>>>>> requirements
> > > >>>>>>>>>>> with our users and developers, and I believe it will help
> > > >>>> many
> > > >>>>>>> users.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> In terms of FLIP content, I have some thoughts:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> (1) For the FailureListenerContextget interface, the
> > > >>> methods
> > > >>>>>>>>>>> FailureListenerContext#addTag and
> > > >>>> FailureListenerContextgetTags
> > > >>>>>>> looks
> > > >>>>>>>>> very
> > > >>>>>>>>>>> inconsistent because they imply specific implementation
> > > >>>> details,
> > > >>>>>>> and
> > > >>>>>>>>> not
> > > >>>>>>>>>>> all FailureListeners need to handle them, we shouldn't
> > > >> put
> > > >>>> them
> > > >>>>>> in
> > > >>>>>>>> the
> > > >>>>>>>>>>> interface. Minor: The comment "UDF loading" in the
> > > >>>>>>>> getUserClassLoader()
> > > >>>>>>>>>>> method looks like a typo, IIUC it should return the
> > > >>>> classloader
> > > >>>>>> of
> > > >>>>>>>> the
> > > >>>>>>>>>>> current job.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> (2) Regarding the implementation in
> > > >>>>>>>>> ExecutionFailureHandler#handleFailure,
> > > >>>>>>>>>>> some custom listeners may have heavy IO operations, such
> > > >> as
> > > >>>>>>> reporting
> > > >>>>>>>>> to
> > > >>>>>>>>>>> their monitoring system. The current logic appears to be
> > > >>>>>> processing
> > > >>>>>>>> in
> > > >>>>>>>>> the
> > > >>>>>>>>>>> JobMaster's main thread, and it is recommended not to do
> > > >>> this
> > > >>>>>> kind
> > > >>>>>>> of
> > > >>>>>>>>>>> processing in the main thread.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> (3) The results of FailureListener's processing and the
> > > >>>>>>>>>>> FailureHandlingResult returned by ExecutionFailureHandler
> > > >>>> are not
> > > >>>>>>>>> related.
> > > >>>>>>>>>>> I think these two are closely related, the motivation of
> > > >>> this
> > > >>>>>> FLIP
> > > >>>>>>> is
> > > >>>>>>>>> to
> > > >>>>>>>>>>> make current failure handling more flexible. From this
> > > >>>>>> perspective,
> > > >>>>>>>>>>> different listeners should have the opportunity to affect
> > > >>> the
> > > >>>>>> job's
> > > >>>>>>>>> failure
> > > >>>>>>>>>>> handling flow. For example, a Flink job is configured
> > > >> with
> > > >>> a
> > > >>>>>>>>>>> RestartStrategy with huge numbers retry , but the Kafka
> > > >>>> topic of
> > > >>>>>>>>> Source has
> > > >>>>>>>>>>> been deleted, the job will failover continuously. In this
> > > >>>> case,
> > > >>>>>> the
> > > >>>>>>>>> user
> > > >>>>>>>>>>> should have their listener to determine whether this
> > > >>> failure
> > > >>>> is
> > > >>>>>>>>> recoverable
> > > >>>>>>>>>>> or unrecoverable, and then wrap the processing result
> > > >> into
> > > >>>>>>>>>>> FailureHandlingResult.unrecoverable(xx) and pass it to
> > > >>>> JobMaster,
> > > >>>>>>>> this
> > > >>>>>>>>>>> approach will be more flexible.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> (4) All FLIPs have an important section named Public
> > > >>>> Interfaces.
> > > >>>>>>>>> Current
> > > >>>>>>>>>>> FLIP mixes the interface section and the implementation
> > > >>>> section
> > > >>>>>>>>> together.
> > > >>>>>>>>>>> It is better for us to refer to the FLIP template[1] and
> > > >>>> separate
> > > >>>>>>>> them,
> > > >>>>>>>>>>> this will make the entire FLIP clearer.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> In addition, regarding the FLIP process, there is a small
> > > >>>>>>> suggestion:
> > > >>>>>>>>> The
> > > >>>>>>>>>>> community generally creates a JIRA issue after the FLIP
> > > >>> vote
> > > >>>> is
> > > >>>>>>>> passed,
> > > >>>>>>>>>>> instead of during the FLIP preparation phase because the
> > > >>>> FLIP may
> > > >>>>>>> be
> > > >>>>>>>>>>> rejected. Although this FLIP is very reasonable, it's
> > > >>> better
> > > >>>> to
> > > >>>>>>>> follow
> > > >>>>>>>>> the
> > > >>>>>>>>>>> process.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Best,
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Leonard
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> [1]
> > > >>>>>>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP+Template
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On Mon, Mar 20, 2023 at 7:04 PM David Morávek <
> > > >>>> [email protected]>
> > > >>>>>>>> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> however listeners can use previous state
> > > >> (tags/labels)
> > > >>> to
> > > >>>>>> make
> > > >>>>>>>>>>> decisions
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> That sounds like a very fragile contract. We should
> > > >>> either
> > > >>>>>> allow
> > > >>>>>>>>> passing
> > > >>>>>>>>>>>> tags between listeners and then need to define ordering
> > > >>> or
> > > >>>> make
> > > >>>>>>> all
> > > >>>>>>>>> of
> > > >>>>>>>>>>> them
> > > >>>>>>>>>>>> independent. I prefer the latter because it allows us
> > > >> to
> > > >>>>>>>> parallelize
> > > >>>>>>>>>>> things
> > > >>>>>>>>>>>> if needed (if all listeners trigger an RCP to the
> > > >>> external
> > > >>>>>>> system,
> > > >>>>>>>>> for
> > > >>>>>>>>>>>> example).
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Can you expand on why we need more than one classifier
> > > >> to
> > > >>>> be
> > > >>>>>> able
> > > >>>>>>>> to
> > > >>>>>>>>>>> output
> > > >>>>>>>>>>>> the same tag?
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> system ones come first and then the ones loaded from
> > > >> the
> > > >>>> plugin
> > > >>>>>>>>> manager
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Since they're returned as a Set, the order is
> > > >> completely
> > > >>>>>>>>>>> non-deterministic,
> > > >>>>>>>>>>>> no matter in which order they're loaded.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> just communicating with external monitoring/alerting
> > > >>>> systems
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> That makes the need for pushing things out of the main
> > > >>>> thread
> > > >>>>>>> even
> > > >>>>>>>>>>>> stronger. This almost sounds like we need to return a
> > > >>>>>>>>> CompletableFuture
> > > >>>>>>>>>>> for
> > > >>>>>>>>>>>> the per-throwable classification because an external
> > > >>> system
> > > >>>>>> might
> > > >>>>>>>>> take a
> > > >>>>>>>>>>>> significant time to respond. We need to unblock the
> > > >> main
> > > >>>> thread
> > > >>>>>>> for
> > > >>>>>>>>> other
> > > >>>>>>>>>>>> RPCs.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Also, in the proposal, this happens in the failure
> > > >>>> handler. If
> > > >>>>>>>>> that's the
> > > >>>>>>>>>>>> case, this might block the job from being restarted (if
> > > >>> the
> > > >>>>>>> restart
> > > >>>>>>>>>>>> strategy allows for another restart), which would be
> > > >>> great
> > > >>>> to
> > > >>>>>>> avoid
> > > >>>>>>>>>>> because
> > > >>>>>>>>>>>> it can introduce extra downtime.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> This raises another question: what should happen if the
> > > >>>>>>>>> classification
> > > >>>>>>>>>>>> fails? Crashing the job (which is what's currently
> > > >>>> proposed)
> > > >>>>>>> seems
> > > >>>>>>>>> very
> > > >>>>>>>>>>>> dangerous if this might depend on an external system.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Thats a valid point, passing the JobGraph containing
> > > >> all
> > > >>>> the
> > > >>>>>>> above
> > > >>>>>>>>>>>>> information is also something to consider
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> We should avoid passing JG around because it's mutable
> > > >>>> (which
> > > >>>>>> we
> > > >>>>>>>>> must fix
> > > >>>>>>>>>>>> in the long term), and letting users change it might
> > > >> have
> > > >>>>>>>>> consequences.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Best,
> > > >>>>>>>>>>>> D.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> On Mon, Mar 20, 2023 at 7:23 AM Panagiotis Garefalakis
> > > >> <
> > > >>>>>>>>>>> [email protected]>
> > > >>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> Hey David, Shammon,
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Thanks for the valuable comments!
> > > >>>>>>>>>>>>> I am glad you find this proposal useful, some
> > > >> thoughts:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> @Shammon
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> 1. How about adding more job information in
> > > >>>>>>>>> FailureListenerContext? For
> > > >>>>>>>>>>>>>> example, job vertext, subtask, taskmanager
> > > >> location.
> > > >>>> And
> > > >>>>>> then
> > > >>>>>>>>> user
> > > >>>>>>>>>>> can
> > > >>>>>>>>>>>> do
> > > >>>>>>>>>>>>>> more statistics according to different dimensions.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Thats a valid point, passing the JobGraph containing
> > > >>> all
> > > >>>> the
> > > >>>>>>>> above
> > > >>>>>>>>>>>>> information
> > > >>>>>>>>>>>>> is also something to consider, I was mostly trying to
> > > >>> be
> > > >>>>>>>>> conservative:
> > > >>>>>>>>>>>>> i.e., passingly only the information we need, and
> > > >>> extend
> > > >>>> as
> > > >>>>>> we
> > > >>>>>>>> see
> > > >>>>>>>>> fit
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> 2. Users may want to save results in listener, and
> > > >> then
> > > >>>> they
> > > >>>>>>> can
> > > >>>>>>>>> get
> > > >>>>>>>>>>> the
> > > >>>>>>>>>>>>>> historical results even jabmanager failover. Can we
> > > >>>>>> provide a
> > > >>>>>>>>> unified
> > > >>>>>>>>>>>>>> implementation for data storage requirements?
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> The idea is to store only the output of the Listeners
> > > >>>> (tags)
> > > >>>>>>> and
> > > >>>>>>>>> treat
> > > >>>>>>>>>>>> them
> > > >>>>>>>>>>>>> as stateless.
> > > >>>>>>>>>>>>> Tags are be stored along with HistoryEntries, and
> > > >> will
> > > >>> be
> > > >>>>>>>> available
> > > >>>>>>>>>>>> through
> > > >>>>>>>>>>>>> the HistoryServer
> > > >>>>>>>>>>>>> even after a JM dies.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> @David
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> 1) Should we also consider adding labels? The
> > > >>>> combination of
> > > >>>>>>> tags
> > > >>>>>>>>> and
> > > >>>>>>>>>>>>>> labels seems to be what most systems offer;
> > > >>> sometimes,
> > > >>>> they
> > > >>>>>>>> offer
> > > >>>>>>>>>>>> labels
> > > >>>>>>>>>>>>>> only (key=value pairs) because tags can be
> > > >>> implemented
> > > >>>>>> using
> > > >>>>>>>>> those,
> > > >>>>>>>>>>> but
> > > >>>>>>>>>>>>> not
> > > >>>>>>>>>>>>>> the other way around.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Indeed changing tags to k:v labels could be more
> > > >>>> expressive,
> > > >>>>>> I
> > > >>>>>>>>> like it!
> > > >>>>>>>>>>>>> Let's see what others think.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> 2) Since we can not predict how heavy user-defined
> > > >>> models
> > > >>>>>>>>> ("listeners")
> > > >>>>>>>>>>>> are
> > > >>>>>>>>>>>>>> going to be, it would be great to keep the
> > > >>>> interfaces/data
> > > >>>>>>>>> structures
> > > >>>>>>>>>>>>>> immutable so we can push things over to the I/O
> > > >>>> threads.
> > > >>>>>>> Also,
> > > >>>>>>>> it
> > > >>>>>>>>>>>> sounds
> > > >>>>>>>>>>>>>> off to call the main interface a Listener since
> > > >> it's
> > > >>>>>> supposed
> > > >>>>>>>> to
> > > >>>>>>>>>>>> enhance
> > > >>>>>>>>>>>>>> the original throwable with additional metadata.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> The idea was for the name to be generic as there
> > > >> could
> > > >>> be
> > > >>>>>>>> Listener
> > > >>>>>>>>>>>>> implementations
> > > >>>>>>>>>>>>> just communicating with external monitoring/alerting
> > > >>>> systems
> > > >>>>>>> and
> > > >>>>>>>> no
> > > >>>>>>>>>>>>> metadata output
> > > >>>>>>>>>>>>> -- but lets rethink that. For immutability, see
> > > >> below:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> 3) You're proposing to support a set of listeners.
> > > >>> Since
> > > >>>>>> you're
> > > >>>>>>>>> passing
> > > >>>>>>>>>>>> the
> > > >>>>>>>>>>>>>> mutable context around, which includes tags set by
> > > >>> the
> > > >>>>>>> previous
> > > >>>>>>>>>>>> listener,
> > > >>>>>>>>>>>>>> do you expect users to make any assumptions about
> > > >> the
> > > >>>> order
> > > >>>>>>> in
> > > >>>>>>>>> which
> > > >>>>>>>>>>>>>> listeners are executed?
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> In the existing proposal we are not making any
> > > >>>> assumptions
> > > >>>>>>> about
> > > >>>>>>>>> the
> > > >>>>>>>>>>>> order
> > > >>>>>>>>>>>>> of listeners,
> > > >>>>>>>>>>>>> (system ones come first and then the ones loaded from
> > > >>> the
> > > >>>>>>> plugin
> > > >>>>>>>>>>> manager)
> > > >>>>>>>>>>>>> however listeners can use previous state
> > > >> (tags/labels)
> > > >>> to
> > > >>>>>> make
> > > >>>>>>>>>>> decisions:
> > > >>>>>>>>>>>>> e.g., wont assign *UNKNOWN* failureType when we have
> > > >>>> already
> > > >>>>>>> seen
> > > >>>>>>>>> *USER
> > > >>>>>>>>>>>> *or
> > > >>>>>>>>>>>>> the other way around -- when we have seen *UNKNOWN*
> > > >>>> remove in
> > > >>>>>>>>> favor of
> > > >>>>>>>>>>>>> *USER*
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Cheers,
> > > >>>>>>>>>>>>> Panagiotis
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> On Sun, Mar 19, 2023 at 10:42 AM David Morávek <
> > > >>>>>>> [email protected]>
> > > >>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Hi Panagiotis,
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> This is an excellent proposal and something
> > > >> everyone
> > > >>>> trying
> > > >>>>>>> to
> > > >>>>>>>>>>> provide
> > > >>>>>>>>>>>>>> "Flink as a service" needs to solve at some point.
> > > >> I
> > > >>>> have a
> > > >>>>>>>>> couple of
> > > >>>>>>>>>>>>>> questions:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> If I understand the proposal correctly, this is
> > > >> just
> > > >>>> about
> > > >>>>>>>> adding
> > > >>>>>>>>>>> tags
> > > >>>>>>>>>>>> to
> > > >>>>>>>>>>>>>> the Throwable by running a tuple of (Throwable,
> > > >>>>>>> FailureContext)
> > > >>>>>>>>>>>> through a
> > > >>>>>>>>>>>>>> user-defined model.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> 1) Should we also consider adding labels? The
> > > >>>> combination
> > > >>>>>> of
> > > >>>>>>>>> tags and
> > > >>>>>>>>>>>>>> labels seems to be what most systems offer;
> > > >>> sometimes,
> > > >>>> they
> > > >>>>>>>> offer
> > > >>>>>>>>>>>> labels
> > > >>>>>>>>>>>>>> only (key=value pairs) because tags can be
> > > >>> implemented
> > > >>>>>> using
> > > >>>>>>>>> those,
> > > >>>>>>>>>>> but
> > > >>>>>>>>>>>>> not
> > > >>>>>>>>>>>>>> the other way around.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> 2) Since we can not predict how heavy user-defined
> > > >>>> models
> > > >>>>>>>>>>> ("listeners")
> > > >>>>>>>>>>>>> are
> > > >>>>>>>>>>>>>> going to be, it would be great to keep the
> > > >>>> interfaces/data
> > > >>>>>>>>> structures
> > > >>>>>>>>>>>>>> immutable so we can push things over to the I/O
> > > >>>> threads.
> > > >>>>>>> Also,
> > > >>>>>>>> it
> > > >>>>>>>>>>>> sounds
> > > >>>>>>>>>>>>>> off to call the main interface a Listener since
> > > >> it's
> > > >>>>>> supposed
> > > >>>>>>>> to
> > > >>>>>>>>>>>> enhance
> > > >>>>>>>>>>>>>> the original throwable with additional metadata.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> I'd propose something along the lines of (we should
> > > >>>> have
> > > >>>>>>> better
> > > >>>>>>>>>>> names,
> > > >>>>>>>>>>>>> this
> > > >>>>>>>>>>>>>> is just to outline the idea):
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> interface FailureEnricher {
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>  ThrowableWithTagsAndLabels
> > > >> enrichFailure(Throwable
> > > >>>> cause,
> > > >>>>>>>>>>>>>> ImmutableContextualMetadataAboutTheThrowable
> > > >>> context);
> > > >>>>>>>>>>>>>> }
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> The names should change; this is just to outline
> > > >> the
> > > >>>> idea.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> 3) You're proposing to support a set of listeners.
> > > >>>> Since
> > > >>>>>>> you're
> > > >>>>>>>>>>> passing
> > > >>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>> mutable context around, which includes tags set by
> > > >>> the
> > > >>>>>>> previous
> > > >>>>>>>>>>>> listener,
> > > >>>>>>>>>>>>>> do you expect users to make any assumptions about
> > > >> the
> > > >>>> order
> > > >>>>>>> in
> > > >>>>>>>>> which
> > > >>>>>>>>>>>>>> listeners are executed?
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> *@Shammon*
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Users may want to save results in listener, and
> > > >> then
> > > >>>> they
> > > >>>>>> can
> > > >>>>>>>>> get the
> > > >>>>>>>>>>>>>>> historical results even jabmanager failover. Can
> > > >> we
> > > >>>>>>> provide a
> > > >>>>>>>>>>> unified
> > > >>>>>>>>>>>>>>> implementation for data storage requirements?
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> I think we should explicitly state that all
> > > >>>> "listeners" are
> > > >>>>>>>>> treated
> > > >>>>>>>>>>> as
> > > >>>>>>>>>>>>>> stateless. I don't see any strong reason for
> > > >>>> snapshotting
> > > >>>>>>> them.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Best,
> > > >>>>>>>>>>>>>> D.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> On Sat, Mar 18, 2023 at 1:00 AM Shammon FY <
> > > >>>>>>> [email protected]>
> > > >>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Hi Panagiotis
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Thank you for starting this discussion. I think
> > > >>> this
> > > >>>> FLIP
> > > >>>>>>> is
> > > >>>>>>>>>>> valuable
> > > >>>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>> can help user to analyze the causes of job
> > > >> failover
> > > >>>>>> better!
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> I have two comments as follows
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> 1. How about adding more job information in
> > > >>>>>>>>> FailureListenerContext?
> > > >>>>>>>>>>>> For
> > > >>>>>>>>>>>>>>> example, job vertext, subtask, taskmanager
> > > >>> location.
> > > >>>> And
> > > >>>>>>> then
> > > >>>>>>>>> user
> > > >>>>>>>>>>>> can
> > > >>>>>>>>>>>>> do
> > > >>>>>>>>>>>>>>> more statistics according to different
> > > >> dimensions.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> 2. Users may want to save results in listener,
> > > >> and
> > > >>>> then
> > > >>>>>>> they
> > > >>>>>>>>> can
> > > >>>>>>>>>>> get
> > > >>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>> historical results even jabmanager failover. Can
> > > >> we
> > > >>>>>>> provide a
> > > >>>>>>>>>>> unified
> > > >>>>>>>>>>>>>>> implementation for data storage requirements?
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Best,
> > > >>>>>>>>>>>>>>> shammon FY
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> On Saturday, March 18, 2023, Panagiotis
> > > >>> Garefalakis <
> > > >>>>>>>>>>>> [email protected]
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Hi everyone,
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> This FLIP [1] proposes a pluggable interface
> > > >> for
> > > >>>>>> failure
> > > >>>>>>>>> handling
> > > >>>>>>>>>>>>>>> allowing
> > > >>>>>>>>>>>>>>>> users to implement custom failure logic using
> > > >> the
> > > >>>>>> plugin
> > > >>>>>>>>>>> framework.
> > > >>>>>>>>>>>>>>>> Motivated by existing proposals [2] and tickets
> > > >>>> [3],
> > > >>>>>> this
> > > >>>>>>>>> enables
> > > >>>>>>>>>>>>>>> use-cases
> > > >>>>>>>>>>>>>>>> like: assigning particular types to failures
> > > >>> (e.g.,
> > > >>>>>> User
> > > >>>>>>> or
> > > >>>>>>>>>>>> System),
> > > >>>>>>>>>>>>>>>> emitting custom metrics per type (e.g.,
> > > >>>> application or
> > > >>>>>>>>> platform),
> > > >>>>>>>>>>>>> even
> > > >>>>>>>>>>>>>>>> exposing errors to downstream consumers (e.g.,
> > > >>>>>>> notification
> > > >>>>>>>>>>>> systems).
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Thanks to Piotr and Anton for the initial
> > > >> reviews
> > > >>>> and
> > > >>>>>>>>>>> discussions!
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> For anyone interested, the starting point would
> > > >>> be
> > > >>>> the
> > > >>>>>>> FLIP
> > > >>>>>>>>> [1]
> > > >>>>>>>>>>>> that
> > > >>>>>>>>>>>>> I
> > > >>>>>>>>>>>>>>>> created,
> > > >>>>>>>>>>>>>>>> describing the motivation and the proposed
> > > >>> changes
> > > >>>>>> (part
> > > >>>>>>> of
> > > >>>>>>>>> the
> > > >>>>>>>>>>>> core,
> > > >>>>>>>>>>>>>>>> runtime and web).
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> The intuition behind this FLIP is being able to
> > > >>>> execute
> > > >>>>>>>>> custom
> > > >>>>>>>>>>>> logic
> > > >>>>>>>>>>>>> on
> > > >>>>>>>>>>>>>>>> failures by exposing a FailureListener
> > > >> interface.
> > > >>>>>>>>> Implementation
> > > >>>>>>>>>>> by
> > > >>>>>>>>>>>>>> users
> > > >>>>>>>>>>>>>>>> can be simply loaded to the system as Jar
> > > >> files.
> > > >>>>>>>>> FailureListeners
> > > >>>>>>>>>>>> may
> > > >>>>>>>>>>>>>>> also
> > > >>>>>>>>>>>>>>>> decide to assign failure tags to errors
> > > >>> (expressed
> > > >>>> as
> > > >>>>>>>>> strings),
> > > >>>>>>>>>>>>>>>> that will then be exposed as metadata by the
> > > >>>> UI/Rest
> > > >>>>>>>>> interfaces.
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Feedback is always appreciated! Looking forward
> > > >>> to
> > > >>>> your
> > > >>>>>>>>> thoughts!
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> [1]
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-304%
> > > >>>>>>>>>>>>>>>> 3A+Pluggable+failure+handling+for+Apache+Flink
> > > >>>>>>>>>>>>>>>> [2]
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>> https://docs.google.com/document/d/1pcHg9F3GoDDeVD5GIIo2wO67
> > > >>>>>>>>>>>>>>>> Hmjgy0-hRDeuFnrMgT4
> > > >>>>>>>>>>>>>>>> [3]
> > > >>>> https://issues.apache.org/jira/browse/FLINK-20833
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Cheers,
> > > >>>>>>>>>>>>>>>> Panagiotis
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>
> > > >>>
> > > >>
> > >
> > >
> >
>

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

Reply via email to