Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-04-12 Thread Leonard Xu
Thanks Panagiotis for the update, the updated FLIP looks good to me.
Best, Leonard
> On Apr 13, 2023, at 7:42 AM, Panagiotis Garefalakis wrote:
> Hello there,
> Zhu: agree with the config option, great suggestion
> Hong: global timeout is also interesting and a good addition -- only

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-04-12 Thread Panagiotis Garefalakis
Hello there,
Zhu: agree with the config option, great suggestion
Hong: global timeout is also interesting and a good addition -- only downside I see is just another config option
If everyone is happy, I suggest we keep the discussion open until Friday and start a Vote shortly after.
Cheers, Pana

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-04-11 Thread Teoh, Hong
Hi Panagiotis, Thank you for the update. Looks great! Just one suggestion below: 1. We seem to be waiting for the future(s) to complete before restarting the job - should we add a configurable timeout for the enrichment? Since the failure enrichers are run in parallel, we could probably settle
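As a rough illustration of the timeout idea raised here, the sketch below bounds how long a restart would wait on enrichers that were started in parallel. It is not the FLIP-304 API: the class `EnrichmentTimeoutSketch`, the method `enrichWithTimeout`, and the modelling of an enricher as a plain `Function` are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical sketch, not Flink code: an "enricher" is modelled as a plain
// function from the failure cause to a map of labels.
final class EnrichmentTimeoutSketch {

    /**
     * Starts every enricher asynchronously and waits at most timeoutMillis
     * for each of them. Enrichers that time out or throw contribute no labels.
     */
    static Map<String, String> enrichWithTimeout(
            Throwable cause,
            List<Function<Throwable, Map<String, String>>> enrichers,
            long timeoutMillis) {
        List<CompletableFuture<Map<String, String>>> futures = new ArrayList<>();
        for (Function<Throwable, Map<String, String>> enricher : enrichers) {
            futures.add(
                    CompletableFuture.supplyAsync(() -> enricher.apply(cause))
                            // Fall back to "no labels" once the deadline passes.
                            .completeOnTimeout(Map.of(), timeoutMillis, TimeUnit.MILLISECONDS)
                            // A failing enricher must not block the restart either.
                            .exceptionally(error -> Map.of()));
        }
        Map<String, String> labels = new HashMap<>();
        for (CompletableFuture<Map<String, String>> future : futures) {
            labels.putAll(future.join());
        }
        return labels;
    }

    public static void main(String[] args) {
        Function<Throwable, Map<String, String>> classify =
                cause -> Map.of("type", cause instanceof IllegalArgumentException ? "USER" : "SYSTEM");
        System.out.println(
                enrichWithTimeout(new IllegalArgumentException("bad input"), List.of(classify), 500L));
    }
}
```

Because all enrichers start together, the single timeout value acts roughly like a global bound on the wait. A timed-out enricher keeps running in the background in this sketch, so a real implementation would probably also want to cancel or isolate it.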

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-04-11 Thread Zhu Zhu
Hi Panagiotis, Thanks for updating the FLIP.
> Regarding the config option `jobmanager.failure-enricher-plugins.enabled`
I think a config option `jobmanager.failure-enrichers`, which accepts the names of enrichers to use, may be better. It allows the users to deploy and use the plugins in a more
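To make the name-based option concrete, here is a minimal sketch of how a value for `jobmanager.failure-enrichers` might be interpreted. The comma-separated format, the class `EnricherSelectionSketch`, its methods, and the choice to reject unknown names are assumptions for illustration; the thread does not specify them.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not Flink code.
final class EnricherSelectionSketch {

    /** Splits the raw option value into distinct names, keeping the configured order. */
    static Set<String> parseEnricherNames(String configValue) {
        Set<String> names = new LinkedHashSet<>();
        if (configValue == null) {
            return names;
        }
        for (String part : configValue.split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    /** Picks the configured enrichers out of everything found on the plugin path. */
    static <E> List<E> select(String configValue, Map<String, E> discovered) {
        List<E> selected = new ArrayList<>();
        for (String name : parseEnricherNames(configValue)) {
            E enricher = discovered.get(name);
            if (enricher == null) {
                // Assumption: a misspelled name should fail fast rather than be ignored.
                throw new IllegalArgumentException("Unknown failure enricher: " + name);
            }
            selected.add(enricher);
        }
        return selected;
    }

    public static void main(String[] args) {
        System.out.println(parseEnricherNames("com.example.TypeEnricher, com.example.OwnerEnricher"));
    }
}
```

With this shape, plugins that are deployed but not named in the option stay inactive, which is the deployment flexibility the suggestion is after.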

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-04-09 Thread Panagiotis Garefalakis
Hello again everyone, FLIP is now updated based on our discussion! In short, FLIP-304 [1] proposes the addition of a pluggable interface that will allow users to add custom logic and enrich failures with custom metadata labels. While as discussed, custom restart strategies will be part of a differ

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-28 Thread Zhu Zhu
Hi Panagiotis, How about introducing a config option to control which error handling plugins should be used? It is more flexible for deployments. Additionally, it can also enable users to explicitly specify the order in which the plugins take effect. Thanks, Zhu On Mon, Mar 27, 2023 at 15:02, Gen Luo wrote: > >

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-27 Thread Gen Luo
Thanks for the summary! Also +1 to support custom restart strategies in a different FLIP, as long as we can make sure that the plugin interface won't be changed when the restart strategy interface is introduced. To achieve this, maybe we should think carefully about how the handler would cooperate with the

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-23 Thread Zhu Zhu
+1 to support custom restart strategies in a different FLIP. It's fine to have a different plugin for custom restart strategy. If so, since we do not treat the FLIP-304 plugin as a common failure handler, but instead mainly aim to add labels to errors, I would +1 for the name `FailureEnricher`

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-23 Thread David Morávek
> One additional remark on introducing it as an async operation: We would
> need a new configuration parameter to define the timeout for such a
> listener call, wouldn't we?
This could be left up to the implementor to handle.
What about adding an extra method getNamespace() to the Listener in

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-23 Thread Matthias Pohl
Sounds good. Two points I want to add:
> - Listener execution should be independent; however we need a way to
> enforce a Label key/key-prefix is only assigned to a single Listener,
> thinking of a validation step both at Listener init and runtime stages
What about adding an extra method getNa
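The two validation stages mentioned here (listener init and runtime) could look roughly like the sketch below, assuming each listener declares one namespace, in the spirit of the getNamespace() idea raised in this thread. The class `LabelNamespaceSketch`, both method names, and the `namespace.key` convention are hypothetical and not taken from the FLIP.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not Flink code.
final class LabelNamespaceSketch {

    /** Init-time check: no two listeners may claim the same namespace. */
    static void validateUniqueNamespaces(List<String> namespaces) {
        Set<String> seen = new HashSet<>();
        for (String namespace : namespaces) {
            if (!seen.add(namespace)) {
                throw new IllegalStateException(
                        "Namespace claimed by more than one listener: " + namespace);
            }
        }
    }

    /** Runtime check: every label key a listener emits must sit under its own namespace. */
    static void validateLabelKeys(String namespace, Map<String, String> labels) {
        String prefix = namespace + ".";
        for (String key : labels.keySet()) {
            if (!key.startsWith(prefix)) {
                throw new IllegalStateException(
                        "Label key '" + key + "' is outside namespace '" + namespace + "'");
            }
        }
    }

    public static void main(String[] args) {
        validateUniqueNamespaces(List.of("type", "owner"));
        validateLabelKeys("type", Map.of("type.category", "USER"));
        System.out.println("validation passed");
    }
}
```

This simple prefix check does not handle nested namespaces (for example `a` and `a.b` both matching `a.b.key`); a real design would need to decide whether such overlap is allowed.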

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-22 Thread Panagiotis Garefalakis
Hi everyone, Thanks for the valuable comments! Excited to see this is an area of interest for the community! Summarizing some of the main points raised along with my thoughts: - Labels (Key/Value pairs) are more expressive than Tags (Strings), so using the former is a good idea -- I am also

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-22 Thread Hong Teoh
Hi all, Thank you Panagiotis for proposing this. From the size of the thread, this is a much needed feature in Flink! Some thoughts, to extend those already adeptly summarised by Piotr, Matthias and Jing. - scope of FLIP: +1 to scoping this FLIP to observability around a restart. That would in

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-21 Thread Jing Ge
Hi, Thanks Panagiotis for this FLIP and thanks for all the valuable discussions. I'd like to share my two cents: - FailureListenerContext#addTag and FailureListenerContext#getTags. It seems that we have to call getTags() and then remove entries from the result if we want to delete any tags (according to the ja

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-21 Thread Matthias Pohl
Thanks for the proposal, Panagiotis. A lot of good points have already been shared. I just want to add my view on some of the items: - independent execution vs ordered execution: I prefer the listeners being processed independently from each other because it adds less complexity code-wise. The use

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-21 Thread David Morávek
*@Piotr*
> I was thinking about actually defining the order of the
> classifiers/handlers and not allowing them to be asynchronous.
> Asynchronicity would create some problems: when to actually return the
> error to the user? After all async responses will get back? Before, but
> without classif

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-21 Thread Zhu Zhu
Hi Piotr, It's fine with me to have a separate FLIP to extend this `FailureListener` to support custom restart strategy. What I was a bit concerned about is that if we just treat the `FailureListener` as an error classifier which is not crucial to Flink framework process, we may design it to run asynchro

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-21 Thread Dian Fu
Hi Panagiotis, Thanks for the proposal. This is a very valuable feature and will be a good add-on for Flink. I also think that it will be great if we can consider how to make it possible for users to customize the failure handling in this FLIP. It's highly related to the problem we want to addres

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-21 Thread Piotr Nowojski
Hi, It would be nice to have the custom error handler, but at the same time I'm not sure if we really have to bundle those two things together in the same FLIP. We can always make sure that the interface introduced here, in FLIP-304, can be expanded in the future. Re @David and @Leonard. > The m

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-21 Thread Gen Luo
Hi Panagiotis, Thanks for the proposal. It's useful to enrich the information so that users can understand more clearly why the job is failing, especially platform developers who need to provide the information to their end users. As for this FLIP, I'd prefer the naming `FailureEnricher` proposed by D

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-20 Thread Weihua Hu
Hi Panagiotis, Thanks for your proposal. It is valuable to analyze the reason for failure with the user plug-in. Making the context immutable could make the contract stronger. Letting the listener return an enriching result may be a better way. IIUC, listeners could do two things, enrich more in
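A minimal sketch of the "immutable context, listener returns a result" contract suggested here is shown below. It does not reflect the interfaces in the FLIP: `ImmutableContextSketch`, `FailureContext`, `Listener`, and `collectLabels` are hypothetical names chosen for this example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Flink code.
final class ImmutableContextSketch {

    /** Read-only snapshot handed to every listener; it exposes no mutators. */
    static final class FailureContext {
        private final Throwable cause;
        private final Map<String, String> jobInfo;

        FailureContext(Throwable cause, Map<String, String> jobInfo) {
            this.cause = cause;
            this.jobInfo = Map.copyOf(jobInfo); // defensive, unmodifiable copy
        }

        Throwable cause() {
            return cause;
        }

        Map<String, String> jobInfo() {
            return jobInfo;
        }
    }

    /** A listener reports its findings as a return value instead of mutating the context. */
    interface Listener {
        Map<String, String> onFailure(FailureContext context);
    }

    /** The framework, not the listeners, owns the merge of all returned labels. */
    static Map<String, String> collectLabels(FailureContext context, List<Listener> listeners) {
        Map<String, String> merged = new HashMap<>();
        for (Listener listener : listeners) {
            merged.putAll(listener.onFailure(context));
        }
        return Map.copyOf(merged);
    }

    public static void main(String[] args) {
        FailureContext context =
                new FailureContext(new IllegalStateException("disk full"), Map.of("job", "demo"));
        Listener byMessage = ctx -> Map.of("cause", ctx.cause().getMessage());
        System.out.println(collectLabels(context, List.of(byMessage)));
    }
}
```

Since listeners cannot see or alter each other's output in this shape, they stay independent and no ordering between them needs to be defined.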

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-20 Thread Lijie Wang
Hi Panagiotis, Thanks for driving this. +1 for supporting custom restart strategy, we did receive such requests from the user mailing list [1][2]. Besides, in the current design, the plugin will only do some statistical and classification work, and will not affect the *FailureHandlingResult*. Just l

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-20 Thread Zhu Zhu
Hi Panagiotis, Thanks for creating this proposal! It's good to enable Flink to handle different errors in different ways, in a pluggable way. There are requests for flexible restart strategies from time to time, for different strategies of restart backoff time, or to suppress restarting on c

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-20 Thread Shammon FY
Hi Panagiotis, Thank you for your answer. I agree that `FailureListener` could be stateless; I have some thoughts as follows: 1. I see that listeners and tag collections are associated. When JobManager fails and restarts, how can the new listener be associated with the tag collection before fa

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-20 Thread Leonard Xu
Hi Panagiotis, Thank you for kicking off this discussion. Overall, the proposed feature of this FLIP makes sense to me. We have also discussed similar requirements with our users and developers, and I believe it will help many users. In terms of FLIP content, I have some thoughts: (1) For the F

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-20 Thread David Morávek
> however listeners can use previous state (tags/labels) to make decisions
That sounds like a very fragile contract. We should either allow passing tags between listeners and then need to define ordering or make all of them independent. I prefer the latter because it allows us to parallelize th

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-19 Thread Panagiotis Garefalakis
Hey David, Shammon, Thanks for the valuable comments! I am glad you find this proposal useful, some thoughts:
@Shammon
> 1. How about adding more job information in FailureListenerContext? For
> example, job vertex, subtask, taskmanager location. And then user can do
> more statistics according t

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-19 Thread David Morávek
Hi Panagiotis, This is an excellent proposal and something everyone trying to provide "Flink as a service" needs to solve at some point. I have a couple of questions: If I understand the proposal correctly, this is just about adding tags to the Throwable by running a tuple of (Throwable, FailureC

[DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-17 Thread Shammon FY
Hi Panagiotis, Thank you for starting this discussion. I think this FLIP is valuable and can help users analyze the causes of job failover better! I have two comments as follows: 1. How about adding more job information in FailureListenerContext? For example, job vertex, subtask, taskmanager lo

[DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

2023-03-17 Thread Panagiotis Garefalakis
Hi everyone, This FLIP [1] proposes a pluggable interface for failure handling allowing users to implement custom failure logic using the plugin framework. Motivated by existing proposals [2] and tickets [3], this enables use-cases like: assigning particular types to failures (e.g., User or System