Thanks Panagiotis for the update,
the updated FLIP looks good to me.
Best,
Leonard
> On Apr 13, 2023, at 7:42 AM, Panagiotis Garefalakis wrote:
>
> Hello there,
>
> Zhu: agree with the config option, great suggestion
> Hong: global timeout is also interesting and a good addition -- only
>
Hello there,
Zhu: agree with the config option, great suggestion
Hong: global timeout is also interesting and a good addition -- only
downside I see is just another config option
If everyone is happy, I suggest we keep the discussion open until Friday
and start a Vote shortly after.
Cheers,
Pana
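For illustration, the global timeout idea could be sketched roughly as below. This is only a sketch under assumptions: `Enricher`, `enrichAll`, and `NO_LABELS` are hypothetical names and not the FLIP-304 API, and the choice to silently skip enrichers that time out or fail is likewise an assumption.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

class EnrichmentTimeoutSketch {

    /** Hypothetical enricher: asynchronously returns labels for a failure. */
    interface Enricher {
        CompletableFuture<Map<String, String>> processFailure(Throwable cause);
    }

    private static final Map<String, String> NO_LABELS = Map.of();

    /**
     * Starts all enrichers, then waits for their results. Every future gets
     * the same timeout, and all are started back to back, so the total wait
     * is bounded by roughly one timeout. Enrichers that time out or fail
     * contribute no labels (an assumption of this sketch).
     */
    static Map<String, String> enrichAll(
            List<Enricher> enrichers, Throwable cause, Duration timeout) {
        List<CompletableFuture<Map<String, String>>> futures = new ArrayList<>();
        for (Enricher enricher : enrichers) {
            futures.add(
                    enricher.processFailure(cause)
                            .completeOnTimeout(
                                    NO_LABELS, timeout.toMillis(), TimeUnit.MILLISECONDS)
                            .exceptionally(error -> NO_LABELS));
        }
        Map<String, String> labels = new HashMap<>();
        for (CompletableFuture<Map<String, String>> future : futures) {
            labels.putAll(future.join()); // cannot throw: failures were mapped above
        }
        return labels;
    }
}
```

With this shape a single timeout value bounds the delay before the job restarts, at the cost of one more config option, as noted above.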
Hi Panagiotis,
Thank you for the update. Looks great! Just one suggestion below:
1. We seem to be waiting for the future(s) to complete before restarting the
job - should we add a configurable timeout for the enrichment? Since the
failure enrichers are run in parallel, we could probably settle
Hi Panagiotis,
Thanks for updating the FLIP.
> Regarding the config option `jobmanager.failure-enricher-plugins.enabled`
I think a config option `jobmanager.failure-enrichers`, which accepts
the names of enrichers to use, may be better. It allows the users to
deploy and use the plugins in a more
Hello again everyone,
FLIP is now updated based on our discussion!
In short, FLIP-304 [1] proposes the addition of a pluggable interface that
will allow users to add custom logic and enrich failures with custom
metadata labels.
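
As a rough illustration only, an enricher that attaches a label to a failure might look like the sketch below. The interface name, method signature, and classification rule are hypothetical and are not taken from the FLIP text.

```java
import java.util.Map;

class FailureLabelSketch {

    /** Hypothetical plugin interface: maps a failure to key/value labels. */
    interface FailureEnricher {
        Map<String, String> enrich(Throwable cause);
    }

    /** Example enricher that labels a failure as USER or SYSTEM (illustrative rule). */
    static class TypeEnricher implements FailureEnricher {
        @Override
        public Map<String, String> enrich(Throwable cause) {
            // Assumption: these exception types stand in for "user code" errors.
            boolean userError =
                    cause instanceof IllegalArgumentException
                            || cause instanceof ArithmeticException;
            return Map.of("type", userError ? "USER" : "SYSTEM");
        }
    }
}
```
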
While as discussed, custom restart strategies will be part of a differ
Hi Panagiotis,
How about introducing a config option to control which error handling
plugins should be used? It is more flexible for deployments. Additionally,
it would enable users to explicitly specify the order in which the plugins
take effect.
Thanks,
Zhu
On Mon, Mar 27, 2023 at 15:02, Gen Luo wrote:
>
>
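A minimal sketch of how such an option might select and order plugins, assuming a comma-separated list of names (the separator, the class name, and the choice to ignore unknown names are assumptions of this sketch, not part of the proposal):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class EnricherSelectionSketch {

    /** Splits the option value into unique, trimmed names, keeping their order. */
    static Set<String> parseNames(String optionValue) {
        Set<String> names = new LinkedHashSet<>();
        for (String raw : optionValue.split(",")) {
            String name = raw.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    /**
     * Returns the discovered plugins named in the option, in the configured
     * order. Names with no matching plugin are ignored in this sketch.
     */
    static <T> List<T> select(String optionValue, Map<String, T> discovered) {
        List<T> selected = new ArrayList<>();
        for (String name : parseNames(optionValue)) {
            T plugin = discovered.get(name);
            if (plugin != null) {
                selected.add(plugin);
            }
        }
        return selected;
    }
}
```

Here the order of names in the option value decides the order in which the selected plugins are returned, and a plugin that is deployed but not listed is never used.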
Thanks for the summary!
Also +1 to support custom restart strategies in a different FLIP,
as long as we can make sure that the plugin interface won't be
changed when the restart strategy interface is introduced.
To achieve this, maybe we should think carefully about how the handler
would cooperate with the
+1 to support custom restart strategies in a different FLIP.
It's fine to have a different plugin for custom restart strategy.
If so, since we do not treat the FLIP-304 plugin as a common failure
handler, but instead mainly intend it to add labels to errors, I would
+1 for the name `FailureEnricher`
>
> One additional remark on introducing it as an async operation: We would
> need a new configuration parameter to define the timeout for such a
> listener call, wouldn't we?
>
This could be left up to the implementor to handle.
What about adding an extra method getNamespace() to the Listener in
Sounds good. Two points I want to add:
- Listener execution should be independent — however we need a way to
> enforce a Label key/key-prefix is only assigned to a single Listener,
> thinking of a validation step both at Listener init and runtime stages
>
What about adding an extra method getNa
Hi everyone,
Thanks for the valuable comments!
Excited to see this is an area of interest for the community!
Summarizing some of the main points raised along with my thoughts:
- Labels (Key/Value pairs) are more expressive than Tags (Strings) so
using the former is a good idea — I am also
Hi all,
Thank you Panagiotis for proposing this. From the size of the thread, this is a
much needed feature in Flink!
Some thoughts, to extend those already adeptly summarised by Piotr, Matthias
and Jing.
- scope of FLIP: +1 to scoping this FLIP to observability around a restart.
That would in
Hi,
Thanks Panagiotis for this FLIP and thanks for all valuable discussions.
I'd like to share my two cents:
- FailureListenerContext#addTag and FailureListenerContext#getTags. It
seems that we have to call getTags() and then remove entries ourselves if we
want to delete any tags (according to the ja
Thanks for the proposal, Panagiotis. A lot of good points have been already
shared. I just want to add my view on some of the items:
- independent execution vs ordered execution: I prefer the listeners being
processed independently from each other because it adds less complexity
code-wise. The use
*@Piotr*
> I was thinking about actually defining the order of the
> classifiers/handlers and not allowing them to be asynchronous.
> Asynchronicity would create some problems: when to actually return the
> error to the user? After all async responses have come back? Before, but
> without classif
Hi Piotr,
It's fine to me to have a separate FLIP to extend this `FailureListener`
to support custom restart strategy.
What I am a bit concerned about is that if we just treat the `FailureListener`
as an error classifier which is not crucial to the Flink framework's process,
we may design it to run asynchro
Hi Panagiotis,
Thanks for the proposal. This is a very valuable feature and will be a good
add-on for Flink.
I also think that it will be great if we can consider how to make it
possible for users to customize the failure handling in this FLIP. It's
highly related to the problem we want to addres
Hi,
It would be nice to have the custom error handler, but at the same time I'm
not sure if we really have to bundle those two things together in the same
FLIP. We can always make sure that the interface introduced here, in
FLIP-304, can be expanded in the future.
Re @David and @Leonard.
> The m
Hi Panagiotis,
Thanks for the proposal.
It's useful to enrich the information so that users can understand more
clearly why the job is failing, especially platform developers who
need to provide the information to their end users.
As for this FLIP, I'd prefer the naming `FailureEnricher`
proposed by D
Hi Panagiotis,
Thanks for your proposal. It is valuable to be able to analyze the reason
for a failure with a user plugin.
Making the context immutable could make the contract stronger.
Letting the listener return an enriching result may be a better way.
IIUC, listeners could do two things, enrich more in
Hi Panagiotis,
Thanks for driving this.
+1 for supporting custom restart strategy, we did receive such requests
from the user mailing list [1][2].
Besides, in the current design, the plugin will only do some statistical and
classification work, and will not affect the *FailureHandlingResult*. Just
l
Hi Panagiotis,
Thanks for creating this proposal! It's good to enable Flink to handle
different errors in different ways, in a pluggable way.
There are requests for flexible restart strategies from time to time, for
different strategies of restart backoff time, or to suppress restarting
on c
Hi Panagiotis
Thank you for your answer. I agree that `FailureListener` could be
stateless. I have some thoughts as follows:
1. I see that listeners and tag collections are associated. When JobManager
fails and restarts, how can the new listener be associated with the tag
collection before fa
Hi Panagiotis,
Thank you for kicking off this discussion. Overall, the proposed feature of
this FLIP makes sense to me. We have also discussed similar requirements
with our users and developers, and I believe it will help many users.
In terms of FLIP content, I have some thoughts:
(1) For the F
>
> however listeners can use previous state (tags/labels) to make decisions
That sounds like a very fragile contract. We should either allow passing
tags between listeners, which then requires defining an ordering, or make all of them
independent. I prefer the latter because it allows us to parallelize th
Hey David, Shammon,
Thanks for the valuable comments!
I am glad you find this proposal useful, some thoughts:
@Shammon
> 1. How about adding more job information in FailureListenerContext? For
> example, job vertex, subtask, taskmanager location. Then users can do
> more statistics according t
Hi Panagiotis,
This is an excellent proposal and something everyone trying to provide
"Flink as a service" needs to solve at some point. I have a couple of
questions:
If I understand the proposal correctly, this is just about adding tags to
the Throwable by running a tuple of (Throwable, FailureC
Hi Panagiotis
Thank you for starting this discussion. I think this FLIP is valuable and
can help users analyze the causes of job failover better!
I have two comments as follows:
1. How about adding more job information in FailureListenerContext? For
example, job vertex, subtask, taskmanager lo
Hi everyone,
This FLIP [1] proposes a pluggable interface for failure handling allowing
users to implement custom failure logic using the plugin framework.
Motivated by existing proposals [2] and tickets [3], this enables use-cases
like: assigning particular types to failures (e.g., User or System