Re: [ DISCUSS ] FLIP-XXX : [Plugin] Enhancing Flink Failure Management in Kubernetes with Dynamic Termination Log Integration

Swathi C Thu, 18 Apr 2024 09:36:57 -0700

Hi Gyula and Ahmed,

Thanks for reviewing this.

@gyula.f...@gmail.com <gyula.f...@gmail.com>, our aim in this FLIP was only
to fail the cluster when the JobManager/Flink hits issues that leave the
cluster unusable, so the proposal covers only those cases.
You're right that it covers only job main class errors and JobManager runtime
failures (for example, while the JobManager writes metadata to another
system such as ABFS, S3, ...); regular job failures will not be covered.

FLIP-304 is mainly used to provide failure enrichers for job failures.
Since this FLIP is mainly for Flink JobManager failures, let us know if we
can leverage the strengths of both: extend FLIP-304 and add our plugin
implementation to cover job-level issues (by implementing the FailureEnricher
interface and processFailure() to propagate this info to
/dev/termination-log, so that the container status reports it for Flink on
K8s), and use this FLIP proposal for generic Flink cluster
(JobManager/cluster) failures.
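
To make the job-level idea concrete, here is a minimal sketch of an enricher
that writes a failure summary to the termination log. It is only an
illustration under assumptions: SimpleFailureEnricher is a hypothetical
stand-in defined here so the snippet compiles on its own (the real FLIP-304
FailureEnricher, as I understand it, also takes a Context argument and
declares output keys), TerminationLogEnricher is a made-up name, and the
4096-byte cap is my reading of the Kubernetes termination message limit.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

/** Hypothetical stand-in for the enricher shape; NOT Flink's actual interface. */
interface SimpleFailureEnricher {
    CompletableFuture<Map<String, String>> processFailure(Throwable cause);
}

/** Sketch: writes a truncated failure summary to a termination-log file. */
final class TerminationLogEnricher implements SimpleFailureEnricher {
    /** Assumed cap on the termination message size, in bytes. */
    static final int MAX_BYTES = 4096;

    private final Path terminationLog;

    TerminationLogEnricher(Path terminationLog) {
        this.terminationLog = terminationLog;
    }

    /** Renders the stack trace and trims it to at most MAX_BYTES of UTF-8. */
    static byte[] summarize(Throwable cause) {
        StringWriter buffer = new StringWriter();
        cause.printStackTrace(new PrintWriter(buffer, true));
        byte[] bytes = buffer.toString().getBytes(StandardCharsets.UTF_8);
        int len = Math.min(bytes.length, MAX_BYTES);
        // Back off so the cut never lands inside a multi-byte UTF-8 sequence.
        while (len > 0 && len < bytes.length && (bytes[len] & 0xC0) == 0x80) {
            len--;
        }
        byte[] out = new byte[len];
        System.arraycopy(bytes, 0, out, 0, len);
        return out;
    }

    @Override
    public CompletableFuture<Map<String, String>> processFailure(Throwable cause) {
        try {
            Files.write(
                    terminationLog,
                    summarize(cause),
                    StandardOpenOption.CREATE,
                    StandardOpenOption.TRUNCATE_EXISTING,
                    StandardOpenOption.WRITE);
        } catch (IOException e) {
            // Reporting is best effort; it must not mask the original failure.
            System.err.println("Could not write termination log: " + e);
        }
        // This sketch attaches no labels to the failure.
        return CompletableFuture.completedFuture(Collections.emptyMap());
    }
}
```

On K8s the path passed in would be /dev/termination-log (or whatever the
container's terminationMessagePath is set to); taking it as a constructor
argument keeps the sketch testable outside a pod.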

Regards,
Swathi C

On Thu, Apr 18, 2024 at 7:36 PM Ahmed Hamdy <hamdy10...@gmail.com> wrote:

> Hi Swathi!
> Thanks for the proposal.
> Could you please elaborate on what this FLIP offers beyond FLIP-304 [1]?
> FLIP-304 proposes a pluggable mechanism for enriching job failures. If I am
> not mistaken, this proposal looks like a subset of it.
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-304%3A+Pluggable+Failure+Enrichers
>
> Best Regards
> Ahmed Hamdy
>
>
> On Thu, 18 Apr 2024 at 08:23, Gyula Fóra <gyula.f...@gmail.com> wrote:
>
> > Hi Swathi!
> >
> > Thank you for creating this proposal. I really like the general idea of
> > increasing the K8s native observability of Flink job errors.
> >
> > I took a quick look at your reference PR, the termination log related logic
> > is contained completely in the ClusterEntrypoint. What type of errors will
> > this actually cover?
> >
> > To me this seems to cover only:
> >  - Job main class errors (i.e. startup errors)
> >  - JobManager failures
> >
> > Would regular job errors (that cause only job failover but not JM errors)
> > be reported somehow with this plugin?
> >
> > Thanks
> > Gyula
> >
> > On Tue, Apr 16, 2024 at 8:21 AM Swathi C <swathi.c.apa...@gmail.com>
> > wrote:
> >
> > > Hi All,
> > >
> > > I would like to start a discussion on FLIP-XXX : [Plugin] Enhancing Flink
> > > Failure Management in Kubernetes with Dynamic Termination Log Integration.
> > >
> > > https://docs.google.com/document/d/1tWR0Fi3w7VQeD_9VUORh8EEOva3q-V0XhymTkNaXHOc/edit?usp=sharing
> > >
> > > This FLIP proposes an improvement plugin. It focuses mainly on Flink on
> > > K8s but can be used as a generic plugin and extended with further
> > > enhancements.
> > >
> > > Looking forward to everyone's feedback and suggestions. Thank you !!
> > >
> > > Best Regards,
> > > Swathi Chandrashekar
> > >
> >
>
