Re: [ DISCUSS ] FLIP-XXX : [Plugin] Enhancing Flink Failure Management in Kubernetes with Dynamic Termination Log Integration

ramkrishna vasudevan Tue, 23 Apr 2024 02:42:20 -0700

Hi Gyula and Ahmed,

I totally agree that there is an overlap in the final goal that both
FLIPs are working towards, and in fact FLIP-304 is more comprehensive for
job failures.


But as a proposal to move forward, can we make Swathi's FLIP/JIRA a
sub-task of FLIP-304 and continue with the PR, since the main aim is to get
cluster failures pushed to the termination log for K8s-based deployments?
And once that is completed, can we work on making FLIP-304 support job
failure propagation to the termination log?
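
To make the termination-log step concrete, here is a minimal sketch of
the write itself. It is not taken from the reference PR or from Flink:
the class name TerminationLogWriter and its methods are hypothetical, and
the wiring into ClusterEntrypoint or into a FLIP-304 FailureEnricher is
left out. It assumes the Kubernetes default terminationMessagePath
(/dev/termination-log) and the documented 4096-byte limit on the
termination message.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Flink or the PR.
final class TerminationLogWriter {

    // Kubernetes default for terminationMessagePath.
    static final Path DEFAULT_PATH = Paths.get("/dev/termination-log");

    // Kubernetes truncates termination messages longer than 4096 bytes,
    // so the sketch cuts the message itself to keep the start of the trace.
    static final int MAX_BYTES = 4096;

    private TerminationLogWriter() {}

    // Writes the stack trace of `cause` to `path`, replacing any old content.
    static void write(Path path, Throwable cause) throws IOException {
        StringWriter buffer = new StringWriter();
        PrintWriter printer = new PrintWriter(buffer);
        cause.printStackTrace(printer);
        printer.flush();

        byte[] bytes = buffer.toString().getBytes(StandardCharsets.UTF_8);
        if (bytes.length > MAX_BYTES) {
            // Byte-level cut; may split a multi-byte character at the end.
            bytes = Arrays.copyOf(bytes, MAX_BYTES);
        }
        Files.write(path, bytes);
    }

    public static void main(String[] args) throws IOException {
        // Outside a pod, write to a temp file instead of DEFAULT_PATH.
        Path path = args.length > 0
                ? Paths.get(args[0])
                : Files.createTempFile("termination", ".log");
        write(path, new IllegalStateException("JobManager failed to start"));
        System.out.println("wrote " + Files.size(path) + " bytes to " + path);
    }
}
```

The same helper could be called from a cluster-level failure handler
today and, later, from a processFailure() implementation for job-level
failures, which is the split suggested above.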

Regards
Ram

On Thu, Apr 18, 2024 at 10:07 PM Swathi C <swathi.c.apa...@gmail.com> wrote:

> Hi Gyula and Ahmed,
>
> Thanks for reviewing this.
>
> @gyula.f...@gmail.com <gyula.f...@gmail.com>, our aim in this FLIP was
> only to fail the cluster when the JobManager/Flink has issues that leave
> the cluster unusable, so the proposal covers only that.
> You're right that it covers only job main class errors, JobManager
> runtime failures, and cases where the JobManager wants to write metadata
> to another system (ABFS, S3, ...); job failures will not be covered.
>
> FLIP-304 is mainly used to provide failure enrichers for job failures.
> Since this FLIP is mainly for Flink JobManager failures, let us know if
> we can leverage the strengths of both: extend FLIP-304 and add our
> plugin implementation to cover job-level issues (propagating this info
> to /dev/termination-log, so that the container status reports it for
> Flink on K8s, by implementing the FailureEnricher interface and its
> processFailure() method), and use this FLIP proposal for generic Flink
> cluster (JobManager/cluster) failures.
>
> Regards,
> Swathi C
>
> On Thu, Apr 18, 2024 at 7:36 PM Ahmed Hamdy <hamdy10...@gmail.com> wrote:
>
> > Hi Swathi!
> > Thanks for the proposal.
> > Could you please elaborate on what this FLIP offers beyond FLIP-304 [1]?
> > FLIP-304 proposes a pluggable mechanism for enriching job failures. If I
> > am not mistaken, this proposal looks like a subset of it.
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-304%3A+Pluggable+Failure+Enrichers
> >
> > Best Regards
> > Ahmed Hamdy
> >
> >
> > On Thu, 18 Apr 2024 at 08:23, Gyula Fóra <gyula.f...@gmail.com> wrote:
> >
> > > Hi Swathi!
> > >
> > > Thank you for creating this proposal. I really like the general idea of
> > > increasing the K8s native observability of Flink job errors.
> > >
> > > I took a quick look at your reference PR; the termination-log-related
> > > logic is contained completely in the ClusterEntrypoint. What type of
> > > errors will this actually cover?
> > >
> > > To me this seems to cover only:
> > >  - Job main class errors (i.e. startup errors)
> > >  - JobManager failures
> > >
> > > Would regular job errors (that cause only job failover but not JM
> > > errors) be reported somehow with this plugin?
> > >
> > > Thanks
> > > Gyula
> > >
> > > On Tue, Apr 16, 2024 at 8:21 AM Swathi C <swathi.c.apa...@gmail.com>
> > > wrote:
> > >
> > > > Hi All,
> > > >
> > > > I would like to start a discussion on FLIP-XXX : [Plugin] Enhancing
> > > > Flink Failure Management in Kubernetes with Dynamic Termination Log
> > > > Integration.
> > > >
> > > >
> > > > https://docs.google.com/document/d/1tWR0Fi3w7VQeD_9VUORh8EEOva3q-V0XhymTkNaXHOc/edit?usp=sharing
> > > >
> > > > This FLIP proposes an improvement plugin that focuses mainly on Flink
> > > > on K8s, but it can be used as a generic plugin and extended with
> > > > further enhancements.
> > > >
> > > > Looking forward to everyone's feedback and suggestions. Thank you !!
> > > >
> > > > Best Regards,
> > > > Swathi Chandrashekar
> > > >
> > >
> >
>
