Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD

Lajith Koova Mon, 14 Oct 2024 00:52:36 -0700

Thank you all for the valuable feedback .


Following the procedure outlined on the Flink Improvement Proposal

Confluence page [1], we kindly ask the PMC/Committers to transfer the

content from the Add K8S conditions to CRD's Status [2] and assign a

FLIP Number for us, which we will use for voting.


[1]

https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals#FlinkImprovementProposals-Process

[2]

https://docs.google.com/document/d/12wlJCL_Vq2KZnABzK7OR7gAd1jZMmo0MxgXQXqtWODs/edit?tab=t.0


Thanks

Lajith

On Mon, Sep 23, 2024 at 11:54 PM Gyula Fóra <[email protected]> wrote:

> Hey!
>
> I think the proposal is now simple enough :
>  - Running condition for Applications / SessionJobs
>  - Ready condition for Session clusters
>
> I think we should formalize this into a Flip page and start the vote on
> this from my side.
> The next step to consider is having an independent condition that captures
> the upgrade process itself (if a resource is fully upgraded / reconciled)
>
> Cheers,
> Gyula
>
> On Mon, Sep 23, 2024 at 12:16 PM David Radley <[email protected]>
> wrote:
>
> > Hi Lajith,
> > The updated document is much more detailed and looks good. As you say the
> > only situation that is not handled currently is when there are multiple
> > Flink jobs running in Application Mode.
> >
> > As discussed , you are looking to test this situation so we know how it
> > will perform.
> >
> > When you say “During transition of Job state, there will be only one
> > condition for the
> > Flink Deployment in application mode.”. I am not sure I understand.
> >
> >   *   I thought we have 1 condition per Flink job state, so I assume we
> > have one true condition and potentially other historical false ones.
> >   *   When you say during transition, are you thinking of some small time
> > window between states. I am not sure what you are saying here.
> >
> >
> > Kind regards , David
> >
> >
> > From: Lajith Koova <[email protected]>
> > Date: Wednesday, 11 September 2024 at 03:01
> > To: [email protected] <[email protected]>
> > Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink
> CRD
> > Hi,
> >
> > Here is the updated Proposal doc
> > <
> >
> https://docs.google.com/document/d/12wlJCL_Vq2KZnABzK7OR7gAd1jZMmo0MxgXQXqtWODs/edit#heading=h.cz8x5nsncuwb
> > >
> > .
> >
> > *Summary : *
> >
> > Session Mode:
> >
> >    Status conditions will be populated with status of Job manager.
> >
> > Application Mode:
> >
> > 1. In application mode , status conditions will be populated with status
> of
> > Job running in the cluster.
> >
> > 2. Each Flink Job state will have one condition associated with.
> >
> > 3. During transition of Job state, there will be only one condition for
> the
> > Flink Deployment in application mode.
> >
> > 4. If there are multiple Jobs in application, how to handle them in
> > populating the condition status?. does condition status should contain
> > information about multiple jobs?.
> >
> > Please let me know your inputs and suggestions.
> >
> >
> > Regards
> >
> > Lajith
> >
> >
> > On Fri, Jun 7, 2024 at 10:25 AM Lajith Koova <[email protected]>
> wrote:
> >
> > > Thank you Gyula for the feedback.
> > >
> > > From the above proposed conditions, so will be having  two conditions
> as
> > > below
> > >
> > > status:
> > > conditions:
> > > - type: JobReady
> > > message: The Job is running
> > > reason: running
> > > status: 'True'
> > > lastTransitionTime: ''
> > > - type: ReconciliationSucceed
> > > message: The resource deployment is considered to be stable and won’t
> be
> > > rolled back
> > > reason: stable
> > > status: 'True'
> > > lastTransitionTime: ''
> > >
> > >
> > > Condition JobReady is derived from JobStatus  and Condition
> > ReconciliationSucceed
> > > derived from LifecycleState.
> > >
> > > Please correct me if I missed anything.
> > >
> > >
> > > Thanks
> > > Lajith K
> > >
> > > On Thu, May 30, 2024 at 2:23 PM Gyula Fóra <[email protected]>
> wrote:
> > >
> > >> David,
> > >>
> > >> The problem is exactly that ResourceLifecycleStates do not correspond
> to
> > >> specific Job statuses (JobReady condition) in most cases. Let me give
> > you
> > >> a
> > >> concrete example:
> > >>
> > >> ResourceLifecycleState.STABLE means that app/job defined in the spec
> has
> > >> been successfully deployed and was observed running, and this spec is
> > now
> > >> considered to be stable (won't be rolled back). Once a resource
> > >> (FlinkDeployment) reached STABLE state, it won't change unless the
> user
> > >> changes the spec. At the same time, this doesn't really say anything
> > about
> > >> job health/readiness at any given future time. 10 minutes later the
> job
> > >> can
> > >> go in an unrecoverable failure loop and never reach a running status,
> > the
> > >> ResourceLifecycleState will remain STABLE.
> > >>
> > >> This is actually not a problem with the ResourceLifecycleState but
> more
> > >> with the understanding of it. It's called ResourceLifecycleState and
> not
> > >> JobState exactly because it refers to the upgrade/rollback/suspend etc
> > >> lifecycle of the FlinkDeployment/FlinkSessionJob resource and not the
> > >> underlying flink job itself.
> > >>
> > >> But this is a crucial detail here that we need to consider otherwise
> the
> > >> "Ready" condition that we may create will be practically useless.
> > >>
> > >> This is the reason why @[email protected] <[email protected]> and
> > >> I suggest separating this to at least 2 independent conditions. One
> > could
> > >> be the UpgradeCompleted/ReconciliationCompleted or something along
> these
> > >> lines computed based on LifecycleState (as described in your proposal
> > but
> > >> with a different name). The other should be JobReady which could
> > initially
> > >> work based on the JobStatus.state field but ideally would be user
> > >> configurable ready condition such as (job running at least 10 minutes,
> > >> running and have taken checkpoints etcetc).
> > >>
> > >> These 2 conditions should be enough to start with and would actually
> > >> provide a tangible value to users. We can probably leave out
> > ClusterReady
> > >> on a second thought.
> > >>
> > >> Cheers,
> > >> Gyula
> > >>
> > >>
> > >> On Wed, May 29, 2024 at 5:16 PM David Radley <[email protected]
> >
> > >> wrote:
> > >>
> > >> > Hi Gyula,
> > >> > Thank you for the quick response and confirmation we need a Flip. I
> am
> > >> not
> > >> > an expert at K8s, Lajith will answer in more detail. Some questions
> I
> > >> had
> > >> > anyway:
> > >> >
> > >> > I assume each of the ResourceLifecycleState do have a corresponding
> > >> > jobReady status. You point out some mistakes in the table, for
> example
> > >> that
> > >> > STABLE should be NotReady; thankyou.  If we put a reason mentioning
> > the
> > >> > stable state, this would help us understand the jobStatus.
> > >> >
> > >> > I guess the jobReady is one perspective that we know is useful (with
> > >> > corrected  mappings from ResourceLifecycleState and with reasons).
> > Can I
> > >> > check that the  2 proposed conditions would also be useful
> additions?
> > I
> > >> > assume that in your proposal  when jobReady is true, then
> > >> UpgradeCompleted
> > >> > condition would not be present and ClusterReady would always be
> true?
> > I
> > >> > know conditions do not need to be orthogonal, but I wanted to check
> > what
> > >> > your thoughts are.
> > >> >
> > >> >     Kind regards, David.
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > From: Gyula Fóra <[email protected]>
> > >> > Date: Wednesday, 29 May 2024 at 15:28
> > >> > To: [email protected] <[email protected]>
> > >> > Cc: [email protected] <[email protected]>
> > >> > Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX Add K8S conditions to
> Flink
> > >> CRD
> > >> > Hi David!
> > >> >
> > >> > This change definitely warrants a FLIP even if the code change is
> not
> > >> huge,
> > >> > there are quite some implications going forward.
> > >> >
> > >> > Looping in @[email protected] <[email protected]> for this
> > >> discussion.
> > >> >
> > >> > I have some questions / suggestions regarding the condition's
> meaning
> > >> and
> > >> > naming.
> > >> >
> > >> > In your proposal you have:
> > >> >  - Ready (True/False) -> This condition is intended for resources
> > which
> > >> are
> > >> > fully ready and operational
> > >> >  - Error (True) -> This condition can be used in scenarios where any
> > >> > exception/error during resource reconcile process
> > >> >
> > >> > The problem with the above is that the implementation does not well
> > >> reflect
> > >> > this. ResourceLifecycleState STABLE/ROLLED_BACK does not actually
> mean
> > >> the
> > >> > job is running, it just means that the resource is fully reconciled
> > and
> > >> it
> > >> > will not be rolled back (so the current pending upgrade is
> completed).
> > >> This
> > >> > is mainly a fault of the ResourceLifecycleState as it doesn't
> capture
> > >> the
> > >> > job status but one could argue that it was "designed" this way.
> > >> >
> > >> > I think we should probably have more condition types to capture the
> > >> > difference:
> > >> >  - JobReady (True/False) -> Flink job is running (Basically job
> status
> > >> but
> > >> > with transition time)
> > >> >  - ClusterReady (True/False) -> Session / Application cluster is
> > >> deployed
> > >> > (Basically JM deployment status but with transition time)
> > >> > -  UpgradeCompleted (True/False) -> Similar to what you call Ready
> now
> > >> > which should correspond to the STABLE/ROLLED_BACK states and mostly
> > >> tracks
> > >> > in-progress CR updates
> > >> >
> > >> > This is my best idea at the moment, not great as it feels a little
> > >> > redundant with the current status fields. But maybe thats not a
> > problem
> > >> or
> > >> > a way to eliminate the old fields later?
> > >> >
> > >> > I am not so sure of the Error status and what this means in
> practice.
> > >> Why
> > >> > do we want to track the last error in 2 places? It's already in the
> > >> status.
> > >> >
> > >> > What do you think?
> > >> > Gyula
> > >> >
> > >> > On Wed, May 29, 2024 at 3:55 PM David Radley <
> [email protected]
> > >
> > >> > wrote:
> > >> >
> > >> > > Hi,
> > >> > > Thanks Lajith for raising this discussion thread under the Flip
> > title.
> > >> > >
> > >> > > To summarise the concerns from the other discussion thread.
> > >> > >
> > >> > > “
> > >> > > - I echo Gyula that including some examples and further
> explanations
> > >> > might
> > >> > > ease reader's work. With the current version, the FLIP is a bit
> hard
> > >> to
> > >> > > follow. - Will the usage of Conditions be enabled by default? Or
> > will
> > >> > there
> > >> > > be any disadvantages for Flink users? If Conditions with the same
> > type
> > >> > > already exist in the Status Conditions
> > >> > >
> > >> > > - Do you think we should have clear rules about handling rules for
> > how
> > >> > > these Conditions should be managed, especially when multiple
> > >> Conditions
> > >> > of
> > >> > > the same type are present? For example, resource has multiple
> causes
> > >> for
> > >> > > the same condition (e.g., Error due to network and Error due to
> > I/O).
> > >> > Then,
> > >> > > overriding the old condition with the new one is not the best
> > approach
> > >> > no?
> > >> > > Please correct me if I misunderstood.
> > >> > > “
> > >> > >
> > >> > > I see the Google doc link has been reformatted to match the Flip
> > >> > template.
> > >> > >
> > >> > > To explicitly answer the questions from Jeyhun and Gyula:
> > >> > > - “Will the usage of Conditions be enabled by default?” Yes, but
> > this
> > >> is
> > >> > > just making the status content useful, whereas before it was not
> > >> useful.
> > >> > > - in terms of examples, I am not sure what you would like to see,
> > the
> > >> > > table Lajith provided shows the status for various
> > >> > ResourceLifecycleStates.
> > >> > > How the operator gets into these states is the current behaviour.
> > The
> > >> > > change just shows the appropriate corresponding high level status
> –
> > >> that
> > >> > > could be shown on the User Interfaces.
> > >> > > - “will there be any disadvantages for Flink users?” None , there
> is
> > >> just
> > >> > > more information in the status, without this it is more difficult
> to
> > >> work
> > >> > > out the status of the job.
> > >> > > - Multiple conditions question. The status is showing whether the
> > job
> > >> is
> > >> > > ready or not, so as long as the last condition is the one that is
> > >> shown -
> > >> > > all is as expected. I don’t think this needs rules for precedence
> > and
> > >> the
> > >> > > like.
> > >> > > - The condition’s Reason is going to be more specific.
> > >> > >
> > >> > > Gyula and Jeyhun, is the google doc clear enough for you now? Do
> you
> > >> feel
> > >> > > you feedback has been addressed? Lajith and I are happy to provide
> > >> more
> > >> > > details.
> > >> > >
> > >> > > I wonder whether this change is big enough to warrant a Flip, as
> it
> > >> is so
> > >> > > small. We could do this in an issue. WDYT?
> > >> > >
> > >> > > Kind regards, David.
> > >> > >
> > >> > >
> > >> > > From: Lajith Koova <[email protected]>
> > >> > > Date: Wednesday, 29 May 2024 at 13:41
> > >> > > To: [email protected] <[email protected]>
> > >> > > Subject: [EXTERNAL] [DISCUSS] FLIP-XXX Add K8S conditions to Flink
> > CRD
> > >> > > Hello ,
> > >> > >
> > >> > >
> > >> > > Discussion thread here:
> > >> > > https://lists.apache.org/thread/dvy8w17pyjv68c3t962w49frl9odoz4z
> > to
> > >> > > discuss a proposal to add Conditions field in the CR status of
> Flink
> > >> > > Deployment and FlinkSessionJob.
> > >> > >
> > >> > >
> > >> > > Note : Starting this new thread as discussion thread title has
> been
> > >> > > modified to follow the FLIP process.
> > >> > >
> > >> > >
> > >> > > Thank you.
> > >> > >
> > >> > > Unless otherwise stated above:
> > >> > >
> > >> > > IBM United Kingdom Limited
> > >> > > Registered in England and Wales with number 741598
> > >> > > Registered office: PO Box 41, North Harbour, Portsmouth, Hants.
> PO6
> > >> 3AU
> > >> > >
> > >> >
> > >> > Unless otherwise stated above:
> > >> >
> > >> > IBM United Kingdom Limited
> > >> > Registered in England and Wales with number 741598
> > >> > Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6
> > 3AU
> > >> >
> > >>
> > >
> >
> > Unless otherwise stated above:
> >
> > IBM United Kingdom Limited
> > Registered in England and Wales with number 741598
> > Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 3AU
> >
>

Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD

Reply via email to