Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD

Lajith Koova Thu, 06 Jun 2024 21:55:37 -0700

Thank you Gyula for the feedback.

>From the above proposed conditions, so will be having  two conditions as
below


status:
conditions:
- type: JobReady
message: The Job is running
reason: running
status: 'True'
lastTransitionTime: ''
- type: ReconciliationSucceed
message: The resource deployment is considered to be stable and won’t be
rolled back
reason: stable
status: 'True'
lastTransitionTime: ''


Condition JobReady is derived from JobStatus  and Condition
ReconciliationSucceed
derived from LifecycleState.

Please correct me if I missed anything.


Thanks
Lajith K

On Thu, May 30, 2024 at 2:23 PM Gyula Fóra <gyula.f...@gmail.com> wrote:

> David,
>
> The problem is exactly that ResourceLifecycleStates do not correspond to
> specific Job statuses (JobReady condition) in most cases. Let me give you a
> concrete example:
>
> ResourceLifecycleState.STABLE means that app/job defined in the spec has
> been successfully deployed and was observed running, and this spec is now
> considered to be stable (won't be rolled back). Once a resource
> (FlinkDeployment) reached STABLE state, it won't change unless the user
> changes the spec. At the same time, this doesn't really say anything about
> job health/readiness at any given future time. 10 minutes later the job can
> go in an unrecoverable failure loop and never reach a running status, the
> ResourceLifecycleState will remain STABLE.
>
> This is actually not a problem with the ResourceLifecycleState but more
> with the understanding of it. It's called ResourceLifecycleState and not
> JobState exactly because it refers to the upgrade/rollback/suspend etc
> lifecycle of the FlinkDeployment/FlinkSessionJob resource and not the
> underlying flink job itself.
>
> But this is a crucial detail here that we need to consider otherwise the
> "Ready" condition that we may create will be practically useless.
>
> This is the reason why @morh...@apache.org <morh...@apache.org> and
> I suggest separating this to at least 2 independent conditions. One could
> be the UpgradeCompleted/ReconciliationCompleted or something along these
> lines computed based on LifecycleState (as described in your proposal but
> with a different name). The other should be JobReady which could initially
> work based on the JobStatus.state field but ideally would be user
> configurable ready condition such as (job running at least 10 minutes,
> running and have taken checkpoints etcetc).
>
> These 2 conditions should be enough to start with and would actually
> provide a tangible value to users. We can probably leave out ClusterReady
> on a second thought.
>
> Cheers,
> Gyula
>
>
> On Wed, May 29, 2024 at 5:16 PM David Radley <david_rad...@uk.ibm.com>
> wrote:
>
> > Hi Gyula,
> > Thank you for the quick response and confirmation we need a Flip. I am
> not
> > an expert at K8s, Lajith will answer in more detail. Some questions I had
> > anyway:
> >
> > I assume each of the ResourceLifecycleState do have a corresponding
> > jobReady status. You point out some mistakes in the table, for example
> that
> > STABLE should be NotReady; thankyou.  If we put a reason mentioning the
> > stable state, this would help us understand the jobStatus.
> >
> > I guess the jobReady is one perspective that we know is useful (with
> > corrected  mappings from ResourceLifecycleState and with reasons). Can I
> > check that the  2 proposed conditions would also be useful additions? I
> > assume that in your proposal  when jobReady is true, then
> UpgradeCompleted
> > condition would not be present and ClusterReady would always be true? I
> > know conditions do not need to be orthogonal, but I wanted to check what
> > your thoughts are.
> >
> >     Kind regards, David.
> >
> >
> >
> >
> > From: Gyula Fóra <gyula.f...@gmail.com>
> > Date: Wednesday, 29 May 2024 at 15:28
> > To: dev@flink.apache.org <dev@flink.apache.org>
> > Cc: morh...@apache.org <morh...@apache.org>
> > Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink
> CRD
> > Hi David!
> >
> > This change definitely warrants a FLIP even if the code change is not
> huge,
> > there are quite some implications going forward.
> >
> > Looping in @morh...@apache.org <morh...@apache.org> for this discussion.
> >
> > I have some questions / suggestions regarding the condition's meaning and
> > naming.
> >
> > In your proposal you have:
> >  - Ready (True/False) -> This condition is intended for resources which
> are
> > fully ready and operational
> >  - Error (True) -> This condition can be used in scenarios where any
> > exception/error during resource reconcile process
> >
> > The problem with the above is that the implementation does not well
> reflect
> > this. ResourceLifecycleState STABLE/ROLLED_BACK does not actually mean
> the
> > job is running, it just means that the resource is fully reconciled and
> it
> > will not be rolled back (so the current pending upgrade is completed).
> This
> > is mainly a fault of the ResourceLifecycleState as it doesn't capture the
> > job status but one could argue that it was "designed" this way.
> >
> > I think we should probably have more condition types to capture the
> > difference:
> >  - JobReady (True/False) -> Flink job is running (Basically job status
> but
> > with transition time)
> >  - ClusterReady (True/False) -> Session / Application cluster is deployed
> > (Basically JM deployment status but with transition time)
> > -  UpgradeCompleted (True/False) -> Similar to what you call Ready now
> > which should correspond to the STABLE/ROLLED_BACK states and mostly
> tracks
> > in-progress CR updates
> >
> > This is my best idea at the moment, not great as it feels a little
> > redundant with the current status fields. But maybe thats not a problem
> or
> > a way to eliminate the old fields later?
> >
> > I am not so sure of the Error status and what this means in practice. Why
> > do we want to track the last error in 2 places? It's already in the
> status.
> >
> > What do you think?
> > Gyula
> >
> > On Wed, May 29, 2024 at 3:55 PM David Radley <david_rad...@uk.ibm.com>
> > wrote:
> >
> > > Hi,
> > > Thanks Lajith for raising this discussion thread under the Flip title.
> > >
> > > To summarise the concerns from the other discussion thread.
> > >
> > > “
> > > - I echo Gyula that including some examples and further explanations
> > might
> > > ease reader's work. With the current version, the FLIP is a bit hard to
> > > follow. - Will the usage of Conditions be enabled by default? Or will
> > there
> > > be any disadvantages for Flink users? If Conditions with the same type
> > > already exist in the Status Conditions
> > >
> > > - Do you think we should have clear rules about handling rules for how
> > > these Conditions should be managed, especially when multiple Conditions
> > of
> > > the same type are present? For example, resource has multiple causes
> for
> > > the same condition (e.g., Error due to network and Error due to I/O).
> > Then,
> > > overriding the old condition with the new one is not the best approach
> > no?
> > > Please correct me if I misunderstood.
> > > “
> > >
> > > I see the Google doc link has been reformatted to match the Flip
> > template.
> > >
> > > To explicitly answer the questions from Jeyhun and Gyula:
> > > - “Will the usage of Conditions be enabled by default?” Yes, but this
> is
> > > just making the status content useful, whereas before it was not
> useful.
> > > - in terms of examples, I am not sure what you would like to see, the
> > > table Lajith provided shows the status for various
> > ResourceLifecycleStates.
> > > How the operator gets into these states is the current behaviour. The
> > > change just shows the appropriate corresponding high level status –
> that
> > > could be shown on the User Interfaces.
> > > - “will there be any disadvantages for Flink users?” None , there is
> just
> > > more information in the status, without this it is more difficult to
> work
> > > out the status of the job.
> > > - Multiple conditions question. The status is showing whether the job
> is
> > > ready or not, so as long as the last condition is the one that is
> shown -
> > > all is as expected. I don’t think this needs rules for precedence and
> the
> > > like.
> > > - The condition’s Reason is going to be more specific.
> > >
> > > Gyula and Jeyhun, is the google doc clear enough for you now? Do you
> feel
> > > you feedback has been addressed? Lajith and I are happy to provide more
> > > details.
> > >
> > > I wonder whether this change is big enough to warrant a Flip, as it is
> so
> > > small. We could do this in an issue. WDYT?
> > >
> > > Kind regards, David.
> > >
> > >
> > > From: Lajith Koova <lajith...@gmail.com>
> > > Date: Wednesday, 29 May 2024 at 13:41
> > > To: dev@flink.apache.org <dev@flink.apache.org>
> > > Subject: [EXTERNAL] [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD
> > > Hello ,
> > >
> > >
> > > Discussion thread here:
> > > https://lists.apache.org/thread/dvy8w17pyjv68c3t962w49frl9odoz4z  to
> > > discuss a proposal to add Conditions field in the CR status of Flink
> > > Deployment and FlinkSessionJob.
> > >
> > >
> > > Note : Starting this new thread as discussion thread title has been
> > > modified to follow the FLIP process.
> > >
> > >
> > > Thank you.
> > >
> > > Unless otherwise stated above:
> > >
> > > IBM United Kingdom Limited
> > > Registered in England and Wales with number 741598
> > > Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 3AU
> > >
> >
> > Unless otherwise stated above:
> >
> > IBM United Kingdom Limited
> > Registered in England and Wales with number 741598
> > Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 3AU
> >
>

Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD

Reply via email to