Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD

Lajith Koova Tue, 24 Sep 2024 20:55:31 -0700

Hi David,

Sorry for confusion,


Regarding two points you mentioned below , please find the response

  *   I thought we have 1 condition per Flink job state, so I assume we
have one true condition and potentially other historical false ones.
          Right, we will have one condition per Job state , but not
maintain the historical state in status.condition

  *   When you say during transition, are you thinking of some small time
window between states. I am not sure what you are saying here.
         No. Not a small window between states , I meant by when the Flink
Job state transitioned from one state to another state. I have made an
update to FLIP to avoid further confusion.

Thanks
Lajith K
On Mon, Sep 23, 2024 at 3:46 PM David Radley <[email protected]>
wrote:

> Hi Lajith,
> The updated document is much more detailed and looks good. As you say the
> only situation that is not handled currently is when there are multiple
> Flink jobs running in Application Mode.
>
> As discussed , you are looking to test this situation so we know how it
> will perform.
>
> When you say “During transition of Job state, there will be only one
> condition for the
> Flink Deployment in application mode.”. I am not sure I understand.
>
>   *   I thought we have 1 condition per Flink job state, so I assume we
> have one true condition and potentially other historical false ones.
>   *   When you say during transition, are you thinking of some small time
> window between states. I am not sure what you are saying here.
>
>
> Kind regards , David
>
>
> From: Lajith Koova <[email protected]>
> Date: Wednesday, 11 September 2024 at 03:01
> To: [email protected] <[email protected]>
> Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD
> Hi,
>
> Here is the updated Proposal doc
> <
> https://docs.google.com/document/d/12wlJCL_Vq2KZnABzK7OR7gAd1jZMmo0MxgXQXqtWODs/edit#heading=h.cz8x5nsncuwb
> >
> .
>
> *Summary : *
>
> Session Mode:
>
>    Status conditions will be populated with status of Job manager.
>
> Application Mode:
>
> 1. In application mode , status conditions will be populated with status of
> Job running in the cluster.
>
> 2. Each Flink Job state will have one condition associated with.
>
> 3. During transition of Job state, there will be only one condition for the
> Flink Deployment in application mode.
>
> 4. If there are multiple Jobs in application, how to handle them in
> populating the condition status?. does condition status should contain
> information about multiple jobs?.
>
> Please let me know your inputs and suggestions.
>
>
> Regards
>
> Lajith
>
>
> On Fri, Jun 7, 2024 at 10:25 AM Lajith Koova <[email protected]> wrote:
>
> > Thank you Gyula for the feedback.
> >
> > From the above proposed conditions, so will be having  two conditions as
> > below
> >
> > status:
> > conditions:
> > - type: JobReady
> > message: The Job is running
> > reason: running
> > status: 'True'
> > lastTransitionTime: ''
> > - type: ReconciliationSucceed
> > message: The resource deployment is considered to be stable and won’t be
> > rolled back
> > reason: stable
> > status: 'True'
> > lastTransitionTime: ''
> >
> >
> > Condition JobReady is derived from JobStatus  and Condition
> ReconciliationSucceed
> > derived from LifecycleState.
> >
> > Please correct me if I missed anything.
> >
> >
> > Thanks
> > Lajith K
> >
> > On Thu, May 30, 2024 at 2:23 PM Gyula Fóra <[email protected]> wrote:
> >
> >> David,
> >>
> >> The problem is exactly that ResourceLifecycleStates do not correspond to
> >> specific Job statuses (JobReady condition) in most cases. Let me give
> you
> >> a
> >> concrete example:
> >>
> >> ResourceLifecycleState.STABLE means that app/job defined in the spec has
> >> been successfully deployed and was observed running, and this spec is
> now
> >> considered to be stable (won't be rolled back). Once a resource
> >> (FlinkDeployment) reached STABLE state, it won't change unless the user
> >> changes the spec. At the same time, this doesn't really say anything
> about
> >> job health/readiness at any given future time. 10 minutes later the job
> >> can
> >> go in an unrecoverable failure loop and never reach a running status,
> the
> >> ResourceLifecycleState will remain STABLE.
> >>
> >> This is actually not a problem with the ResourceLifecycleState but more
> >> with the understanding of it. It's called ResourceLifecycleState and not
> >> JobState exactly because it refers to the upgrade/rollback/suspend etc
> >> lifecycle of the FlinkDeployment/FlinkSessionJob resource and not the
> >> underlying flink job itself.
> >>
> >> But this is a crucial detail here that we need to consider otherwise the
> >> "Ready" condition that we may create will be practically useless.
> >>
> >> This is the reason why @[email protected] <[email protected]> and
> >> I suggest separating this to at least 2 independent conditions. One
> could
> >> be the UpgradeCompleted/ReconciliationCompleted or something along these
> >> lines computed based on LifecycleState (as described in your proposal
> but
> >> with a different name). The other should be JobReady which could
> initially
> >> work based on the JobStatus.state field but ideally would be user
> >> configurable ready condition such as (job running at least 10 minutes,
> >> running and have taken checkpoints etcetc).
> >>
> >> These 2 conditions should be enough to start with and would actually
> >> provide a tangible value to users. We can probably leave out
> ClusterReady
> >> on a second thought.
> >>
> >> Cheers,
> >> Gyula
> >>
> >>
> >> On Wed, May 29, 2024 at 5:16 PM David Radley <[email protected]>
> >> wrote:
> >>
> >> > Hi Gyula,
> >> > Thank you for the quick response and confirmation we need a Flip. I am
> >> not
> >> > an expert at K8s, Lajith will answer in more detail. Some questions I
> >> had
> >> > anyway:
> >> >
> >> > I assume each of the ResourceLifecycleState do have a corresponding
> >> > jobReady status. You point out some mistakes in the table, for example
> >> that
> >> > STABLE should be NotReady; thankyou.  If we put a reason mentioning
> the
> >> > stable state, this would help us understand the jobStatus.
> >> >
> >> > I guess the jobReady is one perspective that we know is useful (with
> >> > corrected  mappings from ResourceLifecycleState and with reasons).
> Can I
> >> > check that the  2 proposed conditions would also be useful additions?
> I
> >> > assume that in your proposal  when jobReady is true, then
> >> UpgradeCompleted
> >> > condition would not be present and ClusterReady would always be true?
> I
> >> > know conditions do not need to be orthogonal, but I wanted to check
> what
> >> > your thoughts are.
> >> >
> >> >     Kind regards, David.
> >> >
> >> >
> >> >
> >> >
> >> > From: Gyula Fóra <[email protected]>
> >> > Date: Wednesday, 29 May 2024 at 15:28
> >> > To: [email protected] <[email protected]>
> >> > Cc: [email protected] <[email protected]>
> >> > Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink
> >> CRD
> >> > Hi David!
> >> >
> >> > This change definitely warrants a FLIP even if the code change is not
> >> huge,
> >> > there are quite some implications going forward.
> >> >
> >> > Looping in @[email protected] <[email protected]> for this
> >> discussion.
> >> >
> >> > I have some questions / suggestions regarding the condition's meaning
> >> and
> >> > naming.
> >> >
> >> > In your proposal you have:
> >> >  - Ready (True/False) -> This condition is intended for resources
> which
> >> are
> >> > fully ready and operational
> >> >  - Error (True) -> This condition can be used in scenarios where any
> >> > exception/error during resource reconcile process
> >> >
> >> > The problem with the above is that the implementation does not well
> >> reflect
> >> > this. ResourceLifecycleState STABLE/ROLLED_BACK does not actually mean
> >> the
> >> > job is running, it just means that the resource is fully reconciled
> and
> >> it
> >> > will not be rolled back (so the current pending upgrade is completed).
> >> This
> >> > is mainly a fault of the ResourceLifecycleState as it doesn't capture
> >> the
> >> > job status but one could argue that it was "designed" this way.
> >> >
> >> > I think we should probably have more condition types to capture the
> >> > difference:
> >> >  - JobReady (True/False) -> Flink job is running (Basically job status
> >> but
> >> > with transition time)
> >> >  - ClusterReady (True/False) -> Session / Application cluster is
> >> deployed
> >> > (Basically JM deployment status but with transition time)
> >> > -  UpgradeCompleted (True/False) -> Similar to what you call Ready now
> >> > which should correspond to the STABLE/ROLLED_BACK states and mostly
> >> tracks
> >> > in-progress CR updates
> >> >
> >> > This is my best idea at the moment, not great as it feels a little
> >> > redundant with the current status fields. But maybe thats not a
> problem
> >> or
> >> > a way to eliminate the old fields later?
> >> >
> >> > I am not so sure of the Error status and what this means in practice.
> >> Why
> >> > do we want to track the last error in 2 places? It's already in the
> >> status.
> >> >
> >> > What do you think?
> >> > Gyula
> >> >
> >> > On Wed, May 29, 2024 at 3:55 PM David Radley <[email protected]
> >
> >> > wrote:
> >> >
> >> > > Hi,
> >> > > Thanks Lajith for raising this discussion thread under the Flip
> title.
> >> > >
> >> > > To summarise the concerns from the other discussion thread.
> >> > >
> >> > > “
> >> > > - I echo Gyula that including some examples and further explanations
> >> > might
> >> > > ease reader's work. With the current version, the FLIP is a bit hard
> >> to
> >> > > follow. - Will the usage of Conditions be enabled by default? Or
> will
> >> > there
> >> > > be any disadvantages for Flink users? If Conditions with the same
> type
> >> > > already exist in the Status Conditions
> >> > >
> >> > > - Do you think we should have clear rules about handling rules for
> how
> >> > > these Conditions should be managed, especially when multiple
> >> Conditions
> >> > of
> >> > > the same type are present? For example, resource has multiple causes
> >> for
> >> > > the same condition (e.g., Error due to network and Error due to
> I/O).
> >> > Then,
> >> > > overriding the old condition with the new one is not the best
> approach
> >> > no?
> >> > > Please correct me if I misunderstood.
> >> > > “
> >> > >
> >> > > I see the Google doc link has been reformatted to match the Flip
> >> > template.
> >> > >
> >> > > To explicitly answer the questions from Jeyhun and Gyula:
> >> > > - “Will the usage of Conditions be enabled by default?” Yes, but
> this
> >> is
> >> > > just making the status content useful, whereas before it was not
> >> useful.
> >> > > - in terms of examples, I am not sure what you would like to see,
> the
> >> > > table Lajith provided shows the status for various
> >> > ResourceLifecycleStates.
> >> > > How the operator gets into these states is the current behaviour.
> The
> >> > > change just shows the appropriate corresponding high level status –
> >> that
> >> > > could be shown on the User Interfaces.
> >> > > - “will there be any disadvantages for Flink users?” None , there is
> >> just
> >> > > more information in the status, without this it is more difficult to
> >> work
> >> > > out the status of the job.
> >> > > - Multiple conditions question. The status is showing whether the
> job
> >> is
> >> > > ready or not, so as long as the last condition is the one that is
> >> shown -
> >> > > all is as expected. I don’t think this needs rules for precedence
> and
> >> the
> >> > > like.
> >> > > - The condition’s Reason is going to be more specific.
> >> > >
> >> > > Gyula and Jeyhun, is the google doc clear enough for you now? Do you
> >> feel
> >> > > you feedback has been addressed? Lajith and I are happy to provide
> >> more
> >> > > details.
> >> > >
> >> > > I wonder whether this change is big enough to warrant a Flip, as it
> >> is so
> >> > > small. We could do this in an issue. WDYT?
> >> > >
> >> > > Kind regards, David.
> >> > >
> >> > >
> >> > > From: Lajith Koova <[email protected]>
> >> > > Date: Wednesday, 29 May 2024 at 13:41
> >> > > To: [email protected] <[email protected]>
> >> > > Subject: [EXTERNAL] [DISCUSS] FLIP-XXX Add K8S conditions to Flink
> CRD
> >> > > Hello ,
> >> > >
> >> > >
> >> > > Discussion thread here:
> >> > > https://lists.apache.org/thread/dvy8w17pyjv68c3t962w49frl9odoz4z
> to
> >> > > discuss a proposal to add Conditions field in the CR status of Flink
> >> > > Deployment and FlinkSessionJob.
> >> > >
> >> > >
> >> > > Note : Starting this new thread as discussion thread title has been
> >> > > modified to follow the FLIP process.
> >> > >
> >> > >
> >> > > Thank you.
> >> > >
> >> > > Unless otherwise stated above:
> >> > >
> >> > > IBM United Kingdom Limited
> >> > > Registered in England and Wales with number 741598
> >> > > Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6
> >> 3AU
> >> > >
> >> >
> >> > Unless otherwise stated above:
> >> >
> >> > IBM United Kingdom Limited
> >> > Registered in England and Wales with number 741598
> >> > Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6
> 3AU
> >> >
> >>
> >
>
> Unless otherwise stated above:
>
> IBM United Kingdom Limited
> Registered in England and Wales with number 741598
> Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 3AU
>

Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD

Reply via email to