Thank you Gyula for the feedback. >From the above proposed conditions, so will be having two conditions as below
status: conditions: - type: JobReady message: The Job is running reason: running status: 'True' lastTransitionTime: '' - type: ReconciliationSucceed message: The resource deployment is considered to be stable and won’t be rolled back reason: stable status: 'True' lastTransitionTime: '' Condition JobReady is derived from JobStatus and Condition ReconciliationSucceed derived from LifecycleState. Please correct me if I missed anything. Thanks Lajith K On Thu, May 30, 2024 at 2:23 PM Gyula Fóra <gyula.f...@gmail.com> wrote: > David, > > The problem is exactly that ResourceLifecycleStates do not correspond to > specific Job statuses (JobReady condition) in most cases. Let me give you a > concrete example: > > ResourceLifecycleState.STABLE means that app/job defined in the spec has > been successfully deployed and was observed running, and this spec is now > considered to be stable (won't be rolled back). Once a resource > (FlinkDeployment) reached STABLE state, it won't change unless the user > changes the spec. At the same time, this doesn't really say anything about > job health/readiness at any given future time. 10 minutes later the job can > go in an unrecoverable failure loop and never reach a running status, the > ResourceLifecycleState will remain STABLE. > > This is actually not a problem with the ResourceLifecycleState but more > with the understanding of it. It's called ResourceLifecycleState and not > JobState exactly because it refers to the upgrade/rollback/suspend etc > lifecycle of the FlinkDeployment/FlinkSessionJob resource and not the > underlying flink job itself. > > But this is a crucial detail here that we need to consider otherwise the > "Ready" condition that we may create will be practically useless. > > This is the reason why @morh...@apache.org <morh...@apache.org> and > I suggest separating this to at least 2 independent conditions. One could > be the UpgradeCompleted/ReconciliationCompleted or something along these > lines computed based on LifecycleState (as described in your proposal but > with a different name). The other should be JobReady which could initially > work based on the JobStatus.state field but ideally would be user > configurable ready condition such as (job running at least 10 minutes, > running and have taken checkpoints etcetc). > > These 2 conditions should be enough to start with and would actually > provide a tangible value to users. We can probably leave out ClusterReady > on a second thought. > > Cheers, > Gyula > > > On Wed, May 29, 2024 at 5:16 PM David Radley <david_rad...@uk.ibm.com> > wrote: > > > Hi Gyula, > > Thank you for the quick response and confirmation we need a Flip. I am > not > > an expert at K8s, Lajith will answer in more detail. Some questions I had > > anyway: > > > > I assume each of the ResourceLifecycleState do have a corresponding > > jobReady status. You point out some mistakes in the table, for example > that > > STABLE should be NotReady; thankyou. If we put a reason mentioning the > > stable state, this would help us understand the jobStatus. > > > > I guess the jobReady is one perspective that we know is useful (with > > corrected mappings from ResourceLifecycleState and with reasons). Can I > > check that the 2 proposed conditions would also be useful additions? I > > assume that in your proposal when jobReady is true, then > UpgradeCompleted > > condition would not be present and ClusterReady would always be true? I > > know conditions do not need to be orthogonal, but I wanted to check what > > your thoughts are. > > > > Kind regards, David. > > > > > > > > > > From: Gyula Fóra <gyula.f...@gmail.com> > > Date: Wednesday, 29 May 2024 at 15:28 > > To: dev@flink.apache.org <dev@flink.apache.org> > > Cc: morh...@apache.org <morh...@apache.org> > > Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink > CRD > > Hi David! > > > > This change definitely warrants a FLIP even if the code change is not > huge, > > there are quite some implications going forward. > > > > Looping in @morh...@apache.org <morh...@apache.org> for this discussion. > > > > I have some questions / suggestions regarding the condition's meaning and > > naming. > > > > In your proposal you have: > > - Ready (True/False) -> This condition is intended for resources which > are > > fully ready and operational > > - Error (True) -> This condition can be used in scenarios where any > > exception/error during resource reconcile process > > > > The problem with the above is that the implementation does not well > reflect > > this. ResourceLifecycleState STABLE/ROLLED_BACK does not actually mean > the > > job is running, it just means that the resource is fully reconciled and > it > > will not be rolled back (so the current pending upgrade is completed). > This > > is mainly a fault of the ResourceLifecycleState as it doesn't capture the > > job status but one could argue that it was "designed" this way. > > > > I think we should probably have more condition types to capture the > > difference: > > - JobReady (True/False) -> Flink job is running (Basically job status > but > > with transition time) > > - ClusterReady (True/False) -> Session / Application cluster is deployed > > (Basically JM deployment status but with transition time) > > - UpgradeCompleted (True/False) -> Similar to what you call Ready now > > which should correspond to the STABLE/ROLLED_BACK states and mostly > tracks > > in-progress CR updates > > > > This is my best idea at the moment, not great as it feels a little > > redundant with the current status fields. But maybe thats not a problem > or > > a way to eliminate the old fields later? > > > > I am not so sure of the Error status and what this means in practice. Why > > do we want to track the last error in 2 places? It's already in the > status. > > > > What do you think? > > Gyula > > > > On Wed, May 29, 2024 at 3:55 PM David Radley <david_rad...@uk.ibm.com> > > wrote: > > > > > Hi, > > > Thanks Lajith for raising this discussion thread under the Flip title. > > > > > > To summarise the concerns from the other discussion thread. > > > > > > “ > > > - I echo Gyula that including some examples and further explanations > > might > > > ease reader's work. With the current version, the FLIP is a bit hard to > > > follow. - Will the usage of Conditions be enabled by default? Or will > > there > > > be any disadvantages for Flink users? If Conditions with the same type > > > already exist in the Status Conditions > > > > > > - Do you think we should have clear rules about handling rules for how > > > these Conditions should be managed, especially when multiple Conditions > > of > > > the same type are present? For example, resource has multiple causes > for > > > the same condition (e.g., Error due to network and Error due to I/O). > > Then, > > > overriding the old condition with the new one is not the best approach > > no? > > > Please correct me if I misunderstood. > > > “ > > > > > > I see the Google doc link has been reformatted to match the Flip > > template. > > > > > > To explicitly answer the questions from Jeyhun and Gyula: > > > - “Will the usage of Conditions be enabled by default?” Yes, but this > is > > > just making the status content useful, whereas before it was not > useful. > > > - in terms of examples, I am not sure what you would like to see, the > > > table Lajith provided shows the status for various > > ResourceLifecycleStates. > > > How the operator gets into these states is the current behaviour. The > > > change just shows the appropriate corresponding high level status – > that > > > could be shown on the User Interfaces. > > > - “will there be any disadvantages for Flink users?” None , there is > just > > > more information in the status, without this it is more difficult to > work > > > out the status of the job. > > > - Multiple conditions question. The status is showing whether the job > is > > > ready or not, so as long as the last condition is the one that is > shown - > > > all is as expected. I don’t think this needs rules for precedence and > the > > > like. > > > - The condition’s Reason is going to be more specific. > > > > > > Gyula and Jeyhun, is the google doc clear enough for you now? Do you > feel > > > you feedback has been addressed? Lajith and I are happy to provide more > > > details. > > > > > > I wonder whether this change is big enough to warrant a Flip, as it is > so > > > small. We could do this in an issue. WDYT? > > > > > > Kind regards, David. > > > > > > > > > From: Lajith Koova <lajith...@gmail.com> > > > Date: Wednesday, 29 May 2024 at 13:41 > > > To: dev@flink.apache.org <dev@flink.apache.org> > > > Subject: [EXTERNAL] [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD > > > Hello , > > > > > > > > > Discussion thread here: > > > https://lists.apache.org/thread/dvy8w17pyjv68c3t962w49frl9odoz4z to > > > discuss a proposal to add Conditions field in the CR status of Flink > > > Deployment and FlinkSessionJob. > > > > > > > > > Note : Starting this new thread as discussion thread title has been > > > modified to follow the FLIP process. > > > > > > > > > Thank you. > > > > > > Unless otherwise stated above: > > > > > > IBM United Kingdom Limited > > > Registered in England and Wales with number 741598 > > > Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 3AU > > > > > > > Unless otherwise stated above: > > > > IBM United Kingdom Limited > > Registered in England and Wales with number 741598 > > Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 3AU > > >