Hi David, Sorry for confusion,
Regarding two points you mentioned below , please find the response * I thought we have 1 condition per Flink job state, so I assume we have one true condition and potentially other historical false ones. Right, we will have one condition per Job state , but not maintain the historical state in status.condition * When you say during transition, are you thinking of some small time window between states. I am not sure what you are saying here. No. Not a small window between states , I meant by when the Flink Job state transitioned from one state to another state. I have made an update to FLIP to avoid further confusion. Thanks Lajith K On Mon, Sep 23, 2024 at 3:46 PM David Radley <david_rad...@uk.ibm.com> wrote: > Hi Lajith, > The updated document is much more detailed and looks good. As you say the > only situation that is not handled currently is when there are multiple > Flink jobs running in Application Mode. > > As discussed , you are looking to test this situation so we know how it > will perform. > > When you say “During transition of Job state, there will be only one > condition for the > Flink Deployment in application mode.”. I am not sure I understand. > > * I thought we have 1 condition per Flink job state, so I assume we > have one true condition and potentially other historical false ones. > * When you say during transition, are you thinking of some small time > window between states. I am not sure what you are saying here. > > > Kind regards , David > > > From: Lajith Koova <lajith...@gmail.com> > Date: Wednesday, 11 September 2024 at 03:01 > To: dev@flink.apache.org <dev@flink.apache.org> > Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD > Hi, > > Here is the updated Proposal doc > < > https://docs.google.com/document/d/12wlJCL_Vq2KZnABzK7OR7gAd1jZMmo0MxgXQXqtWODs/edit#heading=h.cz8x5nsncuwb > > > . > > *Summary : * > > Session Mode: > > Status conditions will be populated with status of Job manager. > > Application Mode: > > 1. In application mode , status conditions will be populated with status of > Job running in the cluster. > > 2. Each Flink Job state will have one condition associated with. > > 3. During transition of Job state, there will be only one condition for the > Flink Deployment in application mode. > > 4. If there are multiple Jobs in application, how to handle them in > populating the condition status?. does condition status should contain > information about multiple jobs?. > > Please let me know your inputs and suggestions. > > > Regards > > Lajith > > > On Fri, Jun 7, 2024 at 10:25 AM Lajith Koova <lajith...@gmail.com> wrote: > > > Thank you Gyula for the feedback. > > > > From the above proposed conditions, so will be having two conditions as > > below > > > > status: > > conditions: > > - type: JobReady > > message: The Job is running > > reason: running > > status: 'True' > > lastTransitionTime: '' > > - type: ReconciliationSucceed > > message: The resource deployment is considered to be stable and won’t be > > rolled back > > reason: stable > > status: 'True' > > lastTransitionTime: '' > > > > > > Condition JobReady is derived from JobStatus and Condition > ReconciliationSucceed > > derived from LifecycleState. > > > > Please correct me if I missed anything. > > > > > > Thanks > > Lajith K > > > > On Thu, May 30, 2024 at 2:23 PM Gyula Fóra <gyula.f...@gmail.com> wrote: > > > >> David, > >> > >> The problem is exactly that ResourceLifecycleStates do not correspond to > >> specific Job statuses (JobReady condition) in most cases. Let me give > you > >> a > >> concrete example: > >> > >> ResourceLifecycleState.STABLE means that app/job defined in the spec has > >> been successfully deployed and was observed running, and this spec is > now > >> considered to be stable (won't be rolled back). Once a resource > >> (FlinkDeployment) reached STABLE state, it won't change unless the user > >> changes the spec. At the same time, this doesn't really say anything > about > >> job health/readiness at any given future time. 10 minutes later the job > >> can > >> go in an unrecoverable failure loop and never reach a running status, > the > >> ResourceLifecycleState will remain STABLE. > >> > >> This is actually not a problem with the ResourceLifecycleState but more > >> with the understanding of it. It's called ResourceLifecycleState and not > >> JobState exactly because it refers to the upgrade/rollback/suspend etc > >> lifecycle of the FlinkDeployment/FlinkSessionJob resource and not the > >> underlying flink job itself. > >> > >> But this is a crucial detail here that we need to consider otherwise the > >> "Ready" condition that we may create will be practically useless. > >> > >> This is the reason why @morh...@apache.org <morh...@apache.org> and > >> I suggest separating this to at least 2 independent conditions. One > could > >> be the UpgradeCompleted/ReconciliationCompleted or something along these > >> lines computed based on LifecycleState (as described in your proposal > but > >> with a different name). The other should be JobReady which could > initially > >> work based on the JobStatus.state field but ideally would be user > >> configurable ready condition such as (job running at least 10 minutes, > >> running and have taken checkpoints etcetc). > >> > >> These 2 conditions should be enough to start with and would actually > >> provide a tangible value to users. We can probably leave out > ClusterReady > >> on a second thought. > >> > >> Cheers, > >> Gyula > >> > >> > >> On Wed, May 29, 2024 at 5:16 PM David Radley <david_rad...@uk.ibm.com> > >> wrote: > >> > >> > Hi Gyula, > >> > Thank you for the quick response and confirmation we need a Flip. I am > >> not > >> > an expert at K8s, Lajith will answer in more detail. Some questions I > >> had > >> > anyway: > >> > > >> > I assume each of the ResourceLifecycleState do have a corresponding > >> > jobReady status. You point out some mistakes in the table, for example > >> that > >> > STABLE should be NotReady; thankyou. If we put a reason mentioning > the > >> > stable state, this would help us understand the jobStatus. > >> > > >> > I guess the jobReady is one perspective that we know is useful (with > >> > corrected mappings from ResourceLifecycleState and with reasons). > Can I > >> > check that the 2 proposed conditions would also be useful additions? > I > >> > assume that in your proposal when jobReady is true, then > >> UpgradeCompleted > >> > condition would not be present and ClusterReady would always be true? > I > >> > know conditions do not need to be orthogonal, but I wanted to check > what > >> > your thoughts are. > >> > > >> > Kind regards, David. > >> > > >> > > >> > > >> > > >> > From: Gyula Fóra <gyula.f...@gmail.com> > >> > Date: Wednesday, 29 May 2024 at 15:28 > >> > To: dev@flink.apache.org <dev@flink.apache.org> > >> > Cc: morh...@apache.org <morh...@apache.org> > >> > Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink > >> CRD > >> > Hi David! > >> > > >> > This change definitely warrants a FLIP even if the code change is not > >> huge, > >> > there are quite some implications going forward. > >> > > >> > Looping in @morh...@apache.org <morh...@apache.org> for this > >> discussion. > >> > > >> > I have some questions / suggestions regarding the condition's meaning > >> and > >> > naming. > >> > > >> > In your proposal you have: > >> > - Ready (True/False) -> This condition is intended for resources > which > >> are > >> > fully ready and operational > >> > - Error (True) -> This condition can be used in scenarios where any > >> > exception/error during resource reconcile process > >> > > >> > The problem with the above is that the implementation does not well > >> reflect > >> > this. ResourceLifecycleState STABLE/ROLLED_BACK does not actually mean > >> the > >> > job is running, it just means that the resource is fully reconciled > and > >> it > >> > will not be rolled back (so the current pending upgrade is completed). > >> This > >> > is mainly a fault of the ResourceLifecycleState as it doesn't capture > >> the > >> > job status but one could argue that it was "designed" this way. > >> > > >> > I think we should probably have more condition types to capture the > >> > difference: > >> > - JobReady (True/False) -> Flink job is running (Basically job status > >> but > >> > with transition time) > >> > - ClusterReady (True/False) -> Session / Application cluster is > >> deployed > >> > (Basically JM deployment status but with transition time) > >> > - UpgradeCompleted (True/False) -> Similar to what you call Ready now > >> > which should correspond to the STABLE/ROLLED_BACK states and mostly > >> tracks > >> > in-progress CR updates > >> > > >> > This is my best idea at the moment, not great as it feels a little > >> > redundant with the current status fields. But maybe thats not a > problem > >> or > >> > a way to eliminate the old fields later? > >> > > >> > I am not so sure of the Error status and what this means in practice. > >> Why > >> > do we want to track the last error in 2 places? It's already in the > >> status. > >> > > >> > What do you think? > >> > Gyula > >> > > >> > On Wed, May 29, 2024 at 3:55 PM David Radley <david_rad...@uk.ibm.com > > > >> > wrote: > >> > > >> > > Hi, > >> > > Thanks Lajith for raising this discussion thread under the Flip > title. > >> > > > >> > > To summarise the concerns from the other discussion thread. > >> > > > >> > > “ > >> > > - I echo Gyula that including some examples and further explanations > >> > might > >> > > ease reader's work. With the current version, the FLIP is a bit hard > >> to > >> > > follow. - Will the usage of Conditions be enabled by default? Or > will > >> > there > >> > > be any disadvantages for Flink users? If Conditions with the same > type > >> > > already exist in the Status Conditions > >> > > > >> > > - Do you think we should have clear rules about handling rules for > how > >> > > these Conditions should be managed, especially when multiple > >> Conditions > >> > of > >> > > the same type are present? For example, resource has multiple causes > >> for > >> > > the same condition (e.g., Error due to network and Error due to > I/O). > >> > Then, > >> > > overriding the old condition with the new one is not the best > approach > >> > no? > >> > > Please correct me if I misunderstood. > >> > > “ > >> > > > >> > > I see the Google doc link has been reformatted to match the Flip > >> > template. > >> > > > >> > > To explicitly answer the questions from Jeyhun and Gyula: > >> > > - “Will the usage of Conditions be enabled by default?” Yes, but > this > >> is > >> > > just making the status content useful, whereas before it was not > >> useful. > >> > > - in terms of examples, I am not sure what you would like to see, > the > >> > > table Lajith provided shows the status for various > >> > ResourceLifecycleStates. > >> > > How the operator gets into these states is the current behaviour. > The > >> > > change just shows the appropriate corresponding high level status – > >> that > >> > > could be shown on the User Interfaces. > >> > > - “will there be any disadvantages for Flink users?” None , there is > >> just > >> > > more information in the status, without this it is more difficult to > >> work > >> > > out the status of the job. > >> > > - Multiple conditions question. The status is showing whether the > job > >> is > >> > > ready or not, so as long as the last condition is the one that is > >> shown - > >> > > all is as expected. I don’t think this needs rules for precedence > and > >> the > >> > > like. > >> > > - The condition’s Reason is going to be more specific. > >> > > > >> > > Gyula and Jeyhun, is the google doc clear enough for you now? Do you > >> feel > >> > > you feedback has been addressed? Lajith and I are happy to provide > >> more > >> > > details. > >> > > > >> > > I wonder whether this change is big enough to warrant a Flip, as it > >> is so > >> > > small. We could do this in an issue. WDYT? > >> > > > >> > > Kind regards, David. > >> > > > >> > > > >> > > From: Lajith Koova <lajith...@gmail.com> > >> > > Date: Wednesday, 29 May 2024 at 13:41 > >> > > To: dev@flink.apache.org <dev@flink.apache.org> > >> > > Subject: [EXTERNAL] [DISCUSS] FLIP-XXX Add K8S conditions to Flink > CRD > >> > > Hello , > >> > > > >> > > > >> > > Discussion thread here: > >> > > https://lists.apache.org/thread/dvy8w17pyjv68c3t962w49frl9odoz4z > to > >> > > discuss a proposal to add Conditions field in the CR status of Flink > >> > > Deployment and FlinkSessionJob. > >> > > > >> > > > >> > > Note : Starting this new thread as discussion thread title has been > >> > > modified to follow the FLIP process. > >> > > > >> > > > >> > > Thank you. > >> > > > >> > > Unless otherwise stated above: > >> > > > >> > > IBM United Kingdom Limited > >> > > Registered in England and Wales with number 741598 > >> > > Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 > >> 3AU > >> > > > >> > > >> > Unless otherwise stated above: > >> > > >> > IBM United Kingdom Limited > >> > Registered in England and Wales with number 741598 > >> > Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 > 3AU > >> > > >> > > > > Unless otherwise stated above: > > IBM United Kingdom Limited > Registered in England and Wales with number 741598 > Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 3AU >