Hi Ferenc,

Could you please help create a FLIP page if everything looks fine in the above discussion and Google doc? Thanks in advance.
Regards
Lajith K

On Thu, Nov 7, 2024 at 12:02 PM Lajith Koova <lajith...@gmail.com> wrote:

> Thank you so much, Ferenc. Regarding your observation: the FLIP covers
> the changes in the FlinkDeployment CR, which defines Flink Application and
> Session cluster deployments, hence it is referred to as FlinkDeployment.
> Thank you
>
> Regards
> Lajith
>
> On Thu, Nov 7, 2024 at 12:46 AM Ferenc Csaky <ferenc.cs...@pm.me.invalid>
> wrote:
>
>> Hi,
>>
>> I can help to create a FLIP page from the gdoc, but one thing
>> that I noticed is that under "Session mode" both the text and the code
>> snippets refer to "FlinkDeployment". I believe that should be
>> "FlinkSessionJob".
>>
>> Best,
>> Ferenc
>>
>> On Wednesday, November 6th, 2024 at 17:33, David Radley
>> <david_rad...@uk.ibm.com> wrote:
>>
>> > Hi Lajith,
>> > Yes, I like the simplicity of the current proposal.
>> >
>> > Hi Gyula,
>> > The next stage is to assign a FLIP number and move the content of the
>> > Google doc into the FLIP wiki. Unfortunately, as we are not committers, we
>> > are not authorized to do either of these activities. Are you able to copy
>> > this over, or get another committer to do so, please, so we can get this
>> > moving?
>> >
>> > Kind regards, David.
>> >
>> > From: Lajith Koova lajith...@gmail.com
>> > Date: Monday, 14 October 2024 at 08:52
>> > To: dev@flink.apache.org dev@flink.apache.org
>> > Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD
>> >
>> > Thank you all for the valuable feedback.
>> >
>> > Following the procedure outlined on the Flink Improvement Proposal
>> > Confluence page [1], we kindly ask the PMC/Committers to transfer the
>> > content from the "Add K8S conditions to CRD's Status" document [2] and
>> > assign a FLIP Number for us, which we will use for voting.
>> >
>> > [1]
>> > https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals#FlinkImprovementProposals-Process
>> > [2]
>> > https://docs.google.com/document/d/12wlJCL_Vq2KZnABzK7OR7gAd1jZMmo0MxgXQXqtWODs/edit?tab=t.0
>> >
>> > Thanks
>> > Lajith
>> >
>> > On Mon, Sep 23, 2024 at 11:54 PM Gyula Fóra gyula.f...@gmail.com wrote:
>> >
>> > > Hey!
>> > >
>> > > I think the proposal is now simple enough:
>> > > - Running condition for Applications / SessionJobs
>> > > - Ready condition for Session clusters
>> > >
>> > > From my side, I think we should formalize this into a FLIP page and
>> > > start the vote on it.
>> > > The next step to consider is having an independent condition that captures
>> > > the upgrade process itself (whether a resource is fully upgraded / reconciled).
>> > >
>> > > Cheers,
>> > > Gyula
>> > >
>> > > On Mon, Sep 23, 2024 at 12:16 PM David Radley david_rad...@uk.ibm.com
>> > > wrote:
>> > >
>> > > > Hi Lajith,
>> > > > The updated document is much more detailed and looks good. As you say, the
>> > > > only situation that is not handled currently is when there are multiple
>> > > > Flink jobs running in Application Mode.
>> > > >
>> > > > As discussed, you are looking to test this situation so we know how it
>> > > > will perform.
>> > > >
>> > > > When you say “During transition of Job state, there will be only one
>> > > > condition for the Flink Deployment in application mode.”, I am not sure
>> > > > I understand.
>> > > >
>> > > > * I thought we have 1 condition per Flink job state, so I assume we
>> > > > have one true condition and potentially other historical false ones.
>> > > > * When you say "during transition", are you thinking of some small time
>> > > > window between states? I am not sure what you are saying here.
>> > > >
>> > > > Kind regards, David
>> > > >
>> > > > From: Lajith Koova lajith...@gmail.com
>> > > > Date: Wednesday, 11 September 2024 at 03:01
>> > > > To: dev@flink.apache.org dev@flink.apache.org
>> > > > Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD
>> > > >
>> > > > Hi,
>> > > >
>> > > > Here is the updated proposal doc:
>> > > > <https://docs.google.com/document/d/12wlJCL_Vq2KZnABzK7OR7gAd1jZMmo0MxgXQXqtWODs/edit#heading=h.cz8x5nsncuwb>
>> > > >
>> > > > *Summary:*
>> > > >
>> > > > Session Mode:
>> > > >
>> > > > Status conditions will be populated with the status of the Job manager.
>> > > >
>> > > > Application Mode:
>> > > >
>> > > > 1. In application mode, status conditions will be populated with the
>> > > > status of the Job running in the cluster.
>> > > > 2. Each Flink Job state will have one condition associated with it.
>> > > > 3. During transition of Job state, there will be only one condition for
>> > > > the Flink Deployment in application mode.
>> > > > 4. If there are multiple Jobs in an application, how should they be
>> > > > handled when populating the condition status? Should the condition
>> > > > status contain information about multiple jobs?
>> > > >
>> > > > Please let me know your inputs and suggestions.
>> > > >
>> > > > Regards
>> > > > Lajith
>> > > >
>> > > > On Fri, Jun 7, 2024 at 10:25 AM Lajith Koova lajith...@gmail.com
>> > > > wrote:
>> > > >
>> > > > > Thank you Gyula for the feedback.
>> > > > >
>> > > > > From the above proposed conditions, we will have two conditions, as below:
>> > > > >
>> > > > > status:
>> > > > >   conditions:
>> > > > >     - type: JobReady
>> > > > >       message: The Job is running
>> > > > >       reason: running
>> > > > >       status: 'True'
>> > > > >       lastTransitionTime: ''
>> > > > >     - type: ReconciliationSucceed
>> > > > >       message: The resource deployment is considered to be stable and won’t be rolled back
>> > > > >       reason: stable
>> > > > >       status: 'True'
>> > > > >       lastTransitionTime: ''
>> > > > >
>> > > > > Condition JobReady is derived from JobStatus, and condition
>> > > > > ReconciliationSucceed is derived from LifecycleState.
>> > > > >
>> > > > > Please correct me if I missed anything.
>> > > > >
>> > > > > Thanks
>> > > > > Lajith K
>> > > > >
>> > > > > On Thu, May 30, 2024 at 2:23 PM Gyula Fóra gyula.f...@gmail.com
>> > > > > wrote:
>> > > > >
>> > > > > > David,
>> > > > > >
>> > > > > > The problem is exactly that ResourceLifecycleStates do not correspond
>> > > > > > to specific Job statuses (JobReady condition) in most cases. Let me give
>> > > > > > you a concrete example:
>> > > > > >
>> > > > > > ResourceLifecycleState.STABLE means that the app/job defined in the spec
>> > > > > > has been successfully deployed and was observed running, and this spec is
>> > > > > > now considered to be stable (won't be rolled back). Once a resource
>> > > > > > (FlinkDeployment) has reached the STABLE state, it won't change unless the
>> > > > > > user changes the spec. At the same time, this doesn't really say anything
>> > > > > > about job health/readiness at any given future time.
>> > > > > > 10 minutes later the job can go into an unrecoverable failure loop and
>> > > > > > never reach a running status, yet the ResourceLifecycleState will remain
>> > > > > > STABLE.
>> > > > > >
>> > > > > > This is actually not a problem with the ResourceLifecycleState but more
>> > > > > > with the understanding of it. It's called ResourceLifecycleState and not
>> > > > > > JobState exactly because it refers to the upgrade/rollback/suspend etc.
>> > > > > > lifecycle of the FlinkDeployment/FlinkSessionJob resource and not the
>> > > > > > underlying Flink job itself.
>> > > > > >
>> > > > > > But this is a crucial detail here that we need to consider, otherwise
>> > > > > > the "Ready" condition that we may create will be practically useless.
>> > > > > >
>> > > > > > This is the reason why @morh...@apache.org morh...@apache.org and
>> > > > > > I suggest separating this into at least 2 independent conditions. One
>> > > > > > could be UpgradeCompleted/ReconciliationCompleted or something along
>> > > > > > these lines, computed based on LifecycleState (as described in your
>> > > > > > proposal but with a different name). The other should be JobReady, which
>> > > > > > could initially work based on the JobStatus.state field but ideally would
>> > > > > > be a user-configurable ready condition (such as job running for at least
>> > > > > > 10 minutes, running and having taken checkpoints, etc.).
>> > > > > >
>> > > > > > These 2 conditions should be enough to start with and would actually
>> > > > > > provide tangible value to users. We can probably leave out ClusterReady
>> > > > > > on second thought.
>> > > > > >
>> > > > > > Cheers,
>> > > > > > Gyula
>> > > > > >
>> > > > > > On Wed, May 29, 2024 at 5:16 PM David Radley <david_rad...@uk.ibm.com>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Hi Gyula,
>> > > > > > > Thank you for the quick response and confirmation that we need a FLIP.
>> > > > > > > I am not an expert at K8s; Lajith will answer in more detail. Some
>> > > > > > > questions I had anyway:
>> > > > > > >
>> > > > > > > I assume each of the ResourceLifecycleStates does have a corresponding
>> > > > > > > jobReady status. You point out some mistakes in the table, for example
>> > > > > > > that STABLE should be NotReady; thank you. If we put a reason mentioning
>> > > > > > > the stable state, this would help us understand the jobStatus.
>> > > > > > >
>> > > > > > > I guess jobReady is one perspective that we know is useful (with
>> > > > > > > corrected mappings from ResourceLifecycleState and with reasons).
>> > > > > > > Can I check that the 2 proposed conditions would also be useful
>> > > > > > > additions? I assume that in your proposal, when jobReady is true, the
>> > > > > > > UpgradeCompleted condition would not be present and ClusterReady would
>> > > > > > > always be true? I know conditions do not need to be orthogonal, but I
>> > > > > > > wanted to check what your thoughts are.
>> > > > > > >
>> > > > > > > Kind regards, David.
>> > > > > > >
>> > > > > > > From: Gyula Fóra gyula.f...@gmail.com
>> > > > > > > Date: Wednesday, 29 May 2024 at 15:28
>> > > > > > > To: dev@flink.apache.org dev@flink.apache.org
>> > > > > > > Cc: morh...@apache.org morh...@apache.org
>> > > > > > > Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD
>> > > > > > >
>> > > > > > > Hi David!
>> > > > > > >
>> > > > > > > This change definitely warrants a FLIP: even if the code change is not
>> > > > > > > huge, there are quite some implications going forward.
>> > > > > > >
>> > > > > > > Looping in @morh...@apache.org morh...@apache.org for this discussion.
>> > > > > > >
>> > > > > > > I have some questions / suggestions regarding the conditions' meaning
>> > > > > > > and naming.
>> > > > > > >
>> > > > > > > In your proposal you have:
>> > > > > > > - Ready (True/False) -> This condition is intended for resources which
>> > > > > > > are fully ready and operational
>> > > > > > > - Error (True) -> This condition can be used in scenarios where any
>> > > > > > > exception/error occurs during the resource reconcile process
>> > > > > > >
>> > > > > > > The problem with the above is that the implementation does not reflect
>> > > > > > > this well. ResourceLifecycleState STABLE/ROLLED_BACK does not actually
>> > > > > > > mean the job is running, it just means that the resource is fully
>> > > > > > > reconciled and it will not be rolled back (so the current pending
>> > > > > > > upgrade is completed). This is mainly a fault of the
>> > > > > > > ResourceLifecycleState as it doesn't capture the job status, but one
>> > > > > > > could argue that it was "designed" this way.
>> > > > > > >
>> > > > > > > I think we should probably have more condition types to capture the
>> > > > > > > difference:
>> > > > > > > - JobReady (True/False) -> Flink job is running (Basically job status
>> > > > > > > but with transition time)
>> > > > > > > - ClusterReady (True/False) -> Session / Application cluster is deployed
>> > > > > > > (Basically JM deployment status but with transition time)
>> > > > > > > - UpgradeCompleted (True/False) -> Similar to what you call Ready now,
>> > > > > > > which should correspond to the STABLE/ROLLED_BACK states and mostly
>> > > > > > > tracks in-progress CR updates
>> > > > > > >
>> > > > > > > This is my best idea at the moment, not great as it feels a little
>> > > > > > > redundant with the current status fields. But maybe that's not a
>> > > > > > > problem, or a way to eliminate the old fields later?
>> > > > > > >
>> > > > > > > I am not so sure about the Error status and what this means in practice.
>> > > > > > > Why do we want to track the last error in 2 places? It's already in the
>> > > > > > > status.
>> > > > > > >
>> > > > > > > What do you think?
>> > > > > > > Gyula
>> > > > > > >
>> > > > > > > On Wed, May 29, 2024 at 3:55 PM David Radley <david_rad...@uk.ibm.com>
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > Hi,
>> > > > > > > > Thanks Lajith for raising this discussion thread under the FLIP title.
>> > > > > > > >
>> > > > > > > > To summarise the concerns from the other discussion thread:
>> > > > > > > >
>> > > > > > > > “
>> > > > > > > > - I echo Gyula that including some examples and further explanations
>> > > > > > > > might ease reader's work.
>> > > > > > > > With the current version, the FLIP is a bit hard to follow.
>> > > > > > > > - Will the usage of Conditions be enabled by default? Or will there
>> > > > > > > > be any disadvantages for Flink users? If Conditions with the same type
>> > > > > > > > already exist in the Status Conditions
>> > > > > > > >
>> > > > > > > > - Do you think we should have clear rules about handling rules for how
>> > > > > > > > these Conditions should be managed, especially when multiple Conditions
>> > > > > > > > of the same type are present? For example, resource has multiple causes
>> > > > > > > > for the same condition (e.g., Error due to network and Error due to I/O).
>> > > > > > > > Then, overriding the old condition with the new one is not the best
>> > > > > > > > approach no?
>> > > > > > > > Please correct me if I misunderstood.
>> > > > > > > > ”
>> > > > > > > >
>> > > > > > > > I see the Google doc link has been reformatted to match the FLIP template.
>> > > > > > > >
>> > > > > > > > To explicitly answer the questions from Jeyhun and Gyula:
>> > > > > > > > - “Will the usage of Conditions be enabled by default?” Yes, but this
>> > > > > > > > is just making the status content useful, whereas before it was not
>> > > > > > > > useful.
>> > > > > > > > - In terms of examples, I am not sure what you would like to see; the
>> > > > > > > > table Lajith provided shows the status for various
>> > > > > > > > ResourceLifecycleStates.
>> > > > > > > > How the operator gets into these states is the current behaviour.
>> > > > > > > > The change just shows the appropriate corresponding high-level status,
>> > > > > > > > which could be shown in user interfaces.
>> > > > > > > > - “will there be any disadvantages for Flink users?” None; there is
>> > > > > > > > just more information in the status. Without this, it is more difficult
>> > > > > > > > to work out the status of the job.
>> > > > > > > > - Multiple conditions question: the status shows whether the job
>> > > > > > > > is ready or not, so as long as the last condition is the one that is
>> > > > > > > > shown, all is as expected. I don’t think this needs rules for precedence
>> > > > > > > > and the like.
>> > > > > > > > - The condition’s Reason is going to be more specific.
>> > > > > > > >
>> > > > > > > > Gyula and Jeyhun, is the Google doc clear enough for you now? Do you
>> > > > > > > > feel your feedback has been addressed? Lajith and I are happy to provide
>> > > > > > > > more details.
>> > > > > > > >
>> > > > > > > > I wonder whether this change is big enough to warrant a FLIP, as it
>> > > > > > > > is so small. We could do this in an issue. WDYT?
>> > > > > > > >
>> > > > > > > > Kind regards, David.
>> > > > > > > >
>> > > > > > > > From: Lajith Koova lajith...@gmail.com
>> > > > > > > > Date: Wednesday, 29 May 2024 at 13:41
>> > > > > > > > To: dev@flink.apache.org dev@flink.apache.org
>> > > > > > > > Subject: [EXTERNAL] [DISCUSS] FLIP-XXX Add K8S conditions to Flink CRD
>> > > > > > > >
>> > > > > > > > Hello,
>> > > > > > > >
>> > > > > > > > Discussion thread here:
>> > > > > > > > https://lists.apache.org/thread/dvy8w17pyjv68c3t962w49frl9odoz4z
>> > > > > > > > to discuss a proposal to add a Conditions field to the CR status of
>> > > > > > > > FlinkDeployment and FlinkSessionJob.
>> > > > > > > >
>> > > > > > > > Note: Starting this new thread as the discussion thread title has
>> > > > > > > > been modified to follow the FLIP process.
>> > > > > > > >
>> > > > > > > > Thank you.
>> > > > > > > >
>> > > > > > > > Unless otherwise stated above:
>> > > > > > > >
>> > > > > > > > IBM United Kingdom Limited
>> > > > > > > > Registered in England and Wales with number 741598
>> > > > > > > > Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 3AU
>> >
>> > Unless otherwise stated above:
>> >
>> > IBM United Kingdom Limited
>> > Registered in England and Wales with number 741598
>> > Registered office: Building C, IBM Hursley Office, Hursley Park Road,
>> > Winchester, Hampshire SO21 2JN
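
For readers following the thread, here is a minimal sketch of the two-condition idea discussed above: JobReady derived only from the job state, and ReconciliationSucceed derived only from the resource lifecycle state. It is written in Python for brevity and is not the operator's code or API. The `Condition` class and the `job_ready`, `reconciliation_succeed`, `upsert`, and `observe` helpers are hypothetical names, the timestamps are made up, and keeping `lastTransitionTime` unchanged when the status does not flip is an assumption based on the usual Kubernetes convention rather than something the thread specifies.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Condition:
    # Hypothetical stand-in for one entry in status.conditions.
    type: str
    status: str                # "True" or "False", Kubernetes-style
    reason: str
    message: str
    last_transition_time: str


def job_ready(job_state: str, now: str) -> Condition:
    """Derive JobReady from the Flink job state only."""
    running = job_state == "RUNNING"
    return Condition(
        type="JobReady",
        status="True" if running else "False",
        reason=job_state.lower(),
        message=f"The Job is {job_state.lower()}",
        last_transition_time=now,
    )


def reconciliation_succeed(lifecycle_state: str, now: str) -> Condition:
    """Derive ReconciliationSucceed from the resource lifecycle state only."""
    done = lifecycle_state in ("STABLE", "ROLLED_BACK")
    return Condition(
        type="ReconciliationSucceed",
        status="True" if done else "False",
        reason=lifecycle_state.lower(),
        message=f"Resource lifecycle state is {lifecycle_state}",
        last_transition_time=now,
    )


def upsert(conditions: list[Condition], new: Condition) -> list[Condition]:
    """Keep one condition per type; keep the old timestamp if status is unchanged."""
    result, found = [], False
    for old in conditions:
        if old.type != new.type:
            result.append(old)
            continue
        found = True
        if old.status == new.status:
            # Assumption: the transition time only moves when the status flips.
            new = replace(new, last_transition_time=old.last_transition_time)
        result.append(new)
    if not found:
        result.append(new)
    return result


def observe(conditions, job_state, lifecycle_state, now):
    """Recompute both conditions independently from one observation."""
    conditions = upsert(conditions, job_ready(job_state, now))
    return upsert(conditions, reconciliation_succeed(lifecycle_state, now))


# Scenario from the thread: the resource stays STABLE while the job later fails.
conds = observe([], "RUNNING", "STABLE", "2024-05-30T12:00:00Z")
conds = observe(conds, "FAILING", "STABLE", "2024-05-30T12:10:00Z")
for c in conds:
    print(c.type, c.status, c.reason, c.last_transition_time)
```

In this scenario JobReady flips to False at the second observation while ReconciliationSucceed stays True with its original transition time, which is the distinction Gyula draws between job health and the resource lifecycle.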