Re: [DISCUSS] FLIP-XXX: Blue/Green Deployments for Flink on Kubernetes: Phase 1 (basic)

Sergio Chong Loo Mon, 24 Mar 2025 22:51:37 -0700

Yes, thanks to both Rui and Ryan, points noted!

Rui, for now we’re keeping it simple and yes, the cluster needs double the 
resources during the transition so that both the Blue and Green jobs can run. 
To avoid being blocked when there is no room for a 2n job, we could add 
something like an override flag that instructs the controller to treat that 
particular deployment attempt as a regular FlinkDeployment. That’s a very 
reasonable request; let us analyze it.
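
To make the override idea concrete, here is a minimal sketch of how the
controller's decision could look. Everything in it is an assumption for
illustration and not part of the FLIP: the names Strategy, TransitionRequest,
allow_in_place_fallback and choose_strategy are hypothetical, and CPU simply
stands in for whatever resource the namespace or cluster runs short of.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    BLUE_GREEN = "blue_green"  # start Green next to Blue, then retire Blue
    IN_PLACE = "in_place"      # behave like a regular FlinkDeployment upgrade
    ABORT = "abort"            # keep Blue running and surface the error


@dataclass
class TransitionRequest:
    job_cpu: float   # CPU used by the running Blue job (the "n")
    free_cpu: float  # CPU still available to the namespace/cluster
    # Hypothetical override flag: opt in to a regular upgrade when 2n won't fit.
    allow_in_place_fallback: bool = False


def choose_strategy(req: TransitionRequest) -> Strategy:
    # Blue/Green needs room for a second full copy of the job (2n in total).
    if req.free_cpu >= req.job_cpu:
        return Strategy.BLUE_GREEN
    # No room for 2n: fall back to a regular upgrade only if explicitly allowed.
    if req.allow_in_place_fallback:
        return Strategy.IN_PLACE
    # Otherwise leave Blue untouched and report the failed attempt.
    return Strategy.ABORT


if __name__ == "__main__":
    print(choose_strategy(TransitionRequest(job_cpu=4.0, free_cpu=8.0)))
    print(choose_strategy(TransitionRequest(4.0, 1.0, allow_in_place_fallback=True)))
    print(choose_strategy(TransitionRequest(4.0, 1.0)))
```

Making the flag opt-in per attempt keeps the default behavior unchanged: without
it, a deployment that cannot fit twice still ends in an abort with Blue left
running.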


The idea of anticipating “unfavorable” conditions can perhaps be added to the 
follow-up FLIP-504.

- Sergio

> On Mar 24, 2025, at 7:03 PM, Rui Fan <1996fan...@gmail.com> wrote:
> 
> Thanks Sergio for the feedback, and Ryan for the valuable input!
> 
>> During the transition we check whether the resources are ready and give
>> it a timeout, we also monitor the Kubernetes events and keep track of
>> anything abnormal.
> 
> Would you mind elaborating on it? What's the default timeout? And how do
> Flink users adjust it?
> 
>> The first job continues its normal processing and the second job is left
>> untouched so it can be examined.
> 
> I have several questions about what happens after the timeout:
> 1. Do you mean that if the namespace or cluster doesn't have double the
> resources, the new deployment never runs automatically? Does it necessarily
> require human intervention?
> 2. If the job owner finds that a lack of resources is why the new deployment
> cannot run, how do users handle it manually? In other words, how do they stop
> the old deployment and keep the new one in a blue/green deployment?
> 3. Is it possible to provide an automatic mechanism to detect this case? I
> think it would greatly reduce user costs if the operator could handle it
> automatically.
> 
> Also, it would be better to add these to the FLIP wiki so that developers and
> users can understand this clearly.
> 
> Best,
> Rui
> 
> On Tue, Mar 25, 2025 at 1:21 AM Ryan van Huuksloot
> <ryan.vanhuuksl...@shopify.com.invalid> wrote:
> 
>> Hi Sergio,
>> 
>> Regarding the Kubernetes events, it would be great to tackle the
>> integration of ResourceQuota(s)
>> <https://kubernetes.io/docs/concepts/policy/resource-quotas/> into the
>> Kubernetes operator as part of this initiative. Then we would know whether
>> we have enough resources to double the footprint and perform the upgrade.
>> We'd still need to handle cluster-wide resources, but in our experience
>> we run out of quota at the namespace level much more frequently than
>> cluster-wide.
>> 
>> Great FLIP though!
>> Ryan van Huuksloot
>> Sr. Production Engineer | Streaming Platform
>> 
>> 
>> On Mon, Mar 24, 2025 at 12:51 PM Sergio Chong Loo
>> <schong...@apple.com.invalid> wrote:
>> 
>>> Hi Rui,
>>> 
>>> Great question, yes, that’s been taken into account.
>>> 
>>> During the transition we check whether the resources are ready and give it
>>> a timeout, we also monitor the Kubernetes events and keep track of anything
>>> abnormal. If the transition times out, it’s aborted, status is patched and
>>> we raise the error along with the details. The first job continues its
>>> normal processing and the second job is left untouched so it can be
>>> examined.
>>> 
>>> Hope this answers your question
>>> 
>>> Thanks,
>>> - Sergio
>>> 
>>>> On Mar 20, 2025, at 7:02 PM, Rui Fan <1996fan...@gmail.com> wrote:
>>>> 
>>>> Sorry for the late response.
>>>> 
>>>> Thanks Sergio and Gyula for driving this proposal, it's really useful
>>>> for reducing the downtime when restarting or upgrading the job.
>>>> 
>>>> I have a question about this FLIP:
>>>> As mentioned in the Event Sequence for a Blue/Green part of the FLIP,
>>>> deployment A will be deleted once B is running successfully.
>>>> 
>>>> That means one job needs double the resources during redeployment, right?
>>>> If so, do we have any timeout mechanism for when resources are insufficient?
>>>> 
>>>> For example, suppose the Kubernetes cluster or namespace doesn't have
>>>> any extra resources for now. Generally, if the old deployment A is deleted
>>>> first, then there are enough resources to start the new deployment B.
>>>> 
>>>> If deployment A is deleted only once B is running successfully, and there
>>>> are not enough resources for B, then B can never run successfully and
>>>> deployment A never stops. It's like a deadlock: A is waiting for B to run,
>>>> and B is waiting for A to release resources.
>>>> 
>>>> Introducing the timeout mechanism for A means A will still stop if B is
>>>> not running within the timeout.
>>>> 
>>>> Please correct me if my understanding is wrong, thanks~
>>>> 
>>>> Best,
>>>> Rui
>>>> 
>>>> 
>>>> On Tue, Mar 11, 2025 at 10:01 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>> 
>>>>> I think we should proceed with the vote :)
>>>>> 
>>>>> Let me start the voting thread.
>>>>> 
>>>>> 
>>>>> On Tue, Mar 11, 2025 at 2:56 PM Sergio Chong Loo
>>>>> <schong...@apple.com.invalid> wrote:
>>>>> 
>>>>>> @Gyula,
>>>>>> 
>>>>>> Thanks for the input, I also second the “blue/green” naming convention;
>>>>>> and yes, none of the colors is meant to have any meaning or purpose other
>>>>>> than distinction.
>>>>>> 
>>>>>> @Alexis,
>>>>>> 
>>>>>> Indeed, so far the proposal/doc suggests a FlinkBlueGreenDeployment CRD.
>>>>>> 
>>>>>> Sergio
>>>>>> 
>>>>>>> On Mar 6, 2025, at 9:12 AM, Alexis Sarda-Espinosa <sarda.espin...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Hi everyone,
>>>>>>> 
>>>>>>> I had also thought about this kind of functionality in the past and I'm
>>>>>>> very interested to see how it works out. I had imagined something like a
>>>>>>> FlinkContinuousDeployment as CRD, just putting it out there.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Alexis.
>>>>>>> 
>>>>>>> On Thu, 6 Mar 2025, 17:31 Gyula Fóra, <gyula.f...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Hi!
>>>>>>>> 
>>>>>>>> I think we should consider either FlinkAbDeployment or
>>>>>>>> FlinkBlueGreenDeployment as a name and then label deployments and states
>>>>>>>> with a/b or blue/green accordingly.
>>>>>>>> 
>>>>>>>> I have a slight preference for blue/green as it sounds a bit nicer and
>>>>>>>> more descriptive, but it depends a bit on whether the concept has any
>>>>>>>> strong relation with what should be the active one (does green always
>>>>>>>> have to be the "new" one?).
>>>>>>>> 
>>>>>>>> In any case I think the proposal is pretty clear and we should go ahead
>>>>>>>> with this if there are no more discussion points from the community :)
>>>>>>>> 
>>>>>>>> I can start the vote on Monday.
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Gyula
>>>>>>>> 
>>>>>>>> On Tue, Feb 11, 2025 at 4:03 PM Sergio Chong Loo
>>>>>>>> <schong...@apple.com.invalid> wrote:
>>>>>>>> 
>>>>>>>>> Hi Gyula,
>>>>>>>>> 
>>>>>>>>> Great questions, I’ll track these topics in our docs accordingly as well.
>>>>>>>>> 
>>>>>>>>>> - What will be the naming convention for the created FlinkDeployment A/B?
>>>>>>>>>> Should we introduce some logic for the users to control this?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Currently, the controller takes the original resource name as the main
>>>>>>>>> prefix and adds the “-a” or “-b” suffixes (in an alternating fashion) to
>>>>>>>>> distinguish them. We could switch this to a numeric pattern.
>>>>>>>>> 
>>>>>>>>> We could indeed allow the user to have some control on the deployments’
>>>>>>>>> name prefixes or even the _type_ of suffixes. Thoughts?
>>>>>>>>> 
>>>>>>>>>> - Can the user "turn" an existing FlinkDeployment into a Blue / Green
>>>>>>>>>> deployment?
>>>>>>>>> 
>>>>>>>>> This is a very good idea, we could introduce a “flag” in the CRD that
>>>>>>>>> would instruct the controller to treat an existing FlinkDeployment as an
>>>>>>>>> “-a” type and proceed redeploying it as a Blue/Green instead.
>>>>>>>>> 
>>>>>>>>>> - Did you consider alternative names for this CR?
>>>>>>>>> 
>>>>>>>>> This is one of the most open topics, some other ideas were
>>>>>>>>> “Active/Standby” or “Rolling Deployments”… “Blue/Green” simply stuck a bit
>>>>>>>>> more. Any other suggestions?
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Sergio
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Feb 9, 2025, at 5:17 PM, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi Sergio!
>>>>>>>>>> 
>>>>>>>>>> I think this will be a great addition to the operator and is a feature
>>>>>>>>>> request that comes up again and again.
>>>>>>>>>> 
>>>>>>>>>> Some minor comments/questions:
>>>>>>>>>> - What will be the naming convention for the created FlinkDeployment A/B?
>>>>>>>>>> Should we introduce some logic for the users to control this?
>>>>>>>>>> - Can the user "turn" an existing FlinkDeployment into a Blue / Green
>>>>>>>>>> deployment?
>>>>>>>>>> - Did you consider alternative names for this CR?
>>>>>>>>>> 
>>>>>>>>>> Cheers,
>>>>>>>>>> Gyula
>>>>>>>>>> 
>>>>>>>>>> On Fri, Jan 24, 2025 at 6:00 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi Eric,
>>>>>>>>>>> 
>>>>>>>>>>> The link is fixed and the FLIP contains everything from the Google doc, I
>>>>>>>>>>> updated the link there as well.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks
>>>>>>>>>>> Gyula
>>>>>>>>>>> 
>>>>>>>>>>> On Fri, Jan 24, 2025 at 5:55 PM Eric Xiao
>>>>>>>>>>> <eric.x...@decodable.co.invalid> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi Sergio,
>>>>>>>>>>>> 
>>>>>>>>>>>> Can you update the Phase 1 Google Doc's sharing permissions? I also
>>>>>>>>>>>> believe the link in the FLIP leads to an internal Apple tool:
>>>>>>>>>>>> 
>>>>>>>>>>>> https://quip-apple.com/account/login?next=https%3A%2F%2Fquip-apple.com%2F7BpiAdeZ7Ow3
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, Jan 14, 2025 at 12:15 PM Sergio Chong Loo
>>>>>>>>>>>> <schong...@apple.com.invalid> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> FLIP-503:
>>>>>>>>>>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=337677648
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - Sergio
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Jan 13, 2025, at 2:39 PM, Sergio Chong Loo <schong...@apple.com> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi folks,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> As proposed in [1] we would like to more formally continue the
>>>>>>>>>>>>>> discussion to add Blue/Green deployments support to Flink via the
>>>>>>>>>>>>>> Kubernetes Operator.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> For clarity and easier review experience we’ve separated this effort
>>>>>>>>>>>>>> into 2 phases:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 1) Blue/Green Deployments for Flink on Kubernetes: Phase 1 (basic) -
>>>>>>>>>>>>>> THIS FLIP
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 2) Blue/Green Deployments for Flink on Kubernetes: Phase 2 (with
>>>>>>>>>>>>>> Coordination) - in its corresponding FLIP/email, which will follow
>>>>>>>>>>>>>> shortly
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Phase 1 Google Doc:
>>>>>>>>>>>>>> https://docs.google.com/document/d/159I9kPmHkPMNoKp7iIgntMZjrGz5J2_svOfuaNvV5HA/edit?pli=1&tab=t.0
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks everyone in advance, we’re really excited to bring this feature
>>>>>>>>>>>>>> to the community!
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> - Sergio
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> [1]
>>>>>>>> https://lists.apache.org/thread/m2sqgz455fzlvp0h9kbs1zmc5gj2s162
