Re: Deployment Rollback Pattern with Flink K8S Operator — Looking for Feedback

Ehud Lev Sun, 27 Apr 2025 12:05:25 -0700

Hi Alex,

Thanks for the response!


Yes, we did consider the "Application upgrade rollbacks (Experimental)"
feature.
However, we decided not to use it mainly for two reasons:

   1. We wanted the flexibility to run our own custom verification logic
      after deployment.
   2. The "experimental" label made us concerned about potential instability
      in production environments.

Regarding the blue-green deployment feature — as far as I know, it hasn’t
been implemented yet. Please correct me if I’m wrong!
Do you know if it's getting close to being ready?

Also, based on what I described, do you think our current approach makes
sense?
Are there any pitfalls you think we might be missing?

Thanks again for your help!

On Sun, Apr 27, 2025 at 9:48 PM Alex Nitavsky <alexnitav...@gmail.com>
wrote:

> Hey,
>
> Did you consider using the Apache operator's rollback feature? It can
> probably cover the basic verification needs. If it is not sufficient, I
> would generally consider improving the operator's rollback mechanism.
>
> If not, it is worth checking the feature request for blue-green deployments
> in the operator. We rely on a similar in-house mechanism for more complex
> verifications.
>
> Regards
> Alex
>
> On Sun, 27 Apr 2025 at 20:44, Ehud Lev <ehud....@forter.com> wrote:
>
>> Hi Flink users,
>>
>> We have a few Flink topologies running in production, managed by the
>> Flink Kubernetes Operator, and we typically deploy using ArgoCD.
>>
>> Occasionally, we encounter bad deployments and need to roll back. When
>> the job state is not critical, we usually delete the state and restart the
>> Flink job, relying on Kafka to manage the offsets. In some cases, we
>> roll back to a specific savepoint, but managing savepoints manually has been
>> difficult and error-prone.
>>
>> To improve this, we built a deployment verification and rollback
>> automation using GitHub Actions and ArgoCD APIs. Here's the high-level flow:
>>
>>    - Read the current (previous) deployment information (savepoint
>>      location, version, revision, etc.).
>>    - Trigger a new deployment using ArgoCD, with a postSync job that runs
>>      topology-specific verification scripts.
>>    - Check whether the deployment succeeded or failed.
>>    - If successful:
>>       - Send a Slack notification with deployment details.
>>    - If failed:
>>       - Capture the new savepoint created during the failed deployment.
>>       - Verify that this savepoint is different from the previous one.
>>       - Automatically roll back by patching the deployment to use the
>>         previous stable savepoint.
>>       - Send a Slack notification about the rollback.
>>
>> The postSync job also includes some custom validation logic for each
>> topology.
>>
>> *My questions:*
>>
>>    - Does this approach make sense?
>>    - Is this considered a bad practice?
>>    - Has anyone else built something similar or solved deployment
>>      verification and rollback in a different way?
>>
>> Would love to hear your thoughts and any lessons learned.
>>
>> Thanks!
>> --
>> Ehud Lev, Staff Engineer
>>
>>
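A minimal sketch of the failed-deployment branch of the flow quoted above, in
Python. It is an illustration under assumptions rather than the actual
automation: `DeploymentSnapshot` and `plan_rollback` are hypothetical names,
stopping for manual intervention when the savepoint check fails is an
assumption (the thread does not say what happens in that case), and the
`initialSavepointPath` / `savepointRedeployNonce` fields should be checked
against the FlinkDeployment reference for the operator version in use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeploymentSnapshot:
    """Deployment info read before the new rollout starts (hypothetical shape)."""
    savepoint_path: str
    revision: str


def plan_rollback(previous: DeploymentSnapshot,
                  verification_passed: bool,
                  failed_savepoint: Optional[str],
                  redeploy_nonce: int) -> Optional[dict]:
    """Return a merge-patch body for the FlinkDeployment, or None if no rollback is needed."""
    if verification_passed:
        return None

    # Step from the flow: the savepoint captured during the failed deployment
    # must differ from the previous one. What to do otherwise is NOT specified
    # in the thread; raising here is an assumption of this sketch.
    if not failed_savepoint or failed_savepoint == previous.savepoint_path:
        raise RuntimeError(
            "no distinct savepoint from the failed deployment; manual check needed"
        )

    # Assumed field names (verify against the operator's CRD reference):
    # initialSavepointPath selects the savepoint to start from, and a changed
    # savepointRedeployNonce asks the operator to redeploy from that path.
    return {
        "spec": {
            "job": {
                "initialSavepointPath": previous.savepoint_path,
                "savepointRedeployNonce": redeploy_nonce,
            }
        }
    }


if __name__ == "__main__":
    prev = DeploymentSnapshot(savepoint_path="s3://bucket/savepoints/sp-1",
                              revision="abc123")
    print(plan_rollback(prev, False, "s3://bucket/savepoints/sp-2", 2))
```

The returned dict would be applied to the FlinkDeployment resource by whatever
client the pipeline already uses; that step, and the Slack notification, are
left out here.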

-- 
Ehud Lev, Staff Engineer
email: ehud....@forter.com  web: www.forter.com
mobile: 052-5832253
