Hi Alex,

Thanks for the response!

Yes, we did consider the "Application upgrade rollbacks (Experimental)" feature. However, we decided not to use it, mainly for two reasons:

1. We wanted the flexibility to run our own custom verification logic after deployment.
2. The "experimental" label made us concerned about potential instability in production environments.

Regarding the blue-green deployment feature: as far as I know, it hasn’t been implemented yet. Please correct me if I’m wrong! Do you know if it's getting close to being ready?

Also, based on what I described, do you think our current approach makes sense? Are there any pitfalls you think we might be missing?

Thanks again for your help!

On Sun, Apr 27, 2025 at 9:48 PM Alex Nitavsky <alexnitav...@gmail.com> wrote:

> Hey,
>
> Did you consider using the Apache operator rollback feature? It can
> probably cover the basic verification needs. Generally, I would consider
> improving the Apache operator rollback mechanism if it is not
> sufficient.
>
> If not, it is worth checking the feature request for blue-green
> deployment in the operator. We rely on a similar in-house mechanism to
> make more complex verifications.
>
> Regards
> Alex
>
> On Sun, 27 Apr 2025 at 20:44, Ehud Lev <ehud....@forter.com> wrote:
>
>> Hi Flink users,
>>
>> We have a few Flink topologies running in production, managed by the
>> Flink Kubernetes Operator, and we typically deploy using ArgoCD.
>>
>> Occasionally, we encounter bad deployments and need to roll back. When
>> the job state is not critical, we usually delete the state and restart the
>> Flink job, relying on Kafka to manage the offsets. In some cases, we
>> roll back to a specific savepoint, but managing savepoints manually has been
>> difficult and error-prone.
>>
>> To improve this, we built a deployment verification and rollback
>> automation using GitHub Actions and ArgoCD APIs.
>> Here's the high-level flow:
>>
>>    - Read the current (previous) deployment information (savepoint
>>      location, version, revision, etc.).
>>    - Trigger a new deployment using ArgoCD, with a postSync job that runs
>>      topology-specific verification scripts.
>>    - Check whether the deployment succeeded or failed.
>>    - If successful:
>>       - Send a Slack notification with deployment details.
>>    - If failed:
>>       - Capture the new savepoint created during the failed deployment.
>>       - Verify that this savepoint is different from the previous one.
>>       - Automatically roll back by patching the deployment to use the
>>         previous stable savepoint.
>>       - Send a Slack notification about the rollback.
>>
>> The postSync job also includes some custom validation logic for each
>> topology.
>>
>> *My questions:*
>>
>>    - Does this approach make sense?
>>    - Is this considered a bad practice?
>>    - Has anyone else built something similar or solved deployment
>>      verification and rollback in a different way?
>>
>> Would love to hear your thoughts and any lessons learned.
>>
>> Thanks!
>> --
>> Ehud Lev, Staff Engineer
>>
>>

--
Ehud Lev, Staff Engineer
email: ehud....@forter.com
web: www.forter.com
mobile: 052-5832253
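
For anyone following the thread, the decision step of the flow quoted above could be sketched roughly as below. This is only an illustration under stated assumptions, not the actual automation: `DeploymentInfo`, `Plan`, `plan_after_deploy`, and the example savepoint paths are hypothetical names, and the "manual_review" outcome for the case where the savepoint check does not pass is an assumption, since the thread does not say what happens then. Calls to ArgoCD, the Flink Kubernetes Operator, and Slack are deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeploymentInfo:
    """Snapshot of the previous deployment (hypothetical field names)."""
    version: str
    revision: str
    savepoint: Optional[str]  # location of the last stable savepoint, if any


@dataclass(frozen=True)
class Plan:
    """What the automation should do next (hypothetical structure)."""
    action: str                  # "notify_success", "rollback", or "manual_review"
    restore_from: Optional[str]  # savepoint to restore when action == "rollback"
    message: str                 # text for the Slack notification


def plan_after_deploy(previous: DeploymentInfo,
                      verification_passed: bool,
                      new_savepoint: Optional[str]) -> Plan:
    """Decide the next step once the verification job has finished."""
    if verification_passed:
        return Plan("notify_success", None,
                    f"Deployment verified (previous version {previous.version})")

    # Failure branch: a rollback needs a known-good savepoint to return to.
    if previous.savepoint is None:
        return Plan("manual_review", None,
                    "Verification failed and no previous savepoint is recorded")

    # The flow checks that the failed deployment produced a savepoint that
    # differs from the previous one. Escalating to a human when that check
    # does not pass is an assumption of this sketch.
    if new_savepoint is None or new_savepoint == previous.savepoint:
        return Plan("manual_review", None,
                    "Verification failed and the savepoint check did not pass")

    return Plan("rollback", previous.savepoint,
                f"Rolling back to version {previous.version} "
                f"from savepoint {previous.savepoint}")


if __name__ == "__main__":
    prev = DeploymentInfo("1.4.2", "abc123", "s3://example-bucket/sp/savepoint-1")
    print(plan_after_deploy(prev, False, "s3://example-bucket/sp/savepoint-2"))
```

Keeping the decision as a pure function makes it possible to unit-test the rollback rules without touching a cluster; the side effects (patching the deployment, sending the notification) would then act on the returned `Plan`.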