Hey,

Have you considered using the rollback feature of the Apache Flink Kubernetes Operator? It can probably cover the basic verification needs. If it is not sufficient, I would generally look at improving the operator's rollback mechanism first.
If that does not work for you, it is worth checking the feature request for blue-green deployments in the operator. We rely on a similar in-house mechanism to run more complex verifications.

Regards,
Alex

On Sun, 27 Apr 2025 at 20:44, Ehud Lev <ehud....@forter.com> wrote:

> Hi Flink users,
>
> We have a few Flink topologies running in production, managed by the Flink
> Kubernetes Operator, and we typically deploy using ArgoCD.
>
> Occasionally, we encounter bad deployments and need to roll back. When the
> job state is not critical, we usually delete the state and restart the
> Flink job, relying on Kafka to manage the offsets. In some cases, we
> rollback to a specific savepoint, but managing savepoints manually has been
> difficult and error-prone.
>
> To improve this, we built a deployment verification and rollback
> automation using GitHub Actions and ArgoCD APIs. Here's the high-level flow:
>
> - Read the current (previous) deployment information (savepoint
>   location, version, revision, etc.).
> - Trigger a new deployment using ArgoCD, with a postSync job that runs
>   topology-specific verification scripts.
> - Check whether the deployment succeeded or failed.
> - If successful:
>   - Send a Slack notification with deployment details.
> - If failed:
>   - Capture the new savepoint created during the failed deployment.
>   - Verify that this savepoint is different from the previous one.
>   - Automatically roll back by patching the deployment to use the
>     previous stable savepoint.
>   - Send a Slack notification about the rollback.
>
> The postSync job also includes some custom validation logic for each
> topology.
>
> *My questions:*
>
> - Does this approach make sense?
> - Is this considered a bad practice?
> - Has anyone else built something similar or solved deployment
>   verification and rollback in a different way?
>
> Would love to hear your thoughts and any lessons learned.
>
> Thanks!
> --
> Ehud Lev, Staff Engineer
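
For reference, the decision step of the flow quoted above could be sketched roughly as below. This is only an illustration under assumptions: `DeploymentInfo`, `Plan`, and `plan_after_deploy` are hypothetical names, not part of ArgoCD, GitHub Actions, or the Flink Kubernetes Operator, and the sketch only decides what to do. It does not call any real API, trigger a sync, or send Slack messages.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeploymentInfo:
    """Snapshot of the previous deployment (hypothetical shape)."""
    version: str
    revision: str
    savepoint: Optional[str]  # savepoint location, if one exists


@dataclass(frozen=True)
class Plan:
    """What the automation should do next (hypothetical shape)."""
    action: str                   # "notify_success" or "rollback"
    restore_from: Optional[str]   # savepoint to restore when rolling back
    new_savepoint_is_fresh: bool  # failed run left a savepoint distinct from the previous one
    slack_message: str


def plan_after_deploy(
    previous: DeploymentInfo,
    verification_passed: bool,
    new_savepoint: Optional[str],
) -> Plan:
    """Pick the next step once the postSync verification has finished."""
    if verification_passed:
        return Plan(
            action="notify_success",
            restore_from=None,
            new_savepoint_is_fresh=False,
            slack_message=f"Deployment succeeded (previous version was {previous.version}).",
        )

    # Failed deployment: check that the captured savepoint really is a new one,
    # then roll back to the previous stable savepoint.
    fresh = new_savepoint is not None and new_savepoint != previous.savepoint
    return Plan(
        action="rollback",
        restore_from=previous.savepoint,
        new_savepoint_is_fresh=fresh,
        slack_message=(
            f"Deployment failed; rolling back to version {previous.version}, "
            f"revision {previous.revision}."
        ),
    )


# Example with made-up values.
previous = DeploymentInfo(version="1.4.2", revision="abc123", savepoint="s3://example-bucket/sp-1")
plan = plan_after_deploy(previous, verification_passed=False, new_savepoint="s3://example-bucket/sp-2")
print(plan.action, plan.restore_from)
```

Keeping the decision as a pure function separate from the ArgoCD and Slack calls makes it easy to unit-test. What to do when the savepoint check fails is not specified in the quoted flow, so the sketch only reports it in `new_savepoint_is_fresh` and leaves the handling to the caller.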