Hi Flink users, We have a few Flink topologies running in production, managed by the Flink Kubernetes Operator, and we typically deploy using ArgoCD.
Occasionally, we encounter bad deployments and need to roll back. When the job state is not critical, we usually delete the state and restart the Flink job, relying on Kafka to manage the offsets. In some cases, we roll back to a specific savepoint, but managing savepoints manually has been difficult and error-prone.

To improve this, we built deployment verification and rollback automation using GitHub Actions and the ArgoCD APIs. Here's the high-level flow:

- Read the current (previous) deployment information (savepoint location, version, revision, etc.).
- Trigger a new deployment using ArgoCD, with a postSync job that runs topology-specific verification scripts.
- Check whether the deployment succeeded or failed.
- If successful:
  - Send a Slack notification with deployment details.
- If failed:
  - Capture the new savepoint created during the failed deployment.
  - Verify that this savepoint is different from the previous one.
  - Automatically roll back by patching the deployment to use the previous stable savepoint.
  - Send a Slack notification about the rollback.

The postSync job also includes custom validation logic for each topology.

*My questions:*

- Does this approach make sense?
- Is this considered a bad practice?
- Has anyone else built something similar, or solved deployment verification and rollback in a different way?

Would love to hear your thoughts and any lessons learned.

Thanks!

--
Ehud Lev, Staff Engineer
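For readers following along, the failure branch of a flow like this could be sketched roughly as below. This is a minimal illustration, not the poster's actual implementation: the deployment name, savepoint paths, and the use of `spec.job.initialSavepointPath` plus `savepointRedeployNonce` (a Flink Kubernetes Operator field, available in operator 1.6+) are all assumptions about the setup.

```python
# Hypothetical sketch of the rollback branch: only roll back when the
# failed deployment produced a *different* savepoint than the previous
# stable one, then patch the FlinkDeployment back to that savepoint.
import subprocess


def should_rollback(previous_savepoint: str, new_savepoint: str) -> bool:
    """Return True only if the failed deployment's savepoint differs from
    the previous stable one; otherwise a rollback would just restore the
    state we are trying to back out of."""
    return new_savepoint.rstrip("/") != previous_savepoint.rstrip("/")


def rollback(deployment: str, previous_savepoint: str) -> None:
    # Illustrative patch against the Flink Kubernetes Operator CRD:
    # bumping savepointRedeployNonce (operator >= 1.6) asks the operator
    # to redeploy the job from initialSavepointPath.
    patch = (
        '{"spec":{"job":{'
        f'"initialSavepointPath":"{previous_savepoint}",'
        '"savepointRedeployNonce":2}}}'
    )
    subprocess.run(
        ["kubectl", "patch", "flinkdeployment", deployment,
         "--type=merge", "-p", patch],
        check=True,
    )
```

The trailing-slash normalization in `should_rollback` is just one way to avoid a false mismatch when the same savepoint location is reported with slightly different path formatting.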