Hi Flink users,

We have a few Flink topologies running in production, managed by the Flink
Kubernetes Operator, and we typically deploy using ArgoCD.

Occasionally, we encounter bad deployments and need to roll back. When the
job state is not critical, we usually delete the state and restart the
Flink job, relying on Kafka to manage the offsets. In some cases, we roll
back to a specific savepoint, but managing savepoints manually has been
difficult and error-prone.

To improve this, we built a deployment verification and rollback automation
using GitHub Actions and ArgoCD APIs. Here's the high-level flow:

   - Read the current (previous) deployment information (savepoint
   location, version, revision, etc.).

   - Trigger a new deployment using ArgoCD, with a postSync job that runs
   topology-specific verification scripts.

   - Check whether the deployment succeeded or failed.

   - If successful:
      - Send a Slack notification with deployment details.

   - If failed:
      - Capture the new savepoint created during the failed deployment.
      - Verify that this savepoint is different from the previous one.
      - Automatically roll back by patching the deployment to use the
      previous stable savepoint.
      - Send a Slack notification about the rollback.
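To make the failure branch concrete, here is a rough sketch of the capture-and-rollback logic, not our exact script. The deployment name and the CRD field paths (`.status.jobStatus.savepointInfo.lastSavepoint.location`, `spec.job.initialSavepointPath`) are assumptions that should be verified against your Flink Kubernetes Operator version:

```shell
#!/usr/bin/env bash
# Hypothetical rollback sketch -- names and field paths are assumptions;
# check them against your Flink Kubernetes Operator CRD version.
set -euo pipefail

# Latest savepoint the operator has recorded for a FlinkDeployment.
current_savepoint() {
  kubectl get flinkdeployment "$1" \
    -o jsonpath='{.status.jobStatus.savepointInfo.lastSavepoint.location}'
}

# Only roll back if the failed deployment actually produced a savepoint
# distinct from the last known-good one (the "verify that this savepoint
# is different from the previous one" step above).
should_rollback() {
  local previous="$1" new="$2"
  [[ -n "$new" && "$new" != "$previous" ]]
}

# Patch the deployment back to the previous stable savepoint.
rollback_to() {
  local deployment="$1" savepoint="$2"
  kubectl patch flinkdeployment "$deployment" --type merge -p "
spec:
  job:
    upgradeMode: savepoint
    initialSavepointPath: ${savepoint}
"
}
```

In the workflow, `should_rollback` guards the patch so the automation never restarts the job without first confirming the savepoint bookkeeping moved forward.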

The postSync job also includes some custom validation logic for each
topology.
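As an illustration of the kind of check such a postSync job can run, here is a minimal hypothetical verification sketch: it polls the FlinkDeployment until the job reports RUNNING and fails the hook (and hence the sync) otherwise. The `.status.jobStatus.state` field path and the `RUNNING` value are assumptions to verify against your operator version; topology-specific checks (consumer lag, output sanity, etc.) would follow this baseline check:

```shell
#!/usr/bin/env bash
# Hypothetical postSync verification sketch; the status field path and
# state value are assumptions to check against your operator version.
set -euo pipefail

# A deployment is considered healthy once the Flink job reports RUNNING.
is_healthy() {
  [[ "$1" == "RUNNING" ]]
}

job_state() {
  kubectl get flinkdeployment "$1" -o jsonpath='{.status.jobStatus.state}'
}

# Poll until healthy or the timeout expires; a non-zero exit fails the
# ArgoCD postSync hook Job and marks the sync as failed.
wait_for_running() {
  local name="$1" timeout_secs="${2:-300}"
  local deadline=$(( $(date +%s) + timeout_secs ))
  while (( $(date +%s) < deadline )); do
    if is_healthy "$(job_state "$name")"; then
      return 0
    fi
    sleep 10
  done
  echo "verification failed: $name not RUNNING within ${timeout_secs}s" >&2
  return 1
}
```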

*My questions:*

   - Does this approach make sense?

   - Is this considered a bad practice?

   - Has anyone else built something similar or solved deployment
   verification and rollback in a different way?

Would love to hear your thoughts and any lessons learned.

Thanks!
-- 
Ehud Lev, Staff Engineer
