Query: Geo-Redundancy with Apache Flink on Kubernetes & Replicated Checkpoints !

Sachin Thu, 03 Jul 2025 00:16:57 -0700

Dear Apache Flink Community,

I hope you're doing well.


We are currently operating a Flink deployment on *Kubernetes*, with *high
availability (HA) configured using Kubernetes-based HA services*. We're
exploring approaches for *geo-redundancy (GR)* to ensure disaster recovery
and fault tolerance across regions or clusters.

I’d like to seek the community’s insights on the following points:

*1. General Support for Geo-Redundancy in Flink*

Is there any established or recommended approach in Flink to support
*geo-redundancy* — particularly for stateful jobs with RocksDB state
backend and checkpointing to a remote object store (like S3)?

In our scenario, we plan to maintain *two geo-distributed Flink clusters*,
and in the event of a disaster, bring up the same Flink job on the standby
(DR) cluster using a *retained checkpoint* from a *replicated S3 bucket*.

Are there any known *best practices*, *pitfalls*, or *limitations* with
respect to using this model for disaster recovery?


*2. Using Replicated S3 for HA Checkpoints Across Clusters*

We are currently evaluating the use of *cross-region replicated
S3-compatible storage* to store checkpoints. Our intention is to:

   - Enable *automatic or manual failover* by starting the same Flink job on
     the GR cluster from the replicated checkpoint.
   - Use *incremental RocksDB state*, with *exactly-once semantics* ensured
     via Kafka source/sink integration.

A key concern here is:

*If the most recent checkpoint is not yet fully replicated to the second S3
cluster when a failover happens, how can we safely restore while
maintaining exactly-once semantics?*

Is there any recommendation or mechanism (either available in Flink or
implemented externally) to:

   - Delay marking a checkpoint as *complete/committed* until its files are
     successfully replicated, or
   - Restore from the *latest fully replicated and durable checkpoint* only,
     with confidence that state and offsets are consistent? In such a case,
     what would happen to the exactly-once semantics?
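
To make the second option concrete, the selection step could be sketched
externally along the following lines. This is only an illustration under
stated assumptions, not an existing Flink feature: it relies on Flink's
checkpoint layout (a `chk-<n>/_metadata` file per completed checkpoint, plus
a `shared/` directory for incremental RocksDB files), and it assumes a
hypothetical per-checkpoint manifest (`required_files`) that some external
process would have to produce. The function name and parameters are
likewise made up for the example.

```python
import re
from typing import Dict, Iterable, Optional, Set

# Flink writes each checkpoint under <base>/<job-id>/chk-<n>/ and finalizes it
# by writing a _metadata file in that directory.
CHK_METADATA = re.compile(r"(?:^|/)chk-(\d+)/_metadata$")


def latest_restorable_checkpoint(
    replica_keys: Iterable[str],
    required_files: Dict[int, Set[str]],
) -> Optional[int]:
    """Return the highest checkpoint id that is fully present in the replica.

    replica_keys:   object keys currently listed in the DR bucket.
    required_files: HYPOTHETICAL manifest, checkpoint id -> every key that
                    checkpoint references (including shared/ files reused by
                    incremental checkpoints). Flink does not emit this.
    """
    present = set(replica_keys)
    # Checkpoint ids whose _metadata object has arrived, newest first.
    candidates = sorted(
        (int(m.group(1)) for m in map(CHK_METADATA.search, present) if m),
        reverse=True,
    )
    for chk_id in candidates:
        needed = required_files.get(chk_id)
        # Without a manifest entry completeness cannot be proven, so skip.
        if needed is not None and needed <= present:
            return chk_id
    return None
```

The presence of `_metadata` alone is deliberately not treated as sufficient
here, since cross-region replication does not necessarily copy objects in
the order they were written.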

We would be grateful for any guidance, references, or community experience
around this type of architecture.


Thanks in advance for your time and support.
Looking forward to your insights.

Best regards,
*Sachin*
