Hi Community,

We’re building a Flink stretched cluster (on-prem) across 2.5 data centres, with the third acting as an arbiter hosting a single ZooKeeper instance. Checkpoints are stored in an S3-compatible storage (Dell ECS), deployed across two data centres with active-active geo-replication.
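For context, the checkpointing side of the setup boils down to roughly the following (a minimal sketch; the endpoint, bucket, checkpoint interval and retention count are placeholders rather than our real values):

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // ECS S3-compatible endpoint (placeholder URL); ECS needs path-style access.
        conf.setString("s3.endpoint", "https://ecs.example.internal:9021");
        conf.setString("s3.path.style.access", "true");
        // Keep several completed checkpoints so an older, fully replicated one
        // is still available after a fail-over.
        conf.setString("state.checkpoints.num-retained", "5");

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
        // Checkpoints land in the geo-replicated ECS bucket (placeholder path).
        env.getCheckpointConfig().setCheckpointStorage("s3://flink-checkpoints/my-job");

        // ... sources, operators and sinks of the actual job go here ...
    }
}
```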
ECS only supports asynchronous geo-replication, which means Flink’s PUT requests for checkpoint objects are acknowledged before replication has completed.

In the event of a network partition or a data centre failure, all jobs will be moved to the healthy data centre (whichever one is running the leader JobManager). However, there is a risk that the most recent checkpoint has not yet been fully replicated by ECS, in which case the JobManager might not find a consistent checkpoint to restore from.

While we can retain multiple checkpoints, Flink will only attempt to recover from the latest one referenced in ZooKeeper. Restarting jobs from an earlier (fully replicated) checkpoint is an option, but it breaks exactly-once semantics for the gap between the two checkpoints and requires manual intervention.

One idea is to query the replication status of each object before committing a checkpoint, but that would require forking Flink and accepting additional latency (a rough sketch of what I have in mind is below my signature).

Are you aware of any alternative solutions? How do you ensure DR capability and handle the risk of inconsistent checkpoints in similar setups?

Thank you in advance,
Gergely Jahn
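P.S. To make the “query the replication status” idea a bit more concrete: since I don’t know which replication-status API ECS actually exposes, the sketch below simply lists the objects of a completed checkpoint through each site’s S3 endpoint and compares keys and sizes, running as an external watchdog rather than inside Flink. Endpoints, bucket, credentials and the checkpoint path are placeholders, and whether a successful read against the second site really proves local replication (rather than ECS transparently fetching the object from the owning site) is something we would still have to confirm with Dell.

```java
import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.S3Object;

import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/**
 * Rough sketch of an out-of-band replication check: list the objects of a
 * completed checkpoint on the primary ECS site and verify that the secondary
 * site already sees the same keys and sizes. Endpoints, bucket, credentials
 * and the checkpoint prefix are placeholders.
 */
public class CheckpointReplicationCheck {

    public static void main(String[] args) {
        String bucket = "flink-checkpoints";
        String checkpointPrefix = "my-job/chk-42/";  // hypothetical checkpoint directory

        S3Client primary = clientFor("https://ecs-dc1.example.internal:9021");
        S3Client secondary = clientFor("https://ecs-dc2.example.internal:9021");

        Map<String, Long> primaryObjects = listSizes(primary, bucket, checkpointPrefix);
        Map<String, Long> secondaryObjects = listSizes(secondary, bucket, checkpointPrefix);

        boolean replicated = primaryObjects.equals(secondaryObjects);
        System.out.println("Checkpoint " + checkpointPrefix
                + (replicated ? " is visible on both sites" : " is NOT fully replicated yet"));
    }

    private static S3Client clientFor(String endpoint) {
        return S3Client.builder()
                .endpointOverride(URI.create(endpoint))
                .region(Region.of("us-east-1"))  // ECS ignores the region, but the SDK requires one
                .credentialsProvider(StaticCredentialsProvider.create(
                        AwsBasicCredentials.create("ACCESS_KEY", "SECRET_KEY")))
                .forcePathStyle(true)
                .build();
    }

    private static Map<String, Long> listSizes(S3Client s3, String bucket, String prefix) {
        Map<String, Long> sizes = new HashMap<>();
        ListObjectsV2Request request = ListObjectsV2Request.builder()
                .bucket(bucket)
                .prefix(prefix)
                .build();
        // Paginate over all objects under the checkpoint directory.
        for (S3Object object : s3.listObjectsV2Paginator(request).contents()) {
            sizes.put(object.key(), object.size());
        }
        return sizes;
    }
}
```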