Hi Community,

We’re building a Flink stretched cluster (on-prem) across 2.5 data centres, with the third acting as an arbiter hosting a single ZooKeeper instance. Checkpoints are stored in an S3-compatible storage (Dell ECS), deployed across two data centres with active-active geo-replication.
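For context, the checkpointing side of the setup boils down to roughly the following (a minimal sketch; the endpoint, bucket, checkpoint interval and retention count are placeholders rather than our real values):

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // ECS S3-compatible endpoint (placeholder URL); ECS needs path-style access.
        conf.setString("s3.endpoint", "https://ecs.example.internal:9021");
        conf.setString("s3.path.style.access", "true");
        // Keep several completed checkpoints so an older, fully replicated one
        // is still available after a fail-over.
        conf.setString("state.checkpoints.num-retained", "5");

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
        // Checkpoints land in the geo-replicated ECS bucket (placeholder path).
        env.getCheckpointConfig().setCheckpointStorage("s3://flink-checkpoints/my-job");

        // ... sources, operators and sinks of the actual job go here ...
    }
}
```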
ECS only supports asynchronous geo-replication, which means Flink’s PUT requests for checkpoint objects are acknowledged before replication has completed.

In the event of a network partition or a data centre failure, all jobs will be moved to the healthy data centre (whichever one is running the leader JobManager). However, there is a risk that the most recent checkpoint has not yet been fully replicated by ECS, in which case the JobManager might not find a consistent checkpoint to restore from.

While we can retain multiple checkpoints, Flink will only attempt to recover from the latest one referenced in ZooKeeper. Restarting jobs from an earlier (fully replicated) checkpoint is an option, but it breaks exactly-once semantics for the gap between the two checkpoints and requires manual intervention.

One idea is to query the replication status of each object before committing a checkpoint, but that would require forking Flink and accepting additional latency (a rough sketch of what I have in mind is below my signature).

Are you aware of any alternative solutions? How do you ensure DR capability and handle the risk of inconsistent checkpoints in similar setups?

Thank you in advance,
Gergely Jahn
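P.S. To make the “query the replication status” idea a bit more concrete: since I don’t know which replication-status API ECS actually exposes, the sketch below simply lists the objects of a completed checkpoint through each site’s S3 endpoint and compares keys and sizes, running as an external watchdog rather than inside Flink. Endpoints, bucket, credentials and the checkpoint path are placeholders, and whether a successful read against the second site really proves local replication (rather than ECS transparently fetching the object from the owning site) is something we would still have to confirm with Dell.

```java
import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.S3Object;

import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/**
 * Rough sketch of an out-of-band replication check: list the objects of a
 * completed checkpoint on the primary ECS site and verify that the secondary
 * site already sees the same keys and sizes. Endpoints, bucket, credentials
 * and the checkpoint prefix are placeholders.
 */
public class CheckpointReplicationCheck {

    public static void main(String[] args) {
        String bucket = "flink-checkpoints";
        String checkpointPrefix = "my-job/chk-42/";  // hypothetical checkpoint directory

        S3Client primary = clientFor("https://ecs-dc1.example.internal:9021");
        S3Client secondary = clientFor("https://ecs-dc2.example.internal:9021");

        Map<String, Long> primaryObjects = listSizes(primary, bucket, checkpointPrefix);
        Map<String, Long> secondaryObjects = listSizes(secondary, bucket, checkpointPrefix);

        boolean replicated = primaryObjects.equals(secondaryObjects);
        System.out.println("Checkpoint " + checkpointPrefix
                + (replicated ? " is visible on both sites" : " is NOT fully replicated yet"));
    }

    private static S3Client clientFor(String endpoint) {
        return S3Client.builder()
                .endpointOverride(URI.create(endpoint))
                .region(Region.of("us-east-1"))  // ECS ignores the region, but the SDK requires one
                .credentialsProvider(StaticCredentialsProvider.create(
                        AwsBasicCredentials.create("ACCESS_KEY", "SECRET_KEY")))
                .forcePathStyle(true)
                .build();
    }

    private static Map<String, Long> listSizes(S3Client s3, String bucket, String prefix) {
        Map<String, Long> sizes = new HashMap<>();
        ListObjectsV2Request request = ListObjectsV2Request.builder()
                .bucket(bucket)
                .prefix(prefix)
                .build();
        // Paginate over all objects under the checkpoint directory.
        for (S3Object object : s3.listObjectsV2Paginator(request).contents()) {
            sizes.put(object.key(), object.size());
        }
        return sizes;
    }
}
```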