Hi all,

I have some questions about checkpoint and savepoint storage. From what I understand, a distributed, production-quality job with a lot of state should use durable shared storage for checkpoints and savepoints, and all JobManagers and TaskManagers should access the same volume. So typically you'd use HDFS, S3, Azure Blob Storage, etc.
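For concreteness, here is roughly the kind of setup I have in mind, as a minimal sketch of flink-conf.yaml (the bucket name and paths are just placeholders I made up):

    # Shared, durable storage reachable from every JobManager and TaskManager
    state.backend: rocksdb
    state.checkpoints.dir: s3://my-bucket/flink/checkpoints
    state.savepoints.dir: s3://my-bucket/flink/savepoints
    execution.checkpointing.interval: 60s

i.e. every process resolves the same s3:// URIs rather than writing to node-local disk.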
In the docs [1] it states for state.checkpoints.dir: "The storage path must be accessible from all participating processes/nodes (i.e. all TaskManagers and JobManagers)." I want to understand why that is exactly. Here's my understanding:

1. The JobManager is responsible for cleaning up old checkpoints, so it needs access to all the files written out by all the TaskManagers in order to remove them.
2. For recovery/rescaling, if all nodes share the same volume then TaskManagers can easily read and redistribute the checkpoint data, since the volume is shared.

Is that correct? Are there more aspects to why the directory must be shared across processes?

Thank you,
Rob Young

1. https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/datastream/fault-tolerance/checkpointing/#related-config-options