Hi all, I have some questions about checkpoint and savepoint storage.

From what I understand, a distributed, production-quality job with a lot of
state should use durable shared storage for checkpoints and savepoints. All
JobManagers and TaskManagers should access the same volume, so typically
you'd use HDFS, S3, Azure Blob Storage, etc.
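For example, I'm imagining something along these lines in flink-conf.yaml
(the bucket and paths are just placeholders):

    state.checkpoints.dir: s3://my-bucket/flink/checkpoints
    state.savepoints.dir: s3://my-bucket/flink/savepoints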

The docs [1] say this about state.checkpoints.dir: "The storage path must
be accessible from all participating processes/nodes (i.e. all TaskManagers
and JobManagers)."

I want to understand why that is exactly. Here's my understanding:

1. The JobManager is responsible for cleaning up old checkpoints, so it
needs access to all the files written out by all the TaskManagers in order
to remove them.
2. For recovery/rescaling, if all nodes share the same volume, any
TaskManager can read and redistribute the checkpoint data, regardless of
which node originally wrote it (see the sketch after this list).
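To make the question concrete, here's a minimal sketch of how I picture the
path being used, assuming the programmatic CheckpointConfig API and a
placeholder S3 path:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Checkpoint every 60s; my assumption is that the JobManager (metadata,
    // cleanup of old checkpoints) and all TaskManagers (reading/writing state
    // files) resolve this same shared URI.
    env.enableCheckpointing(60_000);
    env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/flink/checkpoints");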

Is that correct? Are there other reasons why the directory must be shared
across all the processes?

Thank you,
Rob Young

[1]
https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/datastream/fault-tolerance/checkpointing/#related-config-options
