To add a bit more detail on why every JVM process in a Flink job graph (all JobManagers and TaskManagers) needs access to the same persistent object store for consistent recovery:

*JobManager:*
Flink's JobManager writes critical metadata during checkpoints for fault tolerance:
- Job Configuration: preserves job settings (parallelism, state backend) for consistent restarts.
- Progress Information: stores offsets (source/sink positions) so processing can resume from the correct point after a failure.
- Checkpoint Counters: provides metrics (ID, timestamp, duration) for monitoring checkpointing behavior.

*TaskManagers:*
While the JobManager handles checkpoint metadata, the TaskManagers are the workhorses during Flink checkpoints. Here's what they do:
- State Snapshots: upon receiving checkpoint instructions, TaskManagers capture snapshots of their current state. This state includes in-memory data and operator variables crucial for resuming processing.
- State Serialization: the captured state is transformed into a format suitable for storage, often byte arrays. This serialized data represents the actual application state.

Good network bandwidth is crucial here: every TaskManager has to upload its operator state to the HDFS/S3 object store, and the faster those uploads complete, the faster the checkpoint finishes. Some customers use NFS as the persistent store, which is not recommended because NFS is slow and slows down checkpointing. (A minimal configuration sketch follows at the end of this mail, below the quoted thread.)

-A

On Wed, Mar 27, 2024 at 7:52 PM Feifan Wang <zoltar9...@163.com> wrote:

> Hi Robert :
>
> Your understanding are right !
> Add some more information : JobManager not only responsible for cleaning
> old checkpoints, but also needs to write metadata file to checkpoint
> storage after all taskmanagers have taken snapshots.
>
> -------------------------------
> Best
> Feifan Wang
>
> At 2024-03-28 06:30:54, "Robert Young" <robertyoun...@gmail.com> wrote:
>
> Hi all, I have some questions about checkpoint and savepoint storage.
>
> From what I understand a distributed, production-quality job with a lot of
> state should use durable shared storage for checkpoints and savepoints. All
> job managers and task managers should access the same volume. So typically
> you'd use hadoop, S3, Azure etc.
>
> In the docs [1] it states for state.checkpoints.dir: "The storage path
> must be accessible from all participating processes/nodes (i.e. all
> TaskManagers and JobManagers)."
>
> I want to understand why that is exactly. Here's my understanding:
>
> 1. The JobManager is responsible for cleaning old checkpoints, so it needs
> access to all the files written out by all the task managers so it can
> remove them.
> 2. For recovery/rescaling, if all nodes share the same volume then
> TaskManagers can read/redistribute the checkpoint data easily, since the
> volume is shared.
>
> Is that correct? Are there more aspects to why the directory must be
> shared across the processes?
>
> Thank you,
> Rob Young
>
> 1.
> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/datastream/fault-tolerance/checkpointing/#related-config-options
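
PS: here is a minimal sketch of pointing a job at shared, durable checkpoint storage via the DataStream API. The bucket/path and the 60s interval below are placeholder values, not something from this thread; the same location can also be set cluster-wide with state.checkpoints.dir in the Flink configuration.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SharedCheckpointStorageSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 60 seconds (placeholder interval).
        env.enableCheckpointing(60_000);

        // Point checkpoint storage at a path every JobManager and TaskManager can reach
        // (placeholder bucket). TaskManagers upload their serialized operator state here,
        // and the JobManager writes the checkpoint's _metadata file to the same location.
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/flink-checkpoints");

        // ... define sources/operators/sinks and call env.execute() ...
    }
}

The equivalent flink-conf.yaml entry would be: state.checkpoints.dir: s3://my-bucket/flink-checkpoints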