To add a bit more detail on why every JVM process in a Flink job graph (all JobManagers and TaskManagers) needs access to the same persistent object store for consistent recovery:

*JobManager:*
Flink's JobManager writes critical metadata during checkpoints for fault tolerance:
- Job Configuration: preserves job settings (parallelism, state backend) for consistent restarts.
- Progress Information: stores offsets (source/sink positions) so processing can resume from the correct point after a failure.
- Checkpoint Counters: provides metrics (ID, timestamp, duration) for monitoring checkpointing behavior.

*TaskManagers:*
While the JobManager handles checkpoint metadata, the TaskManagers are the workhorses during Flink checkpoints. Here's what they do:
- State Snapshots: upon receiving checkpoint instructions, TaskManagers capture snapshots of their current state. This state includes in-memory data and operator variables crucial for resuming processing.
- State Serialization: the captured state is transformed into a format suitable for storage, often byte arrays. This serialized data represents the actual application state.

Good network bandwidth is crucial here: every TaskManager has to upload its operator state to the HDFS/S3 object store, and the faster those uploads complete, the faster the checkpoint finishes. Some customers use NFS as the persistent store, which is not recommended because NFS is slow and slows down checkpointing. (A minimal configuration sketch follows at the end of this mail, below the quoted thread.)

-A

On Wed, Mar 27, 2024 at 7:52 PM Feifan Wang <zoltar9...@163.com> wrote:

> Hi Robert :
>
> Your understanding are right !
> Add some more information : JobManager not only responsible for cleaning
> old checkpoints, but also needs to write metadata file to checkpoint
> storage after all taskmanagers have taken snapshots.
>
> -------------------------------
> Best
> Feifan Wang
>
> At 2024-03-28 06:30:54, "Robert Young" <robertyoun...@gmail.com> wrote:
>
> Hi all, I have some questions about checkpoint and savepoint storage.
>
> From what I understand a distributed, production-quality job with a lot of
> state should use durable shared storage for checkpoints and savepoints. All
> job managers and task managers should access the same volume. So typically
> you'd use hadoop, S3, Azure etc.
>
> In the docs [1] it states for state.checkpoints.dir: "The storage path
> must be accessible from all participating processes/nodes (i.e. all
> TaskManagers and JobManagers)."
>
> I want to understand why that is exactly. Here's my understanding:
>
> 1. The JobManager is responsible for cleaning old checkpoints, so it needs
> access to all the files written out by all the task managers so it can
> remove them.
> 2. For recovery/rescaling, if all nodes share the same volume then
> TaskManagers can read/redistribute the checkpoint data easily, since the
> volume is shared.
>
> Is that correct? Are there more aspects to why the directory must be
> shared across the processes?
>
> Thank you,
> Rob Young
>
> 1.
> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/datastream/fault-tolerance/checkpointing/#related-config-options
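
PS: here is a minimal sketch of pointing a job at shared, durable checkpoint storage via the DataStream API. The bucket/path and the 60s interval below are placeholder values, not something from this thread; the same location can also be set cluster-wide with state.checkpoints.dir in the Flink configuration.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SharedCheckpointStorageSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 60 seconds (placeholder interval).
        env.enableCheckpointing(60_000);

        // Point checkpoint storage at a path every JobManager and TaskManager can reach
        // (placeholder bucket). TaskManagers upload their serialized operator state here,
        // and the JobManager writes the checkpoint's _metadata file to the same location.
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/flink-checkpoints");

        // ... define sources/operators/sinks and call env.execute() ...
    }
}

The equivalent flink-conf.yaml entry would be: state.checkpoints.dir: s3://my-bucket/flink-checkpoints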