Jufang He created FLINK-37375:
---------------------------------

             Summary: Checkpoint supports the Operator to customize 
asynchronous snapshot state
                 Key: FLINK-37375
                 URL: https://issues.apache.org/jira/browse/FLINK-37375
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
    Affects Versions: 1.20.1
            Reporter: Jufang He


Some Flink operators perform slow operations, such as file uploads or data 
flushes, during the synchronous phase of a checkpoint. Because this phase runs 
on the task's main thread, a slow external storage system can make it take a 
long time and significantly reduce the job's throughput. For example, in our 
internal use of Paimon, files are uploaded to HDFS during the synchronous 
phase, and intermittent slow HDFS nodes caused a substantial drop in task 
throughput.
To address this, I propose a generic mechanism that lets an operator customize 
the asynchronous part of its snapshot, so that users can move time-consuming 
logic to the asynchronous phase of the checkpoint. This minimizes blocking of 
the main thread and improves task throughput. For instance, the Paimon writer 
operator could write data to local disk during the synchronous phase and 
upload the files to remote storage during the asynchronous phase. Other 
operators may also run slow logic in the synchronous phase, and this approach 
aims to address such cases uniformly.
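
The split described above can be illustrated with a minimal, plain-Java 
sketch. It is not the proposed Flink API and uses no Flink classes: the class 
TwoPhaseSnapshotSketch, its methods, and the "chk-<id>.data" file naming are 
all hypothetical, and a local directory stands in for remote storage such as 
HDFS. It only shows the idea that the synchronous phase does a cheap local 
write and hands the slow upload to another thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

/** Hypothetical operator-side helper for illustration; not a Flink API. */
final class TwoPhaseSnapshotSketch {
    private final Path localDir;
    private final Path remoteDir; // assumption: stands in for remote storage (e.g. HDFS)
    private final ExecutorService asyncExecutor;
    private final List<String> buffer = new ArrayList<>();

    TwoPhaseSnapshotSketch(Path localDir, Path remoteDir, ExecutorService asyncExecutor) {
        this.localDir = localDir;
        this.remoteDir = remoteDir;
        this.asyncExecutor = asyncExecutor;
    }

    /** Normal record processing on the main thread. */
    void processElement(String record) {
        buffer.add(record);
    }

    /**
     * Synchronous phase: runs on the caller's (main) thread and only writes
     * the buffered records to a local file. The returned future completes
     * once the asynchronous phase has copied that file to the remote dir.
     */
    CompletableFuture<Path> snapshot(long checkpointId) throws IOException {
        Path localFile = localDir.resolve("chk-" + checkpointId + ".data");
        Files.write(localFile, buffer);
        buffer.clear();
        // Asynchronous phase: the slow upload runs on the executor thread.
        return CompletableFuture.supplyAsync(() -> upload(localFile), asyncExecutor);
    }

    private Path upload(Path localFile) {
        try {
            Path target = remoteDir.resolve(localFile.getFileName());
            return Files.copy(localFile, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            // Surface the failure through the future so the caller can fail the checkpoint.
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch the caller would treat the checkpoint as complete only after 
the returned future finishes, and as failed if the future completes 
exceptionally. How a real operator would report that result to the checkpoint 
coordinator is left to the design document.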


I drafted a FLIP for this issue: 
[https://docs.google.com/document/d/1lwxLEQjD6jVhZUBMRGhzQNWKSvdbPbYNQsV265gR4kw]
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
